Kepler: the first data scientist

 


What can one of the greatest geniuses in physics teach us about data analysis and artificial intelligence?

by Francisco Rodrigues, University of São Paulo

When Johannes Kepler (1571–1630) arrived at Benátky Castle near Prague in 1600 to work with the Danish astronomer Tycho Brahe (1546–1601), he had no idea of the wealth of data that awaited him. Although Brahe was considered one of the greatest astronomers in history, renowned for his attention to detail and his quest for precision in measuring the positions and movements of celestial bodies, he was extremely competitive and jealous of other astronomers. He also had a difficult temper and was prone to fits of rage. Nevertheless, Kepler accepted the challenge of working with Brahe, believing that the rewards would outweigh the challenges.

However, just 18 months after Kepler began working at the observatory, Tycho Brahe died suddenly in Prague on 24 October 1601, probably from a urinary infection. According to historians, before he died Tycho asked Kepler: ‘Do not let my efforts be in vain’. Brahe recognised Kepler’s extraordinary talent and wanted to ensure that his decades of meticulous observations would not go to waste. These observations, made with the naked eye, were an extraordinary feat as the telescope had not yet been pointed at the night sky by Galileo.

Tycho Brahe provided the observations that made Kepler’s discoveries possible. Note that Brahe wore an artificial nose, having lost the upper part of his nose in a duel with another nobleman over a mathematical dispute in 1566.

With full access to the data, Kepler saw an opportunity to confirm Copernicus’s idea that the Earth was just a planet moving around the Sun. He had been introduced to Copernican theory by his teacher and mentor, Michael Maestlin, one of the few 16th-century astronomers who fully supported the heliocentric hypothesis. When Kepler joined the University of Tübingen in 1589, Copernicus’s ideas were taboo in Lutheran intellectual circles. Like the Catholic Church, Martin Luther categorically rejected the idea that the Earth revolved around the Sun. The prevailing belief at the time was that the Earth was stationary at the centre of the universe, with the Sun and the planets revolving around it.

‘After long investigations I finally became convinced that the Sun is a fixed star surrounded by planets that revolve around it, and that it is the centre and the flame.’ — Nicolaus Copernicus

Kepler was convinced that the planets orbited the Sun in circular orbits, as Copernicus had proposed. Although Copernicus’s predictions didn’t fit perfectly with the observations available at the time, Kepler was confident that this was a problem with the data, not with the heliocentric model itself. He was therefore sure that Brahe’s measurements would reveal circular orbits and perfect geometric patterns. Since the time of Pythagoras, it had been widely accepted that the universe was governed by harmonious mathematical laws, with circles symbolising perfection and symmetry. For Kepler, if God had created the universe, it was natural to expect it to reflect these perfect forms, in accordance with the divine order.

‘All my life I’ve wanted to be a theologian. I suffered a lot with this unexpected change of direction. But now I have finally realised that I can praise God in another way, through my work in astronomy.’ — Kepler, in a letter to his tutor Maestlin

Kepler began to analyse the data collected by Tycho Brahe and, using the mathematics available at the time, fitted a mathematical model that could predict the orbit of Mars. To his surprise, the orbit was not circular, as he had expected, but elliptical. An ellipse is a flattened, elongated circle, a shape that profoundly contradicted his expectations. Influenced by his Pythagorean and religious beliefs, Kepler had hoped to find a more harmonious and symmetrical solution.

Not satisfied with this result, Kepler decided to analyse the data of other planets such as Mercury, Venus and Jupiter. Once again, he found that their orbits were elliptical. Reluctantly, he accepted what the data showed and formulated his first law, the law of elliptical orbits, which states that the planets move in elliptical orbits with the Sun at one of the foci. Although imperfect, Kepler’s discovery showed that the universe could be described by mathematics.

“The immense usefulness of mathematics in the natural sciences borders on the mysterious, and there is no rational explanation for it.” — Eugene Wigner.

Kepler continued to analyse the data and, through trial and error, discovered two other fundamental laws. The second, called the law of areas, states that the line connecting a planet to the Sun sweeps out equal areas in equal intervals of time, which implies that planets move faster when they are closer to the Sun. The third law, the law of periods, states that the square of a planet’s orbital period is proportional to the cube of its mean distance from the Sun, giving a precise mathematical relationship between the two quantities. To Kepler, this was an indication that the universe, in all its immensity, follows a perfect melody written in mathematical equations.
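
To make the third law concrete, here is a short worked example. The symbols are my own notation, not Kepler’s: T is the orbital period, a the mean distance from the Sun, and k the constant of proportionality. Measuring T in years and a in astronomical units makes k equal to 1, because the Earth has T = 1 and a = 1. The value used for Mars is an approximate modern figure, not one of Brahe’s measurements.

```latex
% Third law: the square of the period is proportional to the cube of the mean distance
\[
  T^{2} = k\,a^{3}
  \quad\Longrightarrow\quad
  T = \sqrt{k}\;a^{3/2}
\]

% With T in years and a in astronomical units, the Earth (T = 1, a = 1) gives k = 1.
% Worked example for Mars, using a \approx 1.524 AU:
\[
  T_{\text{Mars}} \approx 1.524^{3/2} \approx 1.88\ \text{years}
\]
```

The result agrees with the observed Martian year of roughly 1.88 Earth years.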

Illustration of Kepler’s second law: the line connecting a planet to the Sun sweeps over equal areas at equal intervals, implying that planets move faster when they are closer to the Sun. Source: Wikipedia.

Although they demonstrated the regularity of the universe, which could be described mathematically, these laws again did not reflect the divine perfection that Kepler sought. They showed that the planets did not move around the Sun at a constant speed, but that their speed varied according to their position in their elliptical orbits. Despite his religious beliefs, Kepler accepted the data as reflecting the natural order of the universe. These three laws were later fundamental to Isaac Newton’s formulation of the law of universal gravitation, which linked the motion of the planets to the force of gravity.

Kepler also searched for perfect geometric patterns in the distances of the planets from the Sun, an idea he had already published in his Mysterium Cosmographicum (1596), before he joined Brahe. He believed that there was a geometric harmony that reflected divine perfection in the creation of the universe. After exploring various possibilities, he finally found a pattern. In three spatial dimensions, Kepler knew that there were only five perfect solids, solids whose faces are regular and congruent polygons (identical in shape and size, with all angles and sides equal) and whose vertices all look alike. For example, the cube is a perfect (or Platonic) solid because its six faces are identical squares. Kepler believed that these five solids could be fitted together like matryoshka dolls, with an imaginary sphere between each two solids, representing the orbits of the planets. Since five nested solids define exactly six spheres, he thought that these spheres might have something to do with the orbits of the six known planets. To his delight, Kepler found that, with a suitable ordering of the solids, the spheres matched the relative distances of the planets from the Sun with surprising precision. Kepler thus believed he had explained why there were only six known planets and what determined their distance from the Sun. This discovery confirmed Kepler’s belief that there was indeed a perfect geometric pattern in the universe.

The Platonic solids in Kepler’s Mysterium Cosmographicum. Source: Wikipedia.

However, with the discovery of new planets (Uranus and Neptune) many years later, Kepler’s model proved untenable: five solids leave no room for more than six planets. Fortunately, he didn’t live to see these new discoveries and died believing in his theory. At least that was one of the few joys of a scientist whose life was marked by tragedy: his father, a mercenary who abused him; his mother, tried for witchcraft; and several children who died in infancy. Kepler lived in illness and poverty, but he found happiness in science.

“A scientist is happy, not in resting on his attainments but in the steady acquisition of fresh knowledge.” — Max Planck

Kepler: the data scientist

Now that we know a bit about Kepler’s life, how does he relate to data science? First of all, Kepler was one of the first scientists to use data to adapt a theory. Unlike the ancient Greeks, such as Aristotle, Kepler based his laws on empirical data. For Aristotle, knowledge should be based on qualitative observation, logic and deductive reasoning, rather than on systematic experimentation or rigorous measurement as we understand it today.

Furthermore, although he was very religious and had strong beliefs, Kepler based his conclusions entirely on data, albeit reluctantly. He accepted that the orbits of the planets were not circles and that the planets did not move around the Sun at a constant speed. In other words, the data prevailed over his beliefs.

Kepler then tried to find patterns in the data. He didn’t have the methods and models we have today, such as the method of least squares, published by Legendre and Gauss nearly two centuries later, or regression models and multivariate statistics. He also didn’t have computers or even calculators. In other words, he used only trial and error, guided by mathematical and physical principles, mainly geometry.
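
To see what a modern tool would have saved him, here is a minimal sketch of a least-squares fit. It assumes a power law, period = c × distance^p, and estimates the exponent p by fitting a straight line to the logarithms of the data. The planetary values are approximate modern figures (distance in astronomical units, period in years), not Brahe’s observations, and the names `PLANETS` and `fit_power_law` are my own.

```python
import math

# Approximate modern values (an assumption, not Brahe's data):
# mean distance from the Sun in AU, orbital period in years.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(points):
    """Ordinary least squares on log(T) = slope * log(a) + intercept."""
    xs = [math.log(a) for a, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_power_law(list(PLANETS.values()))
print(f"fitted exponent p: {slope:.3f}")
print(f"fitted constant c: {math.exp(intercept):.3f}")
```

The fitted exponent comes out very close to 3/2, which is the third law in another guise: if T is proportional to a^(3/2), then T² is proportional to a³. Note that the power-law form is still chosen by hand here; only its parameters are estimated.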

The scientific method involves observing phenomena, formulating hypotheses and testing theories to understand and predict the behaviour of nature. Kepler was one of the first to apply this approach systematically, integrating empirical data and mathematical analysis to validate his discoveries.

Kepler approached his work like a detective, developing hypotheses based on the observational evidence he gathered. A confirmed hypothesis wasn’t considered universal unless it could be applied consistently to other planets. To test this, Kepler first fitted his model to data from Mars and then checked its applicability to Mercury, Venus and Jupiter. This ensured that his model generalised to new data. Through this meticulous process, Kepler formulated his groundbreaking laws, using a pioneering approach for his time — one that combined empirical observations with rigorous mathematical calculations to describe the motion of the planets.
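
In today’s vocabulary, this is fitting on a training set and checking on held-out data. The sketch below mimics the idea under strong simplifications: it takes the third law as the model, estimates its single constant from Mars alone, and then predicts the periods of Mercury, Venus and Jupiter. The numbers are approximate modern values rather than Brahe’s measurements, and the function names are hypothetical.

```python
# Approximate modern values (an assumption, not Brahe's data):
# mean distance from the Sun in AU, orbital period in years.
MARS = (1.524, 1.881)
HELD_OUT = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Jupiter": (5.203, 11.862),
}


def fit_constant(distance, period):
    """Estimate k in T^2 = k * a^3 from a single planet."""
    return period**2 / distance**3


def predict_period(k, distance):
    """Predict the period of a planet at the given distance."""
    return (k * distance**3) ** 0.5


# "Training": only Mars is used to estimate the constant.
k = fit_constant(*MARS)

# "Testing": planets the model has never seen.
errors = {}
for name, (distance, period) in HELD_OUT.items():
    predicted = predict_period(k, distance)
    errors[name] = abs(predicted - period) / period
    print(f"{name}: predicted {predicted:.3f} yr, "
          f"observed {period:.3f} yr, relative error {errors[name]:.2%}")
```

With these values the relative error stays below one percent for all three held-out planets, which is the kind of evidence that a model generalises rather than merely memorising the planet it was fitted to.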

In short, Kepler analysed the data, identified patterns (mathematical equations) and then verified that the model held true for new data. He set aside his personal beliefs and published his results critically and impartially, allowing others to verify his findings. This is the typical work of a data scientist.

Symbolic regression

In addition to the scientific methodology he developed, Kepler also inspired artificial intelligence. In 1992, John R. Koza, in his book ‘Genetic Programming: On the Programming of Computers by Means of Natural Selection’, introduced a way to automate the discovery of mathematical expressions using genetic programming, an extension of the famous genetic algorithms. The aim was to find a symbolic expression, made up of mathematical operations, variables and constants, that minimised error in relation to the data provided. In other words, it was possible to automatically find the mathematical expression that best described the data. In short, this can be seen as an automation of Kepler’s work.

Later, Michael Schmidt and Hod Lipson developed more advanced and efficient methods for finding these equations, implemented in the Eureqa software. This approach, known as symbolic regression, attempts to find mathematical relationships between variables from raw data in a process similar to Kepler’s, but with the help of modern computing tools.

Whereas in traditional regression the aim is to infer the parameters of a previously defined model so that it best fits the data, in symbolic regression the aim is to discover both the functional structure of the model and its parameters. This allows the algorithm to automatically find the mathematical equation that best describes the relationship between the variables, without needing to know the form of the model beforehand. In this way, we have a method for finding mathematical laws that is directly inspired by the work of Kepler, who also sought to discover the laws underlying the motion of the planets from observational data, without imposing a predetermined functional form.
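
The difference can be illustrated with a toy example. The sketch below is not genetic programming and not how Eureqa works: it simply tries a small, hand-picked list of candidate structures, fits one scale parameter c for each by least squares, and keeps the structure with the smallest error. The candidate list, the function names and the approximate modern planetary values are all my own assumptions.

```python
import math

# Approximate modern values (an assumption, not Brahe's data):
# (mean distance from the Sun in AU, orbital period in years).
DATA = [
    (0.387, 0.241),   # Mercury
    (0.723, 0.615),   # Venus
    (1.000, 1.000),   # Earth
    (1.524, 1.881),   # Mars
    (5.203, 11.862),  # Jupiter
    (9.537, 29.457),  # Saturn
]

# Candidate structures g(a); the model for each is T = c * g(a).
CANDIDATES = {
    "a": lambda a: a,
    "a^2": lambda a: a**2,
    "a^3": lambda a: a**3,
    "sqrt(a)": lambda a: math.sqrt(a),
    "a * sqrt(a)": lambda a: a * math.sqrt(a),
    "log(1 + a)": lambda a: math.log(1 + a),
}


def fit_scale(g, data):
    """Closed-form least-squares estimate of c in T = c * g(a)."""
    numerator = sum(g(a) * t for a, t in data)
    denominator = sum(g(a) ** 2 for a, _ in data)
    return numerator / denominator


def mean_squared_error(g, c, data):
    return sum((c * g(a) - t) ** 2 for a, t in data) / len(data)


# Search over structures (outer loop) and parameters (fit_scale).
results = []
for name, g in CANDIDATES.items():
    c = fit_scale(g, DATA)
    results.append((mean_squared_error(g, c, DATA), name, c))

best_error, best_name, best_c = min(results)
print(f"best structure: T = {best_c:.3f} * {best_name}")
```

On this data the search selects the structure a * sqrt(a) with a constant close to 1, that is, Kepler’s third law. Real symbolic regression systems explore a vastly larger space of expressions, which is why they need search strategies such as genetic programming rather than exhaustive enumeration.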

The expression tree is a data structure that can be used in symbolic regression to represent a mathematical function. Each internal node of the tree represents a mathematical operation (such as addition, multiplication, division, etc.), while the leaves of the tree contain the variables and constants. The final expression of the function is obtained by traversing the tree, applying the operations to the internal nodes, based on the values of the leaves. Think about how the equation in the picture can be obtained by following the components of the tree. Source: Wikipedia.
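
A minimal sketch of such a tree in code may help. The representation is an assumption made for illustration: an internal node is a tuple whose first element names the operation, and a leaf is either a number (a constant) or a string (a variable name). The example tree encodes sqrt(a × a × a), the form of Kepler’s third law when the period is measured in years and the distance in astronomical units.

```python
import math
import operator

# Supported operations (a deliberately tiny, hypothetical set).
BINARY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}
UNARY = {"sqrt": math.sqrt}


def evaluate(node, variables):
    """Recursively evaluate an expression tree."""
    if isinstance(node, (int, float)):  # leaf: constant
        return node
    if isinstance(node, str):           # leaf: variable
        return variables[node]
    op, *children = node                # internal node: operation
    args = [evaluate(child, variables) for child in children]
    if op in BINARY:
        return BINARY[op](*args)
    return UNARY[op](*args)


def to_infix(node):
    """Render the tree as a readable formula."""
    if not isinstance(node, tuple):
        return str(node)
    op, *children = node
    if op in UNARY:
        return f"{op}({to_infix(children[0])})"
    left, right = children
    return f"({to_infix(left)} {op} {to_infix(right)})"


# T = sqrt(a * (a * a))
tree = ("sqrt", ("*", "a", ("*", "a", "a")))

print(to_infix(tree))                # sqrt((a * (a * a)))
print(evaluate(tree, {"a": 4.0}))    # 8.0
```

A symbolic regression algorithm works on structures like this one: it swaps subtrees, changes operations and adjusts constants, then evaluates each candidate tree against the data to measure how well it fits.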

Symbolic regression has been applied to a wide range of problems in science, engineering and technology, particularly in contexts where the relationships between variables are not well understood or where interpretable mathematical models are sought. In physics, for example, it has been used to identify mathematical relationships underlying physical phenomena directly from experimental data. In biology, it can help identify differential equations that describe the behaviour of populations of cells or pathogens. In economics, it is used to discover equations that predict market movements or identify relationships between economic indicators. Essentially, we supply the data and mathematical operators to the model, which then finds the most likely equations. In this way, symbolic regression complements and enhances scientific work.

The artificial scientist

An interesting question that arises with symbolic regression is: since it automates the discovery process, could it replace the work of a scientist? At first glance, the answer seems to be yes, but the reality is more complex. In physical systems, for example, fundamental laws such as the conservation of energy and momentum must be respected, and many of the expressions produced by symbolic regression fit the data well without being physically meaningful. Ockham’s razor reminds us to prefer the simplest model that explains the data, but the balance is delicate: overly complex expressions overfit and are hard to interpret, while overly simple ones may fail to capture the complexity of phenomena.

Therefore, symbolic regression does not replace the scientist, but acts as a powerful tool that enhances their research capabilities. It automates parts of the scientific process, but the stages of hypothesis formulation, theoretical and experimental validation, and creative interpretation remain, at the current state of the art, an exclusively human domain.

Symbolic regression can therefore be seen as a modern tool that extends the kind of work done by Kepler, making it easier to identify laws and relationships more efficiently and often with less human intervention. It allows scientists to discover mathematical equations that can model phenomena observed in various fields such as physics, biology and other disciplines. Although it is still evolving, symbolic regression already shows enormous potential to provide the interpretability needed to tackle complex problems and drive scientific progress.

In a recent study, for example, researchers showed that it is possible to automatically rediscover equations from Richard Feynman’s Lectures on Physics from data alone. The algorithm, called AI Feynman, combines neural network fitting with physics-inspired techniques and recovered all 100 equations tested. However, the algorithm cannot determine whether the laws are physically correct. In other words, it has no understanding of the laws of physics. But perhaps it’s only a matter of time. Only time will tell if we’ll be able to replace Kepler.

If you’re curious about my research, visit this link: https://sites.icmc.usp.br/francisco.

See you next time!
