The Myth of the Data-Driven Discovery: A Popperian Critique of Induction

Most people think science proceeds by “induction”: the distillation of general laws from repeated observations.

With AI/LLMs, the popular belief is that if we simply feed enough data into a powerful enough processor, the “truth” will emerge as a statistical necessity.

But knowledge is not extracted from the world through observation; it is created within the mind through bold, imaginative conjecture and rigorous criticism.

The Illusion of Induction

Induction is traditionally defined as the inference of a general law from particular instances. If you observe 1,000 white swans, you are supposedly justified in “inducing” the law that “all swans are white.” In modern science, this is often rebranded as “Bayesian inference” or “pattern recognition.” We believe that by observing the positions of the planets or the behavior of subatomic particles, we can “calculate” the laws of nature.

People believe science works this way because it feels intuitive: we see a pattern and project it into the future. Induction promises certainty and a “recipe” for discovery. If it were real, science would be a mechanical process of data collection. But as Karl Popper and David Deutsch argue, this picture fails to account for how every major scientific revolution actually occurred, and it rests on a fundamental misconception of how knowledge grows. The strongest critique of induction is that it is logically and physically impossible.

The Searchlight: Why Observation Follows Theory

The first failure of induction is that “pure observation” does not exist. To observe, one must first decide what is worth looking at. If a scientist is told to “record observations,” they must immediately choose a frame of reference: Are they recording the temperature, the color of the walls, or the position of Jupiter?

Popper proposed the Searchlight Theory: Our theories act as searchlights that illuminate specific parts of reality. We do not gather data and then form a theory; we form a “conjecture” (a creative guess) and then use data to test it. Even the most “data-driven” AI must be programmed with a “model” or a “cost function” that tells it which patterns to prioritize. The theory always comes first.
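The point that the criterion always comes first can be made concrete with a minimal sketch (the data here are invented for illustration): the same observations yield different “patterns” depending on which cost function was chosen before looking.

```python
import numpy as np

# The same raw observations, "summarized" under two different cost functions.
data = np.array([1.0, 1.0, 1.0, 1.0, 100.0])

# Minimizing squared error selects the mean as the "law" in the data...
best_l2 = data.mean()      # 20.8
# ...while minimizing absolute error selects the median.
best_l1 = np.median(data)  # 1.0

# Which "pattern" the machine finds depends on the criterion we chose first.
print(best_l2, best_l1)
```

Neither summary is dictated by the data alone; the choice of loss is itself a prior theoretical commitment, exactly the searchlight Popper describes.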

The Infinite Curve: The Failure of Pattern Recognition

Even if we could collect “pure” data, induction offers no way to choose between competing theories. This is known as the Curve-Fitting Problem. For any finite set of data points, there are an infinite number of mathematical curves that can pass through them.
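A toy sketch makes the Curve-Fitting Problem tangible (the observations are invented for this example): two theories agree perfectly on every observed point, yet make wildly different predictions about the very next one.

```python
import numpy as np

# Five "observations": points that happen to lie on a straight line.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs + 1.0  # the "pattern" an inductivist would extract

# Theory A: the straight line y = 2x + 1 (a degree-1 least-squares fit).
line = np.polynomial.Polynomial.fit(xs, ys, deg=1)

# Theory B: the same line plus a term that vanishes at every observed
# point, so it also passes through all five observations exactly.
def wiggly(x):
    return 2.0 * x + 1.0 + np.prod([x - xi for xi in xs], axis=0)

# Both theories fit all observed data perfectly...
assert np.allclose(line(xs), ys)
assert np.allclose(wiggly(xs), ys)

# ...yet diverge wildly on the next observation.
x_new = 5.0
print(line(x_new))    # ≈ 11.0
print(wiggly(x_new))  # 131.0
```

And this is only two of the infinitely many curves through those five points; nothing in the data itself selects among them.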

An inductive machine (or a “High-Temperature LLM,” as discussed by Dwarkesh Patel and Terence Tao) might find a pattern that fits past data perfectly but fails the moment a new variable is introduced. For instance, before Einstein, centuries of data “induced” the law that time is absolute. It took a creative leap—a conjecture that defied all previous “patterns”—to realize that time is relative. Induction would never have produced Special Relativity because Special Relativity broke the very patterns induction would have relied upon.

The “Hard-to-Vary” Criterion

David Deutsch takes Popper’s critique a step further by identifying what makes a theory “good.” It isn’t its “probability” or how well it fits a trend line; it is how hard it is to vary.

A theory like “the gods are angry” can explain a drought, but it can also explain a flood; it is “easy to vary” and therefore explains nothing. Conversely, Kepler’s laws or the DNA double helix are “hard to vary.” Every piece of the explanation is functionally necessary to the whole. Induction cannot produce hard-to-vary explanations because it only deals with correlations (what happens), not explanations (why it happens).

The Black Swan and the Creative Leap

Finally, the logic of induction fails because the future does not have to resemble the past. No matter how many white swans you see, the statement “all swans are white” is a guess about the unseen.

Knowledge grows through Conjectures and Refutations. We make a wild, creative leap to explain a problem, and then we use the “verification loop”—which is actually a falsification loop—to try and kill the idea. If the idea survives our best attempts to prove it wrong, we tentatively accept it as “knowledge.”

Current AI/LLMs are “probability engines” that excel at mimicking the past. But true science is the history of people who looked at the “proven” patterns of the past and dared to conjecture that they were wrong. As long as we view science as an inductive process, we miss the most important ingredient of progress: the human ability to create something that has never been seen before.