# II. Occam’s Fat-Shattering Razor

The Intelligent Design folks ardently point out the miraculous nature of life, despite being labeled pseudoscientists by the scientific community at large. And no one can deny that the amazing order we see in biological systems has the feel of some sort of intelligent design, scientifically true or not. The trouble is that these folks postulate an Intelligent Designer behind all these miracles. They may, in fact, be correct, but there is a problem with this kind of hypothesis: it can be used to explain anything! If we ask “how did plants come to use photosynthesis as a source of energy?”, we answer: “the Designer designed it that way”. And if we ask “how did the eye come to exist in so many animal species?”, again we can only get “the Designer designed it that way”. The essential problem is that this class of hypotheses has infinite complexity.

“It may seem natural to think that, to understand a complex system, one must construct a model incorporating everything that one knows about the system. However sensible this procedure may seem, in biology it has repeatedly turned out to be a sterile exercise. There are two snags with it. The first is that one finishes up with a model so complicated that one cannot understand it: the point of a model is to simplify, not to confuse. The second is that if one constructs a sufficiently complex model one can make it do anything one likes by fiddling with the parameters: a model that can predict anything predicts nothing.” – John Maynard Smith and Eörs Szathmáry (Hat tip Gregory Chaitin)

The field of learning theory forms the foundation of machine learning. It contains the secret sauce behind many of today’s amazing artificial intelligence applications: image recognition on par with humans, self-driving cars, the Jeopardy! champion Watson, and the remarkable 9-dan Go program AlphaGo [see Figure 2]. These achievements shocked people all over the world by showing how far and how fast artificial intelligence had advanced. Half of this secret sauce is a sound mathematical understanding of complexity in computer models (a.k.a. hypotheses) and how to measure it. In effect, learning theory has quantified the philosophical principle of Occam’s razor, which says that the simplest explanation is the correct one: we can now measure the complexity of explanations. Early discoveries in the 1970s produced the concept of the VC dimension, named for its discoverers, Vladimir Vapnik and Alexey Chervonenkis (its generalization to real-valued function classes is known as the “fat-shattering” dimension). This property of a hypothesis class measures the largest number of observations that it is guaranteed to be able to explain, no matter how they are labeled. Recall that a polynomial with, say, 11 parameters, such as:

$P(x)=c_0+c_1x^1+c_2x^2+c_3x^3+c_4x^4+c_5x^5+c_6x^6+c_7x^7+c_8x^8+c_9x^9+c_{10}x^{10}$

can be fit exactly to any 11 data points [see Figure 1]. This hypothesis class is said to have a VC dimension of 11. Don’t expect such a function to find any underlying patterns in the data, though! When a function of this complexity is fit to an equal number of data points, it is likely to over-fit. The key to having a hypothesis generalize well, that is, make predictions that are likely to be correct, is having it explain a much greater number of observations than its complexity.
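The “guaranteed to explain, no matter how labeled” definition can be checked by brute force for tiny hypothesis classes. As a sketch (my own illustration, not from the original text), the Python below tests whether one-dimensional threshold classifiers, functions of the form 1[x ≥ t] or 1[x ≤ t], can realize every possible labeling of a point set. They shatter any 2 distinct points but no set of 3, so their VC dimension is 2:

```python
import itertools
import numpy as np

def can_realize(points, labels):
    """Check whether some threshold classifier 1[x >= t] or 1[x <= t]
    produces the given labels on the given sorted 1-D points."""
    # Candidate thresholds: below all points, between neighbors, above all points
    candidates = np.concatenate(
        ([points.min() - 1.0], (points[:-1] + points[1:]) / 2.0, [points.max() + 1.0])
    )
    for t in candidates:
        for direction in (1, -1):
            pred = (direction * (points - t) >= 0).astype(int)
            if np.array_equal(pred, np.array(labels)):
                return True
    return False

def is_shattered(points):
    """True if threshold classifiers realize every labeling of the points."""
    return all(can_realize(points, labels)
               for labels in itertools.product([0, 1], repeat=len(points)))

two = np.array([0.0, 1.0])
three = np.array([0.0, 1.0, 2.0])
print(is_shattered(two), is_shattered(three))  # True False
```

The labeling (0, 1, 0) on three points is what defeats this class: no single threshold can separate the middle point from its neighbors.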

Figure 1: Noisy (roughly linear) data fitted with both a linear and a polynomial function. Although the polynomial is a perfect fit, the linear function can be expected to generalize better. In other words, if the two functions were used to extrapolate beyond the fitted data, the linear one would make better predictions. Image and caption b
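An experiment along the lines of Figure 1 can be sketched with NumPy (the data and seed here are an invented example, not the figure’s actual data): fit both a line and a degree-10 polynomial to 11 noisy, roughly linear points, then compare how the two models extrapolate just outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)

# 11 training points from a roughly linear relationship with noise
x_train = np.linspace(0.0, 10.0, 11)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 1.0, size=11)

# Degree-10 polynomial: 11 coefficients, enough to interpolate all 11 points
poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=10)
# Linear model: only 2 coefficients
line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)

# Training error: the polynomial passes (essentially) through every point
poly_train_err = np.max(np.abs(poly(x_train) - y_train))
line_train_err = np.max(np.abs(line(x_train) - y_train))

# Extrapolation: evaluate both models just beyond the training range
x_test = np.array([11.0, 12.0])
y_true = 2.0 * x_test + 1.0
poly_test_err = np.max(np.abs(poly(x_test) - y_true))
line_test_err = np.max(np.abs(line(x_test) - y_true))

print(poly_train_err, line_train_err)  # polynomial: ~0; line: noise-sized residuals
print(poly_test_err, line_test_err)    # polynomial blows up; line stays close
```

The polynomial’s zero training error is exactly the VC-dimension story: 11 parameters can always explain 11 points, so the perfect fit tells us nothing about the underlying pattern.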

Nowadays, measures of complexity have become much more refined: techniques such as margin maximization in support vector machines and regularization in neural networks have the effect of reducing the effective explanatory power of a hypothesis class, thereby limiting its complexity and causing the model to make better predictions. Still, the principle is the same: the key to a hypothesis making accurate predictions is managing its complexity relative to the number of known observations it explains. This principle applies whether we are trying to learn how to recognize handwritten digits, how to recognize faces, how to play Go, how to drive a car, or how to identify “beautiful” works of art. Further, it applies to all mathematical models that learn inductively, that is, via examples, whether machine or biological. When a model fits the data with a reasonable complexity relative to the number of observations, we are confident it will generalize well. The model has come to “understand” the data, in a sense.
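Regularization can be made concrete with a small sketch (again an invented illustration, not the article’s own example): closed-form ridge regression on the same degree-10 polynomial features. The penalty term shrinks the coefficient vector, reducing the model’s effective explanatory power without changing its number of parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy, roughly linear data, as in the running example
x = np.linspace(0.0, 10.0, 11)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=11)

# Degree-10 polynomial features on inputs rescaled to [-1, 1] for stability
t = (x - 5.0) / 5.0
X = np.vander(t, N=11, increasing=True)  # columns: t^0, t^1, ..., t^10

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes ||Xw - y||^2 + lam * ||w||^2."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_interp = np.linalg.lstsq(X, y, rcond=None)[0]  # unregularized: interpolates the data
w_ridge = ridge_fit(X, y, lam=1.0)               # penalized: much smaller coefficients

# The penalty shrinks the coefficient vector, limiting effective complexity
print(np.linalg.norm(w_interp), np.linalg.norm(w_ridge))
```

The regularized model still has 11 coefficients, but the penalty keeps it from using them to chase noise: a concrete sense in which effective complexity, not raw parameter count, governs generalization.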

Figure 2: The game of Go. The AI application AlphaGo defeated one of the best human Go players, Lee Sedol, 4 games to 1 in March 2016. Image by Goban1 via Wikimedia Commons.

The hypothesis of Intelligent Design, simply put, has infinite VC dimension and can therefore be expected to have no predictive power, and that is what we see (unless, of course, we can query the Designer!). But before we jump on Darwin’s bandwagon, we need to face a very grim fact: the hypothesis class characterized by “we must have learned that during X billion years of evolution” also has the capacity to explain just about anything! Just think of the zillions of times this has been invoked, almost axiom-like, in the journals of scientific research!