Thursday, July 24, 2014

Wolpert: "The Lack of A Priori Distinctions Between Learning Algorithms" (1996)

I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function f: X → Y when evaluated on input values that did not occur in our training data. We do so by selecting an estimate hX → Y, hoping that h(x) will agree with f(x) on the points x we haven't yet seen.

If there no restrictions on how f can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever h we choose, Wolpert can always choose f so as to disagree with every single choice that our h makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.

Extremely Large and Incredibly Bold

Let me just elaborate this last reformation a bit more:

Suppose your data consists of a binary sequence of length m, and your goal is to predict the next n binary digits in the sequence. If all of the 2n + m possible sequences are in your hypothesis space, then the mutual information between the test and training set is
I(train; test)  =  H(test)  –  H(test | train)  =  2n  –  2n + m/2m  =  0.
In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).

If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:
  • A loss measure L is "homogenous" if our choice of the estimate h is independent of the sum of the loss probabilities given the correct answers, that is,
    Σy Pr{L = c | y}.
  • For finite |X| and |Y|, we can define a learning problem as uniform if all of the |Y||X| possible functions from X to Y are equally probable.
  • When we use a homogenous loss function in a uniform learning problem, then
    Σy Pr{L = c | y}  =  1/|Y| Σy Pr{L = c | y} Pr{y}  =  1/|Y| Pr{L = c}
    is independent of h, and therefore E[L | h] = E[L].
Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.

The Don't-Get-Cute Condition

The zero-one loss function (which gives you one point for each correct answer) is "homogenous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for f(x), since extreme guesses are panelized more heavily under a uniform selection scheme.

As an example, consider the loss function L(f(x), h(x)) = |f(x) – h(x)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that L = 4, for instance, depends heavily on your estimator h, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogenous," and neither is ordinary squared deviations.

The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He thus states that if we are using a homogenous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of just that theory. He elaborates:
Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) 
However according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generlizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)
But Wolpert is confusing two criteria of success here. There is a difference between identifying a long-run frequency — which is the goal in both Bernoulli's theorem and the VC theorem — and predicting the exact outcomes of a series of experiments.

The various laws of large numbers give worst-case guarantees against misidentifying a random process, but they don't say that you'll eventually be able to predict it. Even if you know with certainty that some process is a series of independent coin flips, you have no chance of guessing the next outcome above chance level.

This is not because low empirical error on the frequency estimation mischaracterizes your error on the test data. Your empirical error is just going to be uniformly high when the input sequence is a series of coin flips (assuming that all your hypotheses model the data-generating process as i.d.d.). So unless you introduce unwarranted dependence assumptions, the data will quickly reveal that there is no structure to be discovered, and thus that no hypothesis will ever do well on exact prediction.

No comments :

Post a Comment