<p><em>Clayton’s Blog, by Clayton Sanford: random thoughts about machine learning, algorithms, math, running, living in New York, and more.</em></p>
<h1>[OPML#7] BLN20 &amp; BS21: Smoothness and robustness of neural net interpolators (2021-09-22)</h1>
<p><em>This is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>This post discusses two papers by Sebastian Bubeck and his collaborators that are of interest to the study of over-parameterized neural networks. The first, <a href="https://arxiv.org/abs/2009.14444" target="_blank">“A law of robustness for two-layers neural networks” (BLN20)</a> with Li and Nagaraj, gives a conjecture about the “robustness” of a two-layer neural network that interpolates all of the training data. The second, <a href="https://arxiv.org/abs/2105.12806" target="_blank">“A universal law of robustness via isoperimetry” (BS21)</a> with Sellke, proves part of the conjecture and extends that part of the conjecture to deeper neural networks.
The other part of the conjecture remains open for future work to tackle.</p>
<p>Both papers consider a setting where there are \(n\) training samples \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}\) drawn from some distribution that are fit by a neural network with \(k\) neurons.
For the two-layer case (which we’ll focus on in this writeup), they consider neural networks of the form</p>
\[f(x) = \sum_{j=1}^k u_j \sigma(w_j^T x + b_j),\]
<p>where \(\sigma(t) = \max(0, t)\) is the ReLU activation function and \(w_j \in \mathbb{R}^d\) and \(b_j, u_j \in \mathbb{R}\) are the parameters.
Roughly, they ask whether there exists a “smooth” neural network \(f\) such that \(f(x_i) \approx y_i\) for all \(i \in [n]\); this makes \(f\) an approximate interpolator.</p>
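<p>As a concrete reference point, this function class is only a few lines of numpy (a minimal illustrative sketch; the function and variable names are my own, not from the papers):</p>

```python
import numpy as np

def two_layer_relu(x, W, b, u):
    """f(x) = sum_j u_j * relu(w_j^T x + b_j) for a width-k network."""
    pre = W @ x + b                 # shape (k,): w_j^T x + b_j for each neuron
    return u @ np.maximum(pre, 0.0)

# A random width-k network in dimension d.
rng = np.random.default_rng(0)
d, k = 5, 3
W = rng.standard_normal((k, d))     # rows are the w_j
b = rng.standard_normal(k)
u = rng.standard_normal(k)
x = rng.standard_normal(d)
print(two_layer_relu(x, W, b, u))   # a scalar output
```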
<p><em>How does this relate to the rest of this blog series?</em>
All of the other posts so far have been about cases where over-parameterized linear regression leads to favorable generalization performance.
These generalization results occur due to the smoothness of the linear prediction rule.
That is, if we have some prediction rule \(x \mapsto \beta^T x\) for \(x, \beta \in \mathbb{R}^d\) with \(d \gg n\), we might have good generalization if \(\|\beta\|_2\) is small, which is enabled when \(d\) is very large.
The same observation holds up with neural networks (over-parameterized models lead to benign overfitting), but it’s harder to prove why it leads to a small generalization error.
Understanding the smoothness of interpolating neural networks may therefore make it easier to prove generalization bounds for networks that perfectly fit the training data.</p>
<p><em>How do they measure smoothness?</em>
For linear regression, it’s natural to think of the smoothness of the prediction rule \(f_{\text{lin}}(x) = \beta^T x\) as \(\|\beta\|_2\), since that is the magnitude of the gradient \(\|\nabla f_{\text{lin}}(x)\|_2\) at every sample \(x\).
For two-layer neural networks—which are non-linear functions—it’s natural instead to consider the maximum norm of the gradient of \(f\), which is represented by the Lipschitz constant of \(f\): the minimum \(L\) such that \(|f(x) - f(x')| \leq L \|x - x'\|_2\) for all \(x, x'\). (Lipschitzness also comes up frequently in my <a href="/2021/08/15/hssv21.html" target="_blank">COLT paper about the approximation capabilities of shallow neural networks</a>.)</p>
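<p>One crude but useful way to make this concrete is to lower-bound the Lipschitz constant empirically by sampling random pairs of points and taking the largest difference quotient; for a linear rule, the estimate should approach \(\|\beta\|_2\) from below. (This is my own illustrative sketch, not a procedure from the papers.)</p>

```python
import numpy as np

def lipschitz_lower_bound(f, d, n_pairs=2000, seed=0):
    """Lower-bound the Lipschitz constant of f by sampling random pairs
    of points and taking the largest difference quotient."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, xp = rng.standard_normal(d), rng.standard_normal(d)
        best = max(best, abs(f(x) - f(xp)) / np.linalg.norm(x - xp))
    return best

rng = np.random.default_rng(1)
beta = rng.standard_normal(10)
est = lipschitz_lower_bound(lambda x: beta @ x, d=10)
# For a linear rule the true Lipschitz constant is exactly ||beta||_2,
# so the sampled estimate should sit just below it.
print(est, np.linalg.norm(beta))
```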
<p><em>What does it have to do with robustness?</em>
Typically, robustness is discussed in the context of adversarial examples.
If you’ve hung around the ML community, you’ve probably seen this issue featured in images like this:</p>
<p><img src="/assets/images/2021-09-22-bubeck/panda.png" alt="" /></p>
<p>Here, an image of a panda is provided that a trained image classification neural network clearly identifies as such.
However, a small amount of noise can be added to the image that leads to the network being tricked into thinking that it’s a gibbon instead.
Put roughly, it means that the network outputs \(f(x) = \text{"panda"}\) and \(f(x + \epsilon \tilde{x}) = \text{"gibbon"}\) for some \(x\) and \(\tilde{x}\), which means that the output of \(f\) changes greatly near \(x\).
By mandating that \(f\) have a small Lipschitz constant, these kinds of fluctuations are impossible.
This makes the network \(f\) <em>robust</em>.
Thus, enforcing smoothness conditions is a way to ensure that a predictor is robust to these kinds of adversarial examples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/smooth.jpeg" alt="" /></p>
<p>As a result, Bubeck and his collaborators want to characterize the availability of interpolating networks \(f\) that are also robust, with the hopes of understanding how over-parameterization can be used to avoid having adversarial examples.</p>
<p>One important caveat: Unlike the previous papers discussed in this series, this one focuses only on approximation and not optimization.
It asks whether <em>there exists</em> an interpolating prediction rule that is smooth, but it does not ask whether this rule can be easily obtained from stochastic gradient descent.</p>
<p>For the rest of the post, I’ll discuss the conjecture made by BLN20, share the support for the conjecture that was provided by BLN20 and BS21, and discuss what remains to be studied in this space.</p>
<h2 id="the-conjecture">The conjecture</h2>
<p>For simplicity, BLN20 considers only samples drawn uniformly from the unit sphere: \(x \in \mathbb{S}^{d-1}= \{x \in \mathbb{R}^d: \|x\|_2=1\}\) with iid labels \(y_i \sim \text{Unif}(\{-1,1\})\).
The conjecture of BLN20, which combines their Conjectures 1 and 2, is as follows:</p>
<p><em>Consider some \(k \in [\frac{cn}{d}, Cn]\) for constants \(c\) and \(C\). With high probability over \(n\) random samples from some distribution, there exists a 2-layer neural network \(f\) of width \(k\) that perfectly fits the data such that \(f\) is \(O(\sqrt{n/k})\)-Lipschitz.
Furthermore, any neural network that fits the data must be \(\Omega(\sqrt{n/k})\)-Lipschitz with high probability.</em></p>
<p>If true, the conjecture suggests there can only be an \(O(1)\)-Lipschitz interpolating neural network \(f\) if the model is highly over-parameterized, or \(k = \Omega(n)\).
Note that \(k\) is the number of neurons, and not the number of parameters.
In the case of a 2-layer neural network, the number of parameters is \(p = kd\), so there must be at least \(p = \Omega(nd)\) parameters for the interpolating network to be smooth.</p>
<p>The conditions with constants \(c\) and \(C\) are necessary for the question to be well-posed.</p>
<ul>
<li>Without the \(k \leq Cn\) constraint, the conjecture would imply the existence of neural networks that fit the data and are \(o(1)\)-Lipschitz. However, this is not possible unless all training samples have the same label \(y_i\); otherwise, there are at least two different samples \(x_i\) and \(x_j\) that are at most distance 2 apart (since both lie on \(\mathbb{S}^{d-1}\)) and have opposite labels. This implies that any function fitting both samples must be at least 1-Lipschitz.</li>
<li>Without the \(k \geq \frac{cn}{d}\) constraint, there is unlikely to be any neural network with \(k\) neurons that can fit the \(n\) samples. Since the number of parameters \(p\) is roughly \(kd\), letting \(k \ll \frac{n}{d}\) would ensure that \(p \ll n\), leaving fewer parameters than samples. Intuitively, it’s difficult to fit a large number of points with random labels when there are fewer parameters than samples. This suggests that the model must be over-parameterized for interpolation to even occur in the first place, let alone be smooth.</li>
</ul>
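<p>The distance argument in the first bullet is easy to check numerically: any function fitting two opposite-label points on the sphere must have Lipschitz constant at least \(2 / \|x_i - x_j\|_2 \geq 1\), since points on the unit sphere are at distance at most 2. (A toy sketch of my own:)</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # project onto S^{d-1}
y = rng.choice([-1.0, 1.0], size=n)

# Any f fitting opposite-label points x_i, x_j satisfies
# L >= |y_i - y_j| / ||x_i - x_j|| = 2 / ||x_i - x_j|| >= 1.
best = 0.0
for i in range(n):
    for j in range(i + 1, n):
        if y[i] != y[j]:
            best = max(best, 2.0 / np.linalg.norm(X[i] - X[j]))
print("Lipschitz lower bound forced by the data:", best)  # always >= 1
```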
<p>BLN20 shows that the conjecture holds up empirically on toy data.
For many values of \(n\) and \(k\), they train several neural networks to fit the \(n\) samples with 2-layer neural networks of width \(k\) and randomly sample gradients to find the one with the largest magnitude.
When plotted, they note a nice linear relationship between the norms of the largest random gradient and \(\sqrt{n/k}\).
Of course, the maximum random gradient is not the same as the Lipschitz constant, since it’s impossible to check the gradient for all values of \(x\) simultaneously, but this suggests that it’s likely that the conjecture is correct.</p>
<p><img src="/assets/images/2021-09-22-bubeck/plot.png" alt="" /></p>
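<p>Their gradient-sampling procedure is easy to mimic, since the gradient of a 2-layer ReLU network has the closed form \(\nabla f(x) = \sum_j u_j \mathbb{1}[w_j^T x + b_j > 0] w_j\). The sketch below does this for a random (untrained) network, whereas BLN20 sample gradients of trained networks; names here are my own.</p>

```python
import numpy as np

def max_sampled_grad_norm(W, b, u, n_samples=5000, seed=0):
    """Lower-bound the Lipschitz constant of f(x) = sum_j u_j relu(w_j^T x + b_j)
    by sampling points on the sphere and evaluating the exact gradient
    grad f(x) = sum_j u_j 1[w_j^T x + b_j > 0] w_j at each one."""
    rng = np.random.default_rng(seed)
    d = W.shape[1]
    X = rng.standard_normal((n_samples, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on S^{d-1}
    active = (X @ W.T + b > 0).astype(float)        # (n_samples, k): ReLU on/off
    grads = (active * u) @ W                        # (n_samples, d): gradients
    return np.linalg.norm(grads, axis=1).max()

rng = np.random.default_rng(0)
d, k = 20, 10
W = rng.standard_normal((k, d))
b, u = rng.standard_normal(k), rng.standard_normal(k)
print(max_sampled_grad_norm(W, b, u))
```

<p>As the post notes, this only lower-bounds the true Lipschitz constant; it can never exceed \(\sum_j |u_j| \|w_j\|_2\).</p>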
<h2 id="partial-upper-bounds-from-bln20">Partial upper bounds from BLN20</h2>
<p>The BLN20 paper focuses on presenting the conjecture and giving a series of partial results that suggest it may be true. In this section, we give a brief summary of each of the partial solutions.</p>
<p>The following are all partial solutions to the upper bound. That is, they show weaker versions of the claim that there exists a neural network \(f\) with Lipschitz constant \(O(\sqrt{n/ k})\), by showing either larger bounds on the Lipschitz constant or more restrictive parameter regimes.</p>
<ul>
<li><strong>The high-dimensional case (3.1).</strong> If \(d \gg n\), then a ReLU network with a single neuron \(k = 1\) can be used to perfectly fit the data.
This is because a single \(d\)-dimensional hyperplane will be able to fit the \(n\) samples, so one can just choose the hyperplane with the lowest magnitude that fits the data and use a ReLU that corresponds to that hyperplane. By similar analysis to that of linear regression, the Lipschitz constant of this network will be \(O(\sqrt{n})\) with high probability, which is the same as \(O(\sqrt{n/ k})\). This can’t be improved without using more neurons.
<img src="/assets/images/2021-09-22-bubeck/single.jpeg" alt="" /></li>
<li><strong>The wide (“optimal size”) regime: \(k = n\) (3.2).</strong> With high probability, a \(10\)-Lipschitz network \(f\) can be provided by using a ReLU for every sample. Each ReLU is treated as a “cap” that gives a sample the correct label. With high probability, the points will be sufficiently spread apart in \(\mathbb{S}^{d-1}\) to ensure that none of the caps overlap. This makes the norm of the gradient never more than \(10\), if each cap is offset by \(\frac{1}{10}\).
<img src="/assets/images/2021-09-22-bubeck/cap.jpeg" alt="" /></li>
<li><strong>The compromise case (3.3).</strong> The two previous approaches can be combined for a broader choice of \(k\) and \(n\) by instead having each ReLU perfectly fit \(m := n/k \leq d\) samples in a cap. However, since these are bigger and more complex caps than before, we need to be more concerned about the caps overlapping. They show that \(O(m \log d)\) caps will overlap at any given point, which means that the Lipschitz constant will be \(O(n\log (d) / k)\). Even disregarding the logarithmic factor, this is still much weaker than the \(O(\sqrt{n/k})\) bound that the conjecture desires.
<img src="/assets/images/2021-09-22-bubeck/combo.jpeg" alt="" /></li>
<li><strong>The very low-dimensional case with a weird architecture (3.4).</strong>
They prove the existence of a neural network that fits \(n\) samples and has Lipschitz constant \(O(\sqrt{n / k})\) with high probability. To do so, however, they need several major caveats:
<ul>
<li>The dimension \(d\) is very small; for some constant even integer \(q\), \(k = C_q d^{q-1}\) and \(n \approx \frac{d^q}{100 q \log d}\), where \(C_q\) depends on \(q\). Note that the number of neurons \(k\) can be much bigger than the number of samples \(n\) when \(d\) is very small and \(q\) is large.</li>
<li>\(f\) approximately interpolates the samples. That is, \(\lvert f(x_i) - y_i\rvert \leq 0.1 C_q\) for all \(i \in [n]\). (Note that 0.1 can be replaced by \(\epsilon\) and the result can be generalized.)</li>
<li>The neural network uses the activations \(t \mapsto t^q\) and not the ReLU function.</li>
</ul>
<p>This can be thought of as a tensor interpolation problem. Specifically, for \(q = 2\), they perform regression on the space \(x^{\otimes 2} = (x_1^2, x_1x_2, \dots, x_1 x_d, \dots, x_2x_1, x_2^2, \dots, x_d^2)\) using the quadratic activation function.
This approach gives the kind of bound they’re looking for, but is a strange enough case that it’s unclear how to extend this to networks with (1) high input dimensions, (2) perfect interpolation, and (3) ReLU activations.</p>
</li>
</ul>
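<p>To make the \(k = n\) “cap” construction (3.2) concrete, here is a toy rendering of my own (the paper’s construction differs in its details): each neuron fires only in a small cap around its own sample, so \(10\, y_i\, \sigma(\langle x, x_i \rangle - 0.9)\) outputs exactly \(y_i\) at \(x = x_i\) and, in high dimension, zero at every other sample.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2000, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # samples on S^{d-1}
y = rng.choice([-1.0, 1.0], size=n)

def cap_network(x):
    # One ReLU "cap" per sample: 10 * y_i * relu(<x, x_i> - 0.9).
    # Random sphere points have tiny inner products when d >> log n,
    # so each cap contains exactly one sample.
    return np.sum(10.0 * y * np.maximum(X @ x - 0.9, 0.0))

preds = np.array([cap_network(X[i]) for i in range(n)])
print(np.allclose(preds, y))  # interpolation succeeds when no caps overlap
```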
<p>The paper also gives a few constrained versions of the lower bound on the Lipschitz constant for any interpolating function. However, we omit them here because the second paper—BS21—has much better lower bounds.</p>
<h2 id="lower-bound-from-bs21">Lower bound from BS21</h2>
<p>The follow-up paper proves a mostly-tight lower bound, which effectively resolves half of the conjecture.
The results require <em>isoperimetry</em> to hold, which is true of a random variable \(x \in \mathbb{R}^d\) if \(f(x)\) has subgaussian tails for every Lipschitz function \(f\).
This holds for well-known distributions such as (1) multivariate Gaussian distributions, (2) the uniform distribution on \(\mathbb{S}^{d-1}\), and (3) the uniform distribution on the hypercube \(\{-1, 1\}^d\).</p>
<p>By combining their Lemma 3.1 and Theorem 3, the following statement is true about 2-layer neural networks:</p>
<p><em>Let \(\mathcal{F}\) be a family of 2-layer neural networks of width \(k\) with parameters in \([-W, W]\). Suppose each sample \((x_i, y_i)\), \(i \in [n]\), is drawn from an isoperimetric distribution with \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\) and such that \(\| x_i \|_2 \leq R\) almost surely. Then, with high probability, any neural network \(f \in \mathcal{F}\) that perfectly fits all \(n\) training samples will have a Lipschitz constant of</em></p>
\[\Omega\left(\sqrt{\frac{n}{k \log (W R nk)}}\right).\]
<p>This is close to the conjecture up to logarithmic factors! In addition, this result is more general in the paper:</p>
<ul>
<li>Instead of considering only depth-2 neural networks, they consider all parametric models that change by bounded amounts as their parameter vectors change.</li>
<li>Within their study of neural networks, their analysis also addresses networks that share parameters.</li>
<li>A parameter \(\epsilon\) allows them to conclude that all networks that <em>nearly interpolate</em> must have high Lipschitz constant, not just those that perfectly fit the data.</li>
</ul>
<p>They also show that the bound \(W\) on the parameter magnitudes is necessary. Through their Theorem 4, they show the existence of a neural network with a small Lipschitz constant that approximately fits nearly all of the samples using only a single (unbounded) parameter.
Thus, without these kinds of assumptions, the conjecture is rendered uninformative.</p>
<p>The proof works by considering some fixed \(L\)-Lipschitz function \(f\) and asking how likely it is that \(n\) random samples are almost perfectly fit by \(f\).
By isoperimetry, this can be shown to happen with very low probability.
Then, by making use of an \(\epsilon\)-net argument, one can show that no \(L\)-Lipschitz function \(f\) can perfectly fit the samples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/cover.jpeg" alt="" /></p>
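<p><em>Schematically, with constants suppressed and the sphere normalization \(R = 1\) (this is my paraphrase of the counting, not the paper’s precise statement): by isoperimetry, a fixed \(L\)-Lipschitz \(f\) concentrates around its mean at scale \(L/\sqrt{d}\), so it cannot track labels with \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\), giving</em></p>

\[\Pr\left[\text{a fixed } f \text{ fits all } n \text{ samples}\right] \leq \exp\left(-\frac{c\, n d}{L^2}\right).\]

<p><em>A family with \(p\) parameters in \([-W, W]\) admits an \(\epsilon\)-net of size \(\exp(O(p \log(WRnk)))\), so a union bound leaves room for some interpolating \(f\) only if \(p \log(WRnk) \gtrsim nd/L^2\), i.e.,</em></p>

\[L = \Omega\left(\sqrt{\frac{nd}{p \log(WRnk)}}\right),\]

<p><em>which, for a 2-layer network with \(p = kd\), matches the stated bound.</em></p>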
<p>While I breezed over the argument here, it’s a relatively simple one that can be followed by most people with some background in concentration inequalities.</p>
<h2 id="further-questions">Further questions</h2>
<p>While the second paper resolves half of the open question from the first paper, the other half (the existence of a smooth interpolating neural network) remains open.</p>
<p>There are also a few caveats from the second paper that remain to be resolved. For one, it may be possible to loosen the restriction that there be non-zero label noise (i.e. \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\)).
In addition, the fact that \(\|x_i\|\) must always be bounded is a weakness, since it rules out Gaussian inputs; perhaps this could be improved.</p>
<p>Thanks for tuning in to this week’s blog post! See you next time!</p>

<h1>[OPML#6] XH19: On the number of variables to use in principal component regression (2021-09-11)</h1>
<p><em>This is the 6th of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>Here’s another <a href="https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf" target="_blank">paper</a> by my advisor Daniel Hsu and his former student Ji (Mark) Xu that discusses when overfitting works in linear regression.
This one differs subtly from some of the previously discussed papers (like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>) in that it considers <em>principal component regression</em> (PCR) rather than least-squares regression.</p>
<h2 id="principal-component-regression">Principal component regression</h2>
<p>Suppose we have a collection of \(n\) samples \((x_i, y_i) \in \mathbb{R}^{N} \times \mathbb{R}\), which we collect in design matrix \(X \in \mathbb{R}^{n \times N}\) and label vector \(y \in \mathbb{R}^n\).
The standard approach to least-squares regression (which has been given numerous times on this blog) is to choose the \(\hat{\beta}_\textrm{LS} \in \mathbb{R}^N\) that minimizes \(\|X \hat{\beta}_\textrm{LS} - y\|_2\), breaking ties by minimizing the \(\ell_2\) norm \(\|\hat{\beta}_{\textrm{LS}}\|_2\).
This approach considers all dimensions of the inputs \(x_i\).</p>
<p>However, there might be a situation where we know the covariance matrix \(\Sigma\) of the inputs a priori and only want to consider the directions in \(\mathbb{R}^N\) along which the inputs meaningfully vary.
This is where <a href="https://en.wikipedia.org/wiki/Principal_component_regression" target="_blank">principal component regression</a> comes in.
Instead of regressing on the training data itself, we regress on the \(p\) most significant dimensions of the data, as identified by <a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal component analysis</a> (PCA).
PCA is a linear dimensionality reduction method that obtains a lower-dimensional representation of \(X\) by approximating each sample as a linear combination of the \(p\) eigenvectors of \(X^T X\) with the largest corresponding eigenvalues.
These \(p\) eigenvectors correspond to the directions in \(\mathbb{R}^N\) where the samples in \(X\) have highest variance.
Moreover, projecting each of the \(n\) samples \(x_i\) onto the space spanned by these \(p\) eigenvectors provides the closest average \(\ell_2\)-approximation of each \(x_i\) as a linear combination of \(p\) fixed vectors in \(\mathbb{R}^N\).</p>
<p>Let \(\mathbb{E}[x_i] = 0\) and \(\Sigma = \mathbb{E}[x_i x_i^T]\) be the covariance matrix of \(x_i\).
If we know \(\Sigma\) ahead of time, then we can simplify things by using only the eigenvectors of \(\Sigma\), rather than the empirical principal components taken from the eigenvectors of \(X^T X\).
If the \(p\) eigenvectors of \(\Sigma\) with the largest eigenvalues are collected in \(V \in \mathbb{R}^{N \times p}\), then we can express the low-dimensional representation of the training samples as \(X V \in \mathbb{R}^{n \times p}\).
By applying linear regression to these new low-dimensional samples and transforming the resulting parameter vector back to \(\mathbb{R}^N\), we get the parameter vector \(\hat{\beta} = V(X V)^{\dagger} y\), where \(\dagger\) denotes the pseudo-inverse.
(On the other hand, the least-squares parameter vector is \(\hat{\beta}_\textrm{LS} = X^{\dagger} y\).)</p>
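<p>The two estimators are only a few lines of numpy apart (an illustrative sketch of my own, using a known diagonal \(\Sigma\) in the spirit of the paper’s setting):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 30, 100, 5

# Known covariance with decaying eigenvalues; since Sigma is diagonal,
# its eigenvectors are the standard basis and V is the first p coordinates.
eigvals = 1.0 / (1.0 + np.arange(N))
X = rng.standard_normal((n, N)) * np.sqrt(eigvals)   # rows x_i ~ N(0, Sigma)
beta = rng.standard_normal(N)
y = X @ beta                                          # noiseless labels

V = np.eye(N)[:, :p]                        # top-p eigenvectors of Sigma
beta_pcr = V @ np.linalg.pinv(X @ V) @ y    # PCR: beta_hat = V (XV)^+ y
beta_ls = np.linalg.pinv(X) @ y             # least squares: X^+ y

print("PCR uses only the top-p directions:", np.allclose(beta_pcr[p:], 0))
```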
<p>The below image visualizes the differences between the least squares and PCR regression algorithms.
It shows a toy example where samples \((x, y)\) (in purple) vary greatly in one direction and not much at all in another direction.
PCR only considers the direction of maximum variance and rules the other out, while least squares considers all directions simultaneously.
Therefore, the hypotheses represented by the green hyperplanes look subtly different for each case.</p>
<p><img src="/assets/images/2021-09-11-xh19/vis.jpeg" alt="" /></p>
<p>Note that this formulation of PCR concerns an idealized setting.
Most regression tasks do not give the learner direct access to \(\Sigma\).
However, it’s possible that \(\Sigma\) could be separately estimated with \(\hat{\Sigma}\) and then applied by PCA.
The authors refer to this as “semi-supervised” because \(\Sigma\) can be estimated using only unlabeled samples, since none of the labels \(y\) are used in the approximation.
Due to the high cost of obtaining labeled data, a sufficient dataset for this kind of estimate may be significantly easier to obtain than a dataset for the general learning task.</p>
<h2 id="learning-model-and-assumptions">Learning model and assumptions</h2>
<p>They make several restrictive assumptions.
The main purpose of this paper is to construct instances where favorable over-parameterization occurs for PCR, rather than exhaustively catalogue when it must occur.</p>
<p>They assume the samples \(x_i\) have independent Gaussian components and that labels \(y_i = \langle x_i, \beta\rangle\) have no noise.
\(\Sigma\) is a diagonal matrix (which must be the case because of the independent components of each \(x_i\)) with entries \(\lambda_1 > \dots > \lambda_N > 0\).
Therefore, PCR will only use the first \(p\) diagonal entries of \(\Sigma\) and the reduced-dimension version of each sample will merely be its first \(p\) entries.</p>
<p>One weird thing about this paper relative to others is that the true parameter vector \(\beta\) is chosen randomly.
This means it’s an “average-case” bound.
They justify this on the grounds that the ability to choose an arbitrary \(\beta\) could lead to all of the weight being put on the \(N-p\) components that will not be included in the PCA’d version of \(X\).
This would make it impossible to have non-trivial error bounds.</p>
<h2 id="over-parameterization-and-pcr">Over-parameterization and PCR</h2>
<p>Now, we have three parameters to consider (\(N, p, n\)), rather than the two (\(p, n\)) typically considered in the previous works on over-parameterization.
As before, they think of over-parameterization as the ratio \(\gamma = \frac{p}{n}\), but they must also contend with the ratios \(\alpha = \frac{p}{N}\) (the fraction of dimensions preserved by PCA) and \(\rho = \frac{n}{N}\) (the ratio of samples to original dimension).</p>
<p>Like <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>, they consider what happens when \(N, p, n \to \infty\) and the ratios remain fixed.
Like BLLT19, their results study how over-parameterization is affected as the eigenvalues of \(\Sigma\) change.
In Section 2, they focus on eigenvalues \(\lambda_1, \dots, \lambda_N\) that decay predictably at a polynomial rate.
Theorems 1 and 2/3 characterize what happens to the expected error in the under-parameterized (\(\gamma \leq 1\)) and over-parameterized (\(\gamma > 1\)) regimes, respectively.</p>
<ul>
<li>Theorem 1 shows that the shape of the “classical” regime error curve is preserved in the under-parameterized regime: the error decreases as \(\alpha\) increases for fixed \(\rho\) up to a point, after which it increases until \(\alpha = \rho\) (equivalently, \(p = n\)).</li>
<li>Theorem 2 shows that the expected error in the interpolation regime \(p > n\) converges to some fixed risk quantity, which can be determined by evaluating an integral and solving for some quantity.</li>
<li>Theorem 3 shows that for any polynomial rate of decay of the eigenvalues, double-descent will occur and the best interpolating prediction rule will perform better than the best “classical” prediction rule.
In the noisy setting, the best interpolating prediction rule will only outperform the best classical rule in the event that the rate of decay is no faster than \(\frac{1}{i}\).</li>
</ul>
<p>To recap, the optimal performance for PCR is obtained in the over-parameterized regime (with \(p > n\)) if and only if eigenvalues \(\lambda_1, \dots, \lambda_N\) decay slowly; rapid decay leads to optimality in the classical regime.
This echoes the results of BLLT19, which shows that too rapid a decay in eigenvalues causes poor performance in the over-parameterized regime (very-much-not-benign overfitting).
However, BLLT19 also requires that the rate of decay not be too slow, which is a non-issue in this regime.</p>
<p>One of the nice things about this paper, which will be expanded on in the weeks to come, is that it separates the number of parameters \(p\) from the dimension \(N\).
Talking about over-parameterization in linear regression is often awkward because the two quantities are coupled, and we are forced to ask whether favorable behavior in the over-parameterized regime is caused by the high dimension or the high parameter count.
We’ll further examine models with separate dimensions and parameter counts when we study random feature models.</p>

<h1>How many neurons are needed to approximate smooth functions? A summary of our COLT 2021 paper (2021-08-15)</h1>

<p>In the past few weeks, I’ve written several summaries of others’ work on machine learning theory.
For the first time on this blog, I’ll discuss a paper I wrote, which was a collaboration with my advisors, <a href="http://www.cs.columbia.edu/~rocco/" target="_blank">Rocco Servedio</a> and <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>, and another Columbia PhD student, <a href="http://www.cs.columbia.edu/~emvlatakis/" target="_blank">Manolis Vlatakis-Gkaragkounis</a>.
It will be presented this week at <a href="http://learningtheory.org/colt2021/" target="_blank">COLT (Conference on Learning Theory) 2021</a>, which is happening in-person in Boulder, Colorado.
I’ll be there to discuss the paper and learn more about other work in ML theory.
(Hopefully, I’ll put up another blog post after about what I learned from my first conference.)</p>
<p>The paper centers on a question about neural network approximability; namely, how wide does a shallow neural network need to be to closely approximate certain kinds of “nice” functions?
This post discusses what we prove in the paper, how it compares to previous work, why anyone might care about this result, and why our claims are true.
The post is not mathematically rigorous, and it gives only a high-level idea about why our proofs work, focusing more on pretty pictures and intuition than the nuts and bolts of the argument.</p>
<p>If this interests you, you can check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> to learn more about the ins and outs of our work.
There are also two talks—a 90-second teaser and a 15-minute full talk—and a comment thread available on the <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">COLT website</a>.
This blog post somewhat mirrors the longer talk, but the post is a little more informal and a little more in-depth.</p>
<p>On a personal level, this is my first published computer science paper, and the first paper where I consider myself the primary contributor to all parts of the results.
I’d love to hear what you think about this—questions, feedback, possible next steps, rants, anything.</p>
<h2 id="i-whats-this-paper-about">I. What’s this paper about?</h2>
<h3 id="a-broad-background-on-neural-nets-and-deep-learning">A. Broad background on neural nets and deep learning</h3>
<p>As I discuss in the <a href="/2021/07/04/candidacy-overview.html" target="_blank">overview post for my series on over-parameterized ML models</a>, the practical success of deep learning is poorly understood from a mathematical perspective.
Trained neural networks exhibit incredible performance on tasks like image recognition, text generation, and protein folding analysis, but there is no comprehensive theory of why their performance is so good.
I often think about three different kinds of questions about neural network performance that need to be answered.
I’ll discuss them briefly below, even though only the first question (approximation) is relevant to the paper at hand.</p>
<ol>
<li>
<p><strong>Approximation:</strong> A neural network is a type of mathematical function that can be represented as a hierarchical arrangement of artificial neurons, each of which takes as input the output of previous neurons, combines them together, and returns a new signal. These neurons are typically arranged in <em>layers</em>, where the number of neurons per layer is referred to as the <em>width</em> and the number of layers is the <em>depth</em>.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn.jpeg" alt="" /></p>
<p>Mathematically, each neuron is a function of the outputs of neurons in a previous layer. If we let \(x_1,x_2, \dots, x_r \in \mathbb{R}\) be the outputs of the \(L\)th layer, then we can define a neuron in the \((L+1)\)th layer as \(\sigma(b + \sum_{i=1}^r w_i x_i)\) where \(b \in \mathbb{R}\) is a <em>bias</em>, \(w \in \mathbb{R}^r\) is a weight vector, and \(\sigma: \mathbb{R} \to \mathbb{R}\) is a nonlinear <em>activation function</em>.
If the parameters \(w\) and \(b\) are carefully selected for every neuron, then many layers of these neurons allow for the representation of complex prediction rules.</p>
<p>For instance, if I wanted a neural network to distinguish photos of cats from dogs, the neural network would represent a function mapping the pixels from the input image (which can be viewed as a vector) to a number that is 1 if the image contains a dog and -1 if the image has a cat. Typically, each neuron will correspond to some kind of visual signal, arranged hierarchically based on the complexity of the signal. For instance, a low-level neuron might detect whether a region of the image contains parallel lines. A mid-level neuron may correspond to a certain kind of fur texture, and a high-level neuron could identify whether the ears are a certain shape.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn-cat.jpeg" alt="" /></p>
<p>This opens up questions about the expressive properties of neural networks: What kinds of functions can they represent and what kinds can’t they? Does there have to be some kind of “niceness” property of the “pixels to cat” map in order for it to be expressed by a neural network? And how large does the neural network need to be in order to express some kind of function? How does increasing the width increase the expressive powers of the network? How about the depth?</p>
<p><em>This paper asks questions like these about a certain family of shallow neural networks. We focus on abstract mathematical functions—there will be no cats or dogs here—but we believe that this kind of work will better help us understand why neural networks work as well as they do.</em></p>
</li>
<li>
<p><strong>Optimization:</strong> Just because there exists a neural network that can represent the prediction rule you want doesn’t mean it’s possible to algorithmically find that function. The \(w\) and \(b\) parameters for each neuron cannot be feasibly hard-coded by a programmer due to the complexity of these kinds of functions. Therefore, we instead <em>learn</em> the parameters by making use of training data.</p>
<p>To do so, a neural network is initialized with random parameter choices. Then, given \(n\) <em>training samples</em> (in our case, labeled images of cats and dogs), the network tunes the parameters in order to come up with a function that predicts correctly on all of the samples. This procedure involves using an optimization algorithm like <em>gradient descent</em> (GD) or <em>stochastic gradient descent</em> (SGD) to tune a good collection of parameters.</p>
<p>However, there’s no guarantee that such an algorithm will be able to find the right parameter settings.
GD and SGD work great in practice, but they’re only guaranteed to work for a small subset of optimization problems, such as <em>convex</em> problems.
The training loss of neural networks is non-convex and isn’t one of the problems that can be provably solved with GD or SGD; thus, there’s no guarantee of convergence here.</p>
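<p><em>As a toy illustration of this training loop (my own sketch, not taken from any paper), here is hand-rolled gradient descent on the squared loss of a small two-layer ReLU network; nothing guarantees that the non-convex loss reaches zero, but it typically decreases:</em></p>

```python
import numpy as np

# Toy setup: n = 8 points in d = 3 dimensions with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = np.sign(rng.standard_normal(8))

# Randomly initialize a width-16 two-layer ReLU network.
k = 16
W = rng.standard_normal((k, 3))
b = rng.standard_normal(k)
u = rng.standard_normal(k) / k

def loss(W, b, u):
    act = np.maximum(X @ W.T + b, 0)    # hidden-layer ReLU activations
    return np.mean((act @ u - y) ** 2)  # squared training loss

initial = loss(W, b, u)
lr = 0.02
for _ in range(500):
    pre = X @ W.T + b
    act = np.maximum(pre, 0)
    err = act @ u - y
    # Backpropagation by hand; the ReLU "derivative" is the indicator pre > 0.
    gu = 2 * act.T @ err / len(y)
    gpre = 2 * np.outer(err, u) * (pre > 0) / len(y)
    W -= lr * (gpre.T @ X)
    b -= lr * gpre.sum(axis=0)
    u -= lr * gu
final = loss(W, b, u)  # typically much smaller than `initial`
```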
<p><em>There’s lots of interesting work on optimization, but I don’t really go into it in this blog.</em></p>
</li>
<li>
<p><strong>Generalization:</strong> I’ll be brief about this, since I discuss it a lot more in <a href="/2021/07/04/candidacy-overview.html" target="_blank">my series on over-parameterized ML models</a>. Essentially, it’s one thing to come up with a function that can correctly predict the labels of fixed training samples, but it’s another entirely to expect the prediction rule to <em>generalize</em> to new data that hasn’t been seen before.</p>
<p>The ML theory literature has studied the problem of generalization extensively, but most of the theory about this focuses on simple settings, where the number of parameters \(p\) is much smaller than the number of samples \(n\). Neural networks often live in the opposite regime; these complex and hierarchical functions often have \(p \gg n\), which means that classical statistical approaches to generalization don’t predict that neural networks will perform well.</p>
<p><em>Many papers have tried to explain why over-parameterized models exceed expectations in practice, and I discuss some of those in my other series. But again, this paper does not go into this.</em></p>
</li>
</ol>
<h3 id="b-more-specific-context-on-approximation">B. More specific context on approximation</h3>
<p>As mentioned above, this paper (and hence this post) focuses on the first question of approximation. In particular, it discusses the representational power of a certain family of shallow neural networks. (Typically, “shallow” means depth-2—or one-hidden layer—and “deep” means any networks of depth 3 or more.)</p>
<p>There’s a well-known result about depth-2 networks that we build on: The <em>Universal Approximation Theorem</em>, which states that for any continuous function \(f\), there exists some depth-2 network \(g\) that closely approximates \(f\). (We’ll define “closely approximates” later on.)
Three variants of this result were proved in 1989 by <a href="https://www.sciencedirect.com/science/article/abs/pii/0893608089900038" target="_blank">three</a> <a href="https://www.semanticscholar.org/paper/Multilayer-feedforward-networks-are-universal-Hornik-Stinchcombe/f22f6972e66bdd2e769fa64b0df0a13063c0c101" target="_blank">different</a> <a href="https://link.springer.com/article/10.1007/BF02551274" target="_blank">papers</a>.
Here’s a <a href="http://neuralnetworksanddeeplearning.com/chap4.html" target="_blank">blog post</a> that gives a nice explanation of why these universal approximation results are true.</p>
<p>At first glance, it seems like this would close the question of approximation entirely; if a depth-2 neural network can express any kind of function, then there would be no need to question whether some networks have more approximation powers than others. However, the catch is that the Universal Approximation Theorem does not guarantee that \(g\) will be of a reasonable size; \(g\) could be an arbitrarily wide neural network, which obviously is a no-go in the real world where neural networks actually need to be computed and stored.</p>
<p>As a result, many follow-up papers have focused on the question about which kinds of functions can be <em>efficiently</em> approximated by certain neural networks and which ones cannot. By “efficient,” we mean that we want to show that a function can be approximated by a neural network with a size polynomial in the relevant parameters (the complexity of the function, the desired accuracy, the dimension of the inputs). We specifically <em>do not</em> want a function that requires size exponential in any of these quantities.</p>
<p><em>Depth-separation</em> is an area of study that has focused on studying the limitations of shallow networks compared to deep networks.</p>
<ul>
<li>A <a href="http://proceedings.mlr.press/v49/telgarsky16.html" target="_blank">2016 paper by Telgarsky</a> shows that there exist some very “bumpy” triangular functions that can be approximated by polynomial-width neural networks of depth \(O(k^3)\), but which require exponential width in order to be approximated by networks of depth \(O(k)\).</li>
<li>Papers by <a href="http://proceedings.mlr.press/v49/eldan16.html" target="_blank">Eldan and Shamir (2016)</a>, <a href="http://proceedings.mlr.press/v70/safran17a.html" target="_blank">Safran and Shamir (2017)</a>, and <a href="http://proceedings.mlr.press/v65/daniely17a.html" target="_blank">Daniely (2017)</a> exhibit functions that separate depth-2 from depth-3. That is, the functions can be approximated by polynomial-size depth-3 networks, but they require exponential width in order to be approximated by depth-2 networks.</li>
</ul>
<p>One thing that these papers have in common is that they all require one of two things.
Either (1) the function is a very “bumpy” one that is highly oscillatory, or (2) the depth-2 networks can partially approximate the function, but cannot approximate it to an extremely high degree of accuracy. A <a href="https://arxiv.org/abs/1904.06984" target="_blank">2019 paper by Safran, Eldan, and Shamir</a> noticed this and asked whether there exist “smooth” functions that have separation between depth-2 and depth-3. This question was inspirational for our work, which poses questions about the limitations of certain kinds of 2-layer neural networks.</p>
<h3 id="c-random-bottom-layer-relu-networks">C. Random bottom-layer ReLU networks</h3>
<p>We actually consider a slightly more restrictive model than depth-2 neural networks. We focus on <em>two-layer random bottom-layer (RBL) ReLU neural networks</em>. Let’s break that down into pieces:</p>
<ul>
<li>
<p>“two layer” means that the neural network has a single hidden layer and can be represented by the following function, for parameters \(u \in \mathbb{R}^r, b \in \mathbb{R}^{r}, w \in \mathbb{R}^{r \times d}\):</p>
\[g(x) = \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>\(r\) is the width of the network and \(d\) is the input dimension.</p>
</li>
<li>“random bottom-layer” means that \(w\) and \(b\) are randomly chosen and then fixed. That means that when trying to approximate a function, we can only tune \(u\). This is also called the <em>random feature model</em> in other papers.</li>
<li>“ReLU” refers to the <em>rectified linear unit</em> activation function, \(\sigma(z) = \max(0, z)\). This is a popular activation function in deep learning.</li>
</ul>
<p>The following graphic visually summarizes the neural network:</p>
<p><img src="/assets/images/2021-08-15-hssv21/rbl.jpeg" alt="" /></p>
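<p><em>In code (a minimal sketch of the model, not of our proofs), an RBL network is just a linear model over \(r\) random ReLU features, so tuning \(u\) reduces to ordinary least squares; the sampling distributions for \(w\) and \(b\) below are arbitrary choices for illustration:</em></p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n = 2, 200, 500

# "Random bottom layer": w and b are sampled once and then frozen.
W = rng.standard_normal((r, d))
b = rng.uniform(-2.0, 2.0, size=r)

def features(X):
    return np.maximum(X @ W.T + b, 0)  # the r random ReLU features

# Target function on [-1, 1]^d (a smooth ridge function, for illustration).
X = rng.uniform(-1, 1, size=(n, d))
f = np.cos(np.pi * X.sum(axis=1))

# Tuning the top layer u is just least squares over the random features.
u, *_ = np.linalg.lstsq(features(X), f, rcond=None)
rmse = np.sqrt(np.mean((features(X) @ u - f) ** 2))  # small training error
```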
<p>Why do we focus on this family of neural networks?</p>
<ol>
<li>Any positive approximation results about this model also apply to arbitrary networks of depth 2. That is, if we want to show that a function can be efficiently approximated by a depth-2 ReLU network, it suffices to show that it can be efficiently approximated by a depth-2 <em>RBL</em> ReLU network. (This does not hold in the other direction; there exist functions that can be efficiently approximated by depth-2 ReLU networks that <em>cannot</em> be approximated by depth-2 RBL ReLU nets.)</li>
<li>According to papers like <a href="https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html" target="_blank">Rahimi and Recht (2008)</a>, kernel functions can be approximated with random feature models. This means that our result can also be used to comment on the approximation powers of kernels, which Daniel discusses <a href="https://www.cs.columbia.edu/~djhsu/papers/dimension-argument.pdf" target="_blank">here</a>.</li>
<li>Recent research on the <em>neural tangent kernel (NTK)</em> studies the optimization and generalization powers of randomly-initialized neural networks that do not stray far from their initialization during training. The question of optimizing two-layer neural networks in this regime is then similar to the question of optimizing linear combinations of random features. Thus, the approximation properties proven here carry over to that kind of analysis. Check out papers by <a href="https://arxiv.org/abs/1806.07572" target="_blank">Jacot, Gabriel, and Hongler (2018)</a> and <a href="https://arxiv.org/abs/2002.04486" target="_blank">Chizat and Bach (2020)</a> to learn more about this model.</li>
</ol>
<p>Now, we jump into the specifics of our paper’s claims. Later, we’ll give an overview of how those claims are proven and discuss some broader implications of these results.</p>
<h2 id="ii-what-are-the-specific-claims">II. What are the specific claims?</h2>
<p>The key results in our paper are corresponding upper and lower bounds:</p>
<ul>
<li>If the function \(f: \mathbb{R}^d \to \mathbb{R}\) is either “smooth” or low-dimensional, then it’s “easy” to approximate \(f\) with some RBL ReLU network \(g\). (The upper bound.)</li>
<li>If \(f\) is both “bumpy” and high-dimensional, then it’s “hard” to approximate \(f\) with some RBL ReLU net \(g\). (The lower bound.)</li>
</ul>
<p>All of this is formalized in the next few paragraphs.</p>
<h3 id="a-notation">A. Notation</h3>
<p><strong>What do we mean by a “smooth” or “bumpy” function?</strong> As discussed earlier, works on depth separation frequently exhibit functions that require exponential width to be approximated by depth-2 neural networks. However, these functions are highly oscillatory and hence very steep. We quantify this smoothness by using the Lipschitz constant of a function \(f\). \(f\) has Lipschitz constant \(L\) if for all \(x, y \in \mathbb{R}^d\), we have \(\lvert f(x) - f(y)\rvert \leq L \|x - y\|_2\). This bounds the slope of the function and prevents \(f\) from rapidly changing value. Therefore, a function can only be high-frequency (and bounce back and forth rapidly between large and small values) if it has a large Lipschitz constant.</p>
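<p><em>For intuition (my own example, not from the paper), a Lipschitz constant can be lower-bounded numerically by maximizing difference quotients over a fine grid; for \(t \mapsto \sqrt{2}\cos(\pi k t)\) the estimate approaches \(\sqrt{2}\pi k\), so higher-frequency functions have larger constants:</em></p>

```python
import numpy as np

def lipschitz_estimate(f, a=-1.0, b=1.0, m=200_001):
    """Lower-bound the Lipschitz constant of f on [a, b] via grid difference quotients."""
    t = np.linspace(a, b, m)
    v = f(t)
    return np.max(np.abs(np.diff(v) / np.diff(t)))

k = 3
est = lipschitz_estimate(lambda t: np.sqrt(2) * np.cos(np.pi * k * t))
# est is very close to sqrt(2) * pi * k; tripling k triples the constant
```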
<p>We also quantify smoothness using the Sobolev class of a function in the appendix of our paper. We provide very similar bounds for this case, but we don’t focus on them in this post.</p>
<p><strong>What does it mean to be easy to approximate?</strong> We consider an \(L_2\) notion of approximation over the solid cube \([-1, 1]^d\). That is, we say that \(g\) <em>\(\epsilon\)-approximates</em> \(f\) if</p>
\[\|g - f\|_2 = \sqrt{\mathbb{E}_{x \sim \text{Unif}([-1, 1]^d)}[(g(x) - f(x))^2]} \leq \epsilon.\]
<p>Notably, this is a <em>weaker</em> notion of approximation than the \(L_\infty\) bounds that are used in other papers. If \(f\) can be \(L_\infty\)-approximated, then it can also be \(L_2\)-approximated.</p>
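<p><em>Since the \(L_2\) distance is an expectation, it is easy to estimate by sampling (a quick sanity check of my own, not from the paper):</em></p>

```python
import numpy as np

def l2_distance(f, g, d, n=200_000, seed=0):
    """Monte Carlo estimate of ||f - g||_2 over x ~ Unif([-1, 1]^d)."""
    x = np.random.default_rng(seed).uniform(-1, 1, size=(n, d))
    return np.sqrt(np.mean((f(x) - g(x)) ** 2))

# Sanity check: the zero function epsilon-approximates x -> x_1 only for
# epsilon >= sqrt(E[x_1^2]) = sqrt(1/3), which is about 0.577.
dist = l2_distance(lambda x: x[:, 0], lambda x: np.zeros(len(x)), d=3)
```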
<p><strong>What does it mean to be easy to approximate <em>with an RBL ReLU function</em>?</strong>
Since we let \(g\) be an RBL ReLU network that has random weights, we need to incorporate that randomness into our definition of approximation. To do so, we say that we can approximate \(f\) with an RBL network of width \(r\) if with probability \(0.5\), there exists some \(u \in \mathbb{R}^r\) such that the RBL neural network \(g\) with parameters \(w, b, u\) can \(\epsilon\)-approximate \(f\).
The probability is over random parameters \(w\) and \(b\) drawn from some distribution \(\mathcal{D}\).
We let the <em>minimum width</em> needed to approximate \(f\) with respect to \(\epsilon\) and \(\mathcal{D}\) denote the smallest such \(r\).</p>
<p>(The paper also includes \(\delta\), which corresponds to the probability of success. For simplicity, we leave it out and take \(\delta = 0.5\).)</p>
<p>We’re now ready to give our two main theorems.</p>
<h3 id="b-the-theorems">B. The theorems</h3>
<p><em><strong>Theorem 1 [Upper Bound]:</strong> For any \(L\), \(d\), \(\epsilon\), there exists a symmetric parameter distribution \(\mathcal{D}\) such that the minimum width needed to approximate any \(L\)-Lipschitz function \(f: \mathbb{R}^d \to \mathbb{R}\) is at most</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{O(1)}.\]
<p>The term in this bound can also be written as</p>
\[\exp\left(O\left(\min\left(d \log\left(\frac{L^2}{\epsilon^2 d}+ 2\right), \frac{L^2}{\epsilon^2} \log\left(\frac{d\epsilon^2}{L^2} + 2\right)\right)\right)\right).\]
<p><em><strong>Theorem 2 [Lower Bound]:</strong> For any \(L\), \(d\), \(\epsilon\) and any symmetric parameter distribution \(\mathcal{D}\), there exists an \(L\)-Lipschitz function \(f\) whose minimum width is at least</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{\Omega(1)}.\]
<p>Thus, the key take-away is that our upper and lower bounds are matching up to a polynomial factor:</p>
<ul>
<li>When the dimension \(d\) is constant, both terms are polynomial in \(\frac{L}{\epsilon}\), which means that any \(L\)-Lipschitz \(f\) can be efficiently \(\epsilon\)-approximated.</li>
<li>When the smoothness-to-accuracy ratio \(\frac{L}{\epsilon}\) is constant, both terms are polynomial in \(d\), so \(f\) is again efficiently approximable.</li>
<li>When \(d = \Theta(L / \epsilon)\), both terms are exponential in \(d\), so efficient approximation is impossible in general.</li>
</ul>
<p>These back up our high-level claim from before: efficient approximation of \(f\) with RBL ReLU networks is possible if and only if \(f\) is either smooth or low-dimensional.</p>
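<p><em>To see the three regimes numerically (an illustration of the bound’s scale only), we can evaluate \({d + L^2/\epsilon^2 \choose d}\) directly:</em></p>

```python
from math import comb

def width_scale(d, ratio_sq):
    """binom(d + L^2/eps^2, d): the common scale of both width bounds."""
    return comb(d + ratio_sq, d)

print(width_scale(2, 100))  # constant d: polynomial in L/eps
print(width_scale(100, 2))  # constant L/eps: polynomial in d
print(width_scale(20, 20))  # both growing together: exponentially large
```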
<p>Before explaining the proofs, we’ll give an overview about why these results are significant compared to previous works.</p>
<h3 id="c-comparison-to-previous-results">C. Comparison to previous results</h3>
<p>The approximation powers of shallow neural networks have been widely studied in terms of \(d\), \(\epsilon\), and smoothness measures (including Lipschitzness).
Our results are novel because they’re the first (as far as we know) to look closely at the interplay between these values and obtain nearly tight upper and lower bounds.</p>
<p>Papers that prove upper bounds tend to focus on either the low-dimensional case or the smooth case.</p>
<ul>
<li><a href="http://proceedings.mlr.press/v32/andoni14.html" target="_blank">Andoni, Panigrahy, Valiant, and Zhang (2014)</a> show that degree-\(k\) polynomials can be approximated with RBL networks of width \(d^{O(k)}\). Because \(L\)-Lipschitz functions can be approximated by polynomials of degree \(O(L^2 / \epsilon^2)\), one can equivalently say that networks of width \(d^{O(L^2 / \epsilon^2)}\) are sufficient. This works great when \(L /\epsilon\) is constant, but the bounds are bad in the “bumpy” case where the ratio is large.</li>
<li>On the other hand, <a href="https://jmlr.org/papers/v18/14-546.html" target="_blank">Bach (2017)</a> shows \((L / \epsilon)^{O(d)}\)-width approximability results for \(L_\infty\). This is fantastic when \(d\) is small, but not in the high-dimensional case. (This \(L_\infty\) part is more impressive than our \(L_2\) bounds, which means that we don’t strictly improve upon this result in our domain.)</li>
</ul>
<p>Our results are the best of both worlds, since they trade off \(d\) versus \(L /\epsilon\). They also cannot be substantially improved upon because they are nearly tight with our lower bounds.</p>
<p>Our lower bounds are novel because they handle a broad range of choices for \(L/ \epsilon\) and \(d\).</p>
<ul>
<li>The limitations of 2-layer neural networks were studied in the 1990s by <a href="https://www.sciencedirect.com/science/article/pii/S0021904598933044" target="_blank">Maiorov (1999)</a>, and he proves bounds that look more impressive than ours at first glance, since he argues that \(\exp(\Omega(d))\) width is necessary for smooth functions. (He actually looks at Sobolev smooth functions, but the analysis could also be done for Lipschitz functions.) However, these bounds don’t necessarily hold for all choices of \(\epsilon\). Therefore, they don’t say anything about the regime where \(\frac{L}{\epsilon}\) is constant, where it’s impossible to prove a lower bound that’s exponential in \(d\).</li>
<li><a href="https://arxiv.org/abs/1904.00687" target="_blank">Yehudai and Shamir (2019)</a> show that \(\exp(d)\) width is necessary to approximate simple ReLU functions with RBL neural networks. However, their results require that the ReLU be a very steep one, with Lipschitz constant scaling polynomially with \(d\). Hence, this result also only covers the regime where \(\frac{L}{\epsilon}\) is large. Our bounds say something about functions of all levels of smoothness.</li>
</ul>
<p>Now, we’ll break down our argument on a high level, with the help of some pretty pictures.</p>
<h2 id="iii-why-are-they-true">III. Why are they true?</h2>
<p>Before giving the proofs, I’m going to restate the theorems in terms of a combinatorial quantity, \(Q_{k,d}\), which corresponds to the number of \(d\)-dimensional integer lattice points with \(L_2\) norm at most \(k\). That is,</p>
\[Q_{k,d} = \lvert\{K \in \mathbb{Z}^d: \|K\|_2 \leq k \} \rvert.\]
<p>As an example, \(Q_{4,2}\) can be visualized as the number of purple points in the below image:</p>
<p><img src="/assets/images/2021-08-15-hssv21/qkd.jpeg" alt="" width="50%" /></p>
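<p><em>\(Q_{k,d}\) is easy to compute by brute force for small \(k\) and \(d\) (my own snippet, matching the figure above):</em></p>

```python
from itertools import product

def Q(k, d):
    """Count the integer lattice points in Z^d with L2 norm at most k."""
    rng = range(-k, k + 1)
    return sum(1 for p in product(rng, repeat=d) if sum(c * c for c in p) <= k * k)

print(Q(4, 2))  # 49: the purple points in the figure above
print(Q(1, 3))  # 7: the origin plus +/- each standard basis vector
```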
<p>We can equivalently write the upper and lower bounds on the minimum width as \(Q_{2L/\epsilon, d}^{O(1)}\) and \(\Omega(Q_{L/18\epsilon, d})\) respectively. This combinatorial quantity turns out to be important for the proofs of both bounds.</p>
<p>A key building block for both proofs is an orthonormal basis. I define orthonormal bases in <a href="/2021/07/16/orthogonality.html" target="_blank">a different blog post</a> and explain why they’re useful there. If you aren’t familiar, check that one out. We use the following family of sinusoidal functions as a basis for the \(L_2\) Hilbert space on \([-1, 1]^d\) throughout:</p>
\[\mathcal{T} \approx \{T_K: x \mapsto \sqrt{2}\cos(\pi\langle K, x\rangle): K \in \mathbb{Z}^d\}.\]
<p><em>Note: This is an over-simplification of the family of functions to be easier to write down. Actually, half of the functions need to be sines instead of cosines. However, it’s a bit of a pain to formalize and you can see how it’s written up in the paper. I’m using the \(\approx\) symbol above because this is “morally” the same as the true family of functions, but a lot easier to write down.</em></p>
<p>This family of functions has several properties that are very useful for us:</p>
<ul>
<li>
<p>The functions are orthonormal with respect to the inner product of the \(L_2\) space over the uniform distribution on \([-1, 1]^d\). That is, for all \(K, K' \in \mathbb{Z}^d\),</p>
\[\langle T_K, T_{K'}\rangle = \mathbb{E}_{x}[T_K(x)T_{K'}(x)] = \begin{cases}1 & K = K' \\ 0 & \text{otherwise.} \\ \end{cases}\]
</li>
<li>The functions span the Hilbert space \(L_2([-1,1]^d)\). Put together with the orthonormality, \(\mathcal{T}\) is an orthonormal basis for \(L_2([-1,1]^d)\).</li>
<li>The Lipschitz constant of each of these functions is bounded. Specifically, the Lipschitz constant of \(T_K\) is at most \(\sqrt{2} \pi \|K\|_2\).</li>
<li>The derivative of each function in \(\mathcal{T}\) is also a function that’s contained in \(\mathcal{T}\) (if you include the sines too).</li>
<li>All elements of \(\mathcal{T}\) are ridge functions. That is, they can each be written as \(T_K(x) = \phi(\langle v, x \rangle)\) for some \(\phi:\mathbb{R}\to \mathbb{R}\). The function depends only on one direction in \(\mathbb{R}^d\) and is intrinsically one-dimensional. This will be important for the upper bound proof.</li>
<li>If we let \(\mathcal{T}_k = \{T_K \in \mathcal{T}: \|K\|_2 \leq k\}\), then \(\lvert\mathcal{T}_k\rvert = Q_{k,d}\).</li>
</ul>
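<p><em>The first property is easy to check numerically in one dimension (my own sanity check; a fine-grid average stands in for the expectation):</em></p>

```python
import numpy as np

# Check orthonormality in the d = 1 slice of the basis, where
# T_k(x) = sqrt(2) cos(pi k x) and the inner product is an
# expectation over Unif([-1, 1]), approximated here by a dense grid.
x = np.linspace(-1, 1, 400_001)

def T(k):
    return np.sqrt(2) * np.cos(np.pi * k * x)

def inner(u, v):
    return np.mean(u * v)  # grid average of u(x) v(x) over [-1, 1]

same = inner(T(3), T(3))       # close to 1
different = inner(T(3), T(5))  # close to 0
```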
<p>Now, we’ll use this basis to discuss our proof of the upper bound.</p>
<h3 id="a-upper-bound-argument">A. Upper bound argument</h3>
<p>The proof of the upper bound boils down to two steps. First, we show that the function \(f\) can be \(\frac{\epsilon}{2}\)-approximated by a low-frequency trigonometric polynomial (that is, a linear combination of sines and cosines in \(\mathcal{T}_k\) for some \(k = O(L / \epsilon)\)). Then, we show that this trigonometric polynomial can be \(\frac{\epsilon}{2}\)-approximated in turn by an RBL ReLU network.</p>
<p>For the first step—which corresponds to Lemma 7 of the paper—we apply the fact that \(f\) can be written as a linear combination of sinusoidal basis elements. That is,</p>
\[f(x) = \sum_{K \in \mathbb{Z}^d} \alpha_K T_K(x),\]
<p>where \(\alpha_K = \langle f, T_K\rangle\).
This means that \(f\) is a combination of sinusoidal functions pointing in various directions of various frequencies.
We show that for some \(k = O(L / \epsilon)\),</p>
\[P(x) := \sum_{K \in \mathbb{Z}^d, \|K\|_2 \leq k} \alpha_K T_K(x)\]
<p>satisfies \(\|P - f\|_2 \leq \frac{\epsilon}{2}\).
To do so, we show that all \(\alpha_K\) terms for \(\|K\|_2 > k\) are very close to zero in the proof of Lemma 8.
The argument centers on the idea that if \(\alpha_K\) is large for large \(\|K\|_2\), then \(f\) is heavily influenced by a high-frequency sinusoidal function, which means that \(\|\nabla f(x)\|\) must be large at some \(x\).
However, \(\|\nabla f(x)\| \leq L\) by our smoothness assumption on \(f\), so overly large values of \(\alpha_K\) would contradict this.</p>
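<p><em>In symbols, a rough version of this Parseval-style calculation looks as follows (my paraphrase, with looser constants than the paper’s; the first identity uses the fact, noted above, that differentiating the basis elements keeps you inside the sine/cosine family):</em></p>

```latex
\mathbb{E}_x\big[\|\nabla f(x)\|_2^2\big]
  = \pi^2 \sum_{K \in \mathbb{Z}^d} \|K\|_2^2\, \alpha_K^2 \;\leq\; L^2
\quad\Longrightarrow\quad
\sum_{\|K\|_2 > k} \alpha_K^2
  \;\leq\; \frac{1}{\pi^2 k^2} \sum_{K} \pi^2 \|K\|_2^2\, \alpha_K^2
  \;\leq\; \frac{L^2}{\pi^2 k^2}.
```

<p><em>Taking \(k \geq 2L/(\pi \epsilon)\) then makes the tail mass at most \((\epsilon/2)^2\), which is why \(k = O(L/\epsilon)\) suffices.</em></p>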
<p>For the second part, we show that \(P\) can be approximated by a linear combination of random ReLUs.
To do so, we express \(P\) as a <em>superposition</em> of random ReLUs, that is, as an expectation over them.
We show that there exists some parameter distribution \(\mathcal{D}\) (which depends on \(d, L, \epsilon\), but not on \(f\)) and some bounded function \(h(b, w)\) (which <em>can</em> depend on \(f\)) such that</p>
\[P(x) = \mathbb{E}_{(b, w) \sim \mathcal{D}}[h(b, w)\sigma(\langle w, x\rangle + b)].\]
<p>However, it’s not immediately clear how one could find \(h\) and why one would know that \(h\) is bounded.
To find \(h\), we take advantage of the fact that \(P\) is a linear combination of trigonometric sinusoidal ridge functions by showing that every \(T_K\) can be expressed as a superposition of ReLUs and combining those to get \(h\).
The “ridge” part is key here; because each \(T_K\) is effectively one-dimensional, it’s possible to think of it being approximated by ReLUs, as visualized below:</p>
<p><img src="/assets/images/2021-08-15-hssv21/cos.jpeg" alt="" /></p>
<p>Each function \(T_K\) can be closely approximated by a piecewise-linear ridge function, since it has bounded gradients and because it only depends on \(x\) through \(\langle K, x\rangle\).
Therefore, \(T_K\) can also be closely approximated by a linear combination of ReLUs, because those can easily approximate piecewise linear ridge functions.
This makes it possible to represent each \(T_K\) as a superposition of ReLUs, and hence \(P\) as well.</p>
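<p><em>Here is a small numerical version of that picture (an illustration, not the paper’s construction): any piecewise-linear interpolant of \(t \mapsto \sqrt{2}\cos(\pi t)\) is exactly a linear term plus one ReLU per interior knot, with each ReLU’s coefficient equal to the slope change at that knot:</em></p>

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0)

# Piecewise-linear interpolant of phi(t) = sqrt(2) cos(pi t) on [-1, 1].
knots = np.linspace(-1, 1, 41)
phi = np.sqrt(2) * np.cos(np.pi * knots)
slopes = np.diff(phi) / np.diff(knots)

def g(t):
    # Start from the first linear piece, then add one ReLU per interior
    # knot whose coefficient is the change in slope there.
    out = phi[0] + slopes[0] * (t - knots[0])
    for c, kn in zip(np.diff(slopes), knots[1:-1]):
        out = out + c * relu(t - kn)
    return out

t = np.linspace(-1, 1, 2001)
err = np.max(np.abs(g(t) - np.sqrt(2) * np.cos(np.pi * t)))  # small
```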
<p>Now, \(f\) is closely approximated by \(P\), and \(P\) can be written as a bounded superposition of ReLUs.
We want to show that \(P\) can be approximated by a linear combination of a <em>finite and bounded</em> number of random ReLUs, not an infinite superposition of them.
This last step requires sampling \(r\) sets of parameters \((b^{(i)}, w^{(i)}) \sim \mathcal{D}\) for \(i \in \{1, \dots, r\}\) and letting</p>
\[g(x) := \frac{1}{r} \sum_{i=1}^r h(b^{(i)}, w^{(i)}) \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>When \(r\) is large enough, \(g\) is a 2-layer RBL ReLU network that becomes a very close approximation to \(P\), which means it’s also a great approximation to \(f\).
Such a sufficiently large \(r\) can be quantified with the help of standard concentration bounds for Hilbert spaces.
This wraps up the upper bound.</p>
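<p><em>As a toy version of these last two steps (an illustration with a made-up distribution, not the paper’s \(\mathcal{D}\) or \(h\)): on \([0, 1]\), the smooth function \(t^2/2\) equals the superposition \(\mathbb{E}_{b \sim \mathrm{Unif}[0,1]}[\sigma(t - b)]\), and averaging \(r\) sampled ReLUs approximates it with error shrinking like \(1/\sqrt{r}\):</em></p>

```python
import numpy as np

# On [0, 1], t^2/2 = E_{b ~ Unif[0,1]}[relu(t - b)], since the integral
# of max(t - b, 0) over b in [0, 1] is t^2/2.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)

r = 20_000                     # number of sampled neurons
b = rng.uniform(0, 1, size=r)
g = np.maximum(t[:, None] - b, 0).mean(axis=1)  # width-r empirical average

err = np.max(np.abs(g - t ** 2 / 2))  # Monte Carlo error, roughly 1/sqrt(r)
```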
<h3 id="b-lower-bound-argument">B. Lower bound argument</h3>
<p>For the lower bounds, we want to show that for any bottom-layer parameters \((b^{(i)}, w^{(i)})\) for \(1 \leq i \leq r\), there exists some \(L\)-Lipschitz function \(f\) such that for any choice of top-layer \(u^{(1)}, \dots, u^{(r)}\):</p>
\[\sqrt{\mathbb{E}_x\left[\left(f(x) - \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)})\right)^2\right]} \geq \epsilon.\]
<p>This resembles a simpler linear algebra problem:
Fix any vectors \(v_1, \dots, v_r \in \mathbb{R}^N\).
\(\mathbb{R}^N\) has a standard orthonormal basis \(e_1, \dots, e_N\).
Under which circumstances is there some \(e_j\) that cannot be closely approximated by any linear combination of \(v_1, \dots, v_r\)?</p>
<p>It turns out that when \(N \gg r\) there can be no such approximation.
This follows by a simple dimensionality argument.
The span of \(v_1, \dots, v_r\) is a subspace of dimension at most \(r\).
Since \(r \ll N\), an \(r\)-dimensional subspace cannot be close to all \(N\) of the orthonormal vectors, which span a much higher-dimensional space and are mutually perpendicular.</p>
<p><img src="/assets/images/2021-08-15-hssv21/span.jpeg" alt="" /></p>
<p>For instance, the above image illustrates the claim for \(N = 3\) and \(r = 2\). While the span of \(v_1\) and \(v_2\) is close to \(e_1\) and \(e_2\), the vector \(e_3\) is far from that plane, and hence is inapproximable by linear combinations of the two.</p>
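<p><em>The counting behind this picture is easy to verify numerically (with generic random vectors, not the ReLU features themselves): the squared residuals of the \(e_j\) against \(\mathrm{span}(v_1, \dots, v_r)\) sum to \(N - r\), so some \(e_j\) has residual at least \(\sqrt{1 - r/N}\):</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 50, 5
V = rng.standard_normal((N, r))  # r arbitrary vectors in R^N

Q_mat, _ = np.linalg.qr(V)       # orthonormal basis for their span

# Row j of Q_mat holds the coordinates of the projection of e_j onto the
# span, so the squared residual of e_j is 1 - ||row j||^2.
res = np.sqrt(1 - (Q_mat ** 2).sum(axis=1))

# The squared residuals sum to N - r, so by pigeonhole the worst basis
# vector has residual at least sqrt(1 - r/N), about 0.95 here.
worst = res.max()
```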
<p>In our setting, we replace \(\mathbb{R}^N\) with the \(L_2\) Hilbert space over functions on \([-1, 1]^d\); \(v_1, \dots, v_r\) with \(x \mapsto \sigma(\langle w^{(1)}, x\rangle + b^{(1)}), \dots, x \mapsto \sigma(\langle w^{(r)}, x\rangle + b^{(r)})\); and \(\{e_1, \dots, e_N\}\) with \(\mathcal{T}_k\) for \(k = \Omega(L)\).
As long as \(Q_{k,d} \gg r\), then there is some \(O(\|K \|_2)\)-Lipschitz function \(T_K\) that can’t be approximated by linear combinations of ReLU features.
By the assumption on \(k\), \(T_K\) must be \(L\)-Lipschitz as well.</p>
<p>The dependence on \(\epsilon\) can be introduced by scaling \(T_K\) appropriately.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>To reiterate, our results show the capabilities and limitations of 2-layer random bottom-layer ReLU networks.
We show a careful interplay between the Lipschitzness of the function to approximate \(L\), the dimension \(d\), and the accuracy parameter \(\epsilon\).
Our bounds rely heavily on orthonormal functions.</p>
<p>Our results have some key limitations.</p>
<ul>
<li>Our upper bounds would be more impressive if they used the \(L_\infty\) notion of approximation, rather than \(L_2\). (Conversely, our lower bounds would be <em>less</em> impressive if they used \(L_\infty\) instead.)</li>
<li>The distribution over training parameters \(\mathcal{D}\) that we end up using for the upper bounds is contrived and depends on \(L, \epsilon, d\) (even if not on \(f\)).</li>
<li>Our bounds only apply when samples are drawn uniformly from \([-1, 1]^d\). (We believe our general approach will also work for the Gaussian probability measure, which we discuss at a high level in the appendix of our paper.)</li>
</ul>
<p>We hope that these limitations are addressed by future work.</p>
<p>Broadly, we think our paper fits into the literature on neural network approximation because it shows that the smoothness of a function is very relevant to its ability to be approximated by shallow neural networks.</p>
<ul>
<li>Our paper contributes to the question posed by <a href="https://arxiv.org/abs/1904.06984" target="_blank">SES19</a> (Are there any 1-Lipschitz functions that cannot be approximated efficiently by depth-2 but can by depth-3?) by showing that <em>all</em> 1-Lipschitz functions are approximable with respect to the \(L_2\) measure.</li>
<li>In addition, our results build on those of a recent paper by <a href="https://arxiv.org/abs/2102.00434" target="_blank">Malach, Yehudai, Shalev-Shwartz, and Shamir (2021)</a>, which suggests that the only functions that can be efficiently <em>learned</em> via gradient descent by deep networks are those that can be efficiently <em>approximated</em> by a shallow network. They show that the inefficient approximation of a function by depth-3 neural networks implies inefficient learning by neural networks of any depth; our results strengthen this to “inefficient approximation of a function by depth-<strong>2</strong> neural networks.”</li>
</ul>
<p>Thank you so much for reading this blog post! I’d love to hear about any thoughts or questions you may have. And if you’d like to learn more, check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> or <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">the talks</a>!</p>Clayton SanfordIn the past few weeks, I’ve written several summaries of others’ work on machine learning theory. For the first time on this blog, I’ll discuss a paper I wrote, which was a collaboration with my advisors, Rocco Servedio and Daniel Hsu, and another Columbia PhD student, Manolis Vlatakis-Gkaragkounis. It will be presented this week at COLT (Conference on Learning Theory) 2021, which is happening in-person in Boulder, Colorado. I’ll be there to discuss the paper and learn more about other work in ML theory. (Hopefully, I’ll put up another blog post after about what I learned from my first conference.)[OPML#5] BL20: Failures of model-dependent generalization bounds for least-norm interpolation2021-07-30T00:00:00+00:00http://blog.claytonsanford.com/2021/07/30/bl20<p><em>This is the fifth of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<!-- [BL20](https://arxiv.org/abs/2010.08479){:target="_blank"} [[OPML#5]](/2021/07/30/bl20.html){:target="_blank"} -->
<p>I really enjoyed reading this paper, <a href="https://arxiv.org/abs/2010.08479" target="_blank">“Failures of model-dependent generalization bounds for least-norm interpolation,”</a> by Bartlett and Long. (The names are familiar from <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>.)
It follows in the vein of papers like <a href="https://arxiv.org/abs/1611.03530" target="_blank">ZBHRV17</a> and <a href="https://arxiv.org/abs/1902.04742" target="_blank">NK19</a>, which demonstrate the limitations of classical generalization bounds.</p>
<p>This work differs from the double-descent papers that have been previously reviewed on this blog, like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a> <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>.
These papers argue that there exist better bounds on generalization error for over-parameterized linear regression than the ones typically suggested by classical approaches like VC-dimension and Rademacher complexity.
However, they don’t <em>prove</em> that there cannot be better “classical” generalization bounds; they just show that the well-known bounds are inferior to their proposed bounds.
On the other hand, this paper proves that a broad family of traditional generalization bounds are unable to explain the phenomenon of the success of interpolating methods.</p>
<p>The gist of the argument is that it’s not sufficient to look at the number of samples and the complexity of the hypothesis to explain the success of interpolating models.
Successful bounds must take into account more information about the data distribution.
Notably, the bounds in BHX19, BLLT19, MVSS19, and HMRT19 all rely on properties of the data distribution, like the eigenvalues of the covariance matrix and the amount of additive noise in each label.
The current paper (BL20) posits that such tight bounds are impossible without access to this kind of information.</p>
<p>In this post, I present the main theorem and give a very hazy idea about why it works.
Let’s first make the learning problem precise.</p>
<h2 id="learning-problem">Learning problem</h2>
<ul>
<li>We have labeled data \((x, y) \in \mathbb{R}^d \times \mathbb{R}\) drawn from some distribution \(P\).
<ul>
<li>They restrict \(P\) to give it nice mathematical properties. Specifically, the inputs \(x \in \mathbb{R}^d\) must be drawn from a Gaussian distribution and \((x, y)\) must have subgaussian tails. We’ll call these “nice” distributions.</li>
</ul>
</li>
<li>Let the <em>risk</em> of some prediction rule \(h: \mathbb{R}^d \to \mathbb{R}\) be \(R_P(h) = \mathbb{E}_{x, y}[(y - h(x))^2]\).</li>
<li>Let \(R_P^*\) be the minimum risk over all prediction rules \(h\).</li>
<li>The goal is to consider bounds on \(R_P(h) - R_P^*\), where \(h\) is a <em>least-norm interpolating</em> learning rule on \(n\) training samples.
<ul>
<li>i.e. \(h(x) = \langle x, \theta\rangle\) where \(\theta \in \mathbb{R}^d\) minimizes the least-squares error: \(\sum_{i=1}^n(\langle x_i, \theta\rangle - y_i)^2\). Ties are broken by choosing the \(\theta\) that minimizes \(\|\theta\|_2\). The interpolation regime occurs when the least-squares error is zero.</li>
</ul>
</li>
<li>We consider bounds \(\epsilon(h, n, \delta)\) such that \(R_P(h) - R_P^* \leq \epsilon(h, n, \delta)\) with probability \(1 - \delta\) over the \(n\) training samples from \(P\), where \(h\) is the least-norm interpolating rule.
<ul>
<li>Notably, these bounds cannot include any more information about the learning problem; these must hold for any distribution \(P\).</li>
<li>For the theorem to work, they restrict themselves to bounds that are <em>bounded antimonotonic</em>, which means that they cannot suddenly become much worse as the number of samples increases. (e.g. \(\epsilon(h, 2n, \delta)\) cannot be much larger than \(\epsilon(h, n, \delta)\).)</li>
</ul>
</li>
</ul>
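<p>To make the least-norm interpolating rule concrete, here is a small numerical sketch (my own illustration, not code from the paper): in the over-parameterized regime, the Moore–Penrose pseudoinverse yields the interpolator of minimum \(\ell_2\) norm.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterized setup: d >> n, so infinitely many interpolators exist.
n, d = 20, 400
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# The least-norm interpolator is theta = X^+ y: it minimizes ||theta||_2
# among all theta satisfying X theta = y.
theta = np.linalg.pinv(X) @ y

# It interpolates the training data exactly (up to numerical error)...
assert np.allclose(X @ theta, y)

# ...and has smaller norm than any other interpolator, e.g. one obtained
# by adding a null-space component of X.
e0 = np.zeros(d); e0[0] = 1.0
null_dir = e0 - np.linalg.pinv(X) @ (X @ e0)   # lies in null(X)
other = theta + null_dir
assert np.allclose(X @ other, y)
assert np.linalg.norm(theta) <= np.linalg.norm(other)
```

<p>The pseudoinverse solution lies in the row space of \(X\), so any other interpolator differs from it by an orthogonal null-space vector and can only have larger norm.</p>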
<h2 id="the-result">The result</h2>
<p>Now, I give a rather hand-wavy paraphrase of the theorem:</p>
<p><em><strong>Theorem 1:</strong> Suppose \(\epsilon\) is a bound that depends on \(h\), \(n\), and \(\delta\) that applies to all nice distributions \(P\).
Then, for a “very large fraction” of values of \(n\) as \(n\) grows, there exists a distribution \(P_n\) such that</em></p>
\[\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \leq O(1 / \sqrt{n})] \geq 1 - \delta\]
<p><em>but</em></p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq \Omega(1)] \geq \frac{1}{2},\]
<p><em>where \(h\) is the least-norm interpolant of a set of \(n\) points drawn from \(P_n\). The probabilities above refer to randomness from the training sample drawn from \(P_n\).</em></p>
<p>Let’s break this down and talk about what it means.</p>
<p>The generalization bound \(\epsilon\) can depend on the minimum-norm interpolating prediction rule \(h\), the number of samples \(n\), and the confidence parameter \(\delta\).
It <em>cannot</em> depend on the distribution over samples \(P\), and it must apply to all such “nice” distributions.
This opens up the possibility that a satisfactory bound \(\epsilon\) could perform much better on some distributions than others.</p>
<ul>
<li>
<p>This result particularly applies to generalization bounds that make use of some property of the prediction rule \(h\). For instance, it demonstrates the limitations of <a href="https://ieeexplore.ieee.org/document/661502" target="_blank">this 1998 Bartlett paper</a>, which gives generalization bounds that are small when the parameters of \(h\) have small norms.</p>
</li>
<li>
<p>Note that this isn’t really talking about “traditional” capacity-based generalization bounds, like those that rely on <a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension" target="_blank">VC-dimension</a>. These capacity-based bounds are applied to the <em>hypothesis class</em> \(\mathcal{H}\) that contains \(h\), rather than the prediction rule \(h\) itself.</p>
<p>These kinds of bounds are already overly pessimistic in the over-parameterized regime, however. Measurements of the capacity of \(\mathcal{H}\)—like the VC-dimension and the <a href="https://en.wikipedia.org/wiki/Rademacher_complexity" target="_blank">Rademacher complexity</a>—will always lead to vacuous generalization bounds for interpolating classifiers because those bounds rely on limiting the expressive power of hypotheses in \(\mathcal{H}\). From the lens of capacity-based generalization approaches, overfitting is <em>always</em> bad, which makes a nontrivial analysis of interpolation methods impossible with these tools.</p>
</li>
</ul>
<p>\(\epsilon\) does indeed perform much better on some distributions than others. The meat and potatoes of the proof shows the existence of some nice distribution where a bound \(\epsilon\) necessarily underperforms, even though the minimum-norm interpolating solution actually has a small generalization error.</p>
<p><img src="/assets/images/2021-07-30-bl20/bound.jpeg" alt="" /></p>
<ul>
<li>
<p>The first inequality in the theorem demonstrates how the minimum-norm interpolating classifier does well.
This is represented by the true generalization errors lying below the green dashed line, which corresponds to the bound in the first inequality.
As \(n\) grows, the true generalization error approaches zero with high probability.</p>
</li>
<li>
<p>On the other hand, the underperformance is illustrated by the second inequality, which shows that the bound \(\epsilon\) often cannot guarantee that the generalization error is smaller than some constant as \(n\) becomes large.
As visualized above, the bound \(\epsilon\) (represented by red dots with a red line corresponding to the expected value of \(\epsilon\)) will most of the time (but not always) lie above the constant curve denoted by the dashed red line.
This isn’t great, because we should expect an abundance of training samples \(n\) to translate to an error bound that approaches zero as \(n\) approaches infinity.</p>
</li>
</ul>
<p>So far, nothing has been said about the dimension of the inputs, \(d\).
The authors define \(d\) within the context of the distributions \(P_n\) as roughly \(n^2\). Thus, \(d \gg n\) and this problem deals squarely with the over-parameterized regime.</p>
<p>To reiterate, the key takeaway here is that the data distribution is very important for evaluating whether successful generalization occurs.
Without knowledge of the data distribution, it’s impossible to give accurate generalization bounds for the over-parameterized case (\(d \gg n\)).</p>
<h2 id="proof-ideas">Proof ideas</h2>
<p>The main strategy in this proof is to show the existence of a “good distribution” \(P_n\) and a “bad distribution” \(Q_n\) that are very similar, but where minimum-norm interpolation yields a much smaller generalization error on \(P_n\) than on \(Q_n\).
This gap forces any valid generalization error bound \(\epsilon\) to be large, despite the fact that the minimum-norm interpolator has small generalization error for \(P_n\).</p>
<p>To satisfy the similarity requirement, \(P_n\) and \(Q_n\) must be indistinguishable with respect to \(h\).
Consider full training samples of \(n\) \(d\)-dimensional inputs and labels \((X_P, Y_P), (X_Q, Y_Q) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n\) drawn from the two respective distributions.
Then, the probability that \(h\) is the minimum-norm interpolator of \((X_P, Y_P)\) must be identical to the probability that it is the minimum-norm interpolator of \((X_Q, Y_Q)\).
If this is the case, then \(\epsilon\) must be defined to ensure that each of</p>
\[\epsilon(h, n, \delta) \geq R_{P_n}(h) - R_{P_n}^* \quad \text{and} \quad \epsilon(h, n, \delta)\geq R_{Q_n}(h) - R_{Q_n}^*\]
<p>hold with probability \(1 - \delta\).
This then means that it must be the case that for any \(t \in \mathbb{R}\):</p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq t] \geq \max(\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \geq t], \mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq t]).\]
<p>To prove the theorem, it suffices to show \(R_{P_n}(h) - R_{P_n}^*\) is very small and \(R_{Q_n}(h) - R_{Q_n}^*\) is large with high probability.
This forces \(\epsilon(h, n, \delta)\) to be large and \(R_{P_n}(h) - R_{P_n}^*\) to be small with high probability, which concludes the proof.</p>
<p>A key idea towards showing this gap between the generalization of \(P_n\) and \(Q_n\) is to define distributions that behave very differently in testing, despite being indistinguishable from the standpoint of training.
To implement this idea, \(Q_n\) will reuse samples in testing phase, while \(P_n\) will not.</p>
<p>Now, we define the two distributions, with the help of a third “helper” distribution \(D_n\).</p>
<h3 id="d_n-the-skewed-gaussian-distribution">\(D_n\): The skewed Gaussian distribution</h3>
<p>We draw an input \(x_i\) from the \(d\)-dimensional Gaussian distribution \(\mathcal{N}(0, \Sigma)\) with mean zero and diagonal covariance matrix \(\Sigma\) with</p>
\[\Sigma_{j,j} = \lambda_j = \begin{cases}
\frac{1}{81} & j = 1 \\
\frac{1}{d^2} & j > 1.
\end{cases}\]
<p>When \(d\) is large, this corresponds to a distribution where \(x_1\) will be very large relative to \(x_2, \dots, x_d\), which trend towards zero.
The label \(y_i\) is drawn by taking \(y_i = \langle x_i, \theta\rangle + \epsilon_i\), where \(\epsilon_i \sim \mathcal{N}(0, \frac{1}{81})\).
Thus, the noise is drawn at the scale of the dominant first coordinate.</p>
<p><img src="/assets/images/2021-07-30-bl20/Dn.jpeg" alt="" /></p>
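<p>Here is a quick simulation of \(D_n\) (my own sketch; taking \(\theta\) supported on the first coordinate is an assumption for illustration, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Dn(rng, n, d, theta):
    """Draw n samples from the skewed Gaussian distribution D_n (a sketch)."""
    # Diagonal covariance: the first coordinate has variance 1/81,
    # the remaining d-1 coordinates have variance 1/d^2.
    lam = np.full(d, 1.0 / d**2)
    lam[0] = 1.0 / 81.0
    x = rng.normal(size=(n, d)) * np.sqrt(lam)
    # Label noise at the scale of the dominant first coordinate: Var = 1/81.
    eps = rng.normal(scale=1.0 / 9.0, size=n)
    y = x @ theta + eps
    return x, y

n = 50
d = n**2                              # d ~ n^2: squarely over-parameterized
theta = np.zeros(d); theta[0] = 1.0   # illustrative choice of theta
X, y = sample_Dn(rng, n, d, theta)

# The first coordinate dominates: its empirical variance dwarfs the rest.
print(X[:, 0].var(), X[:, 1:].var())
```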
<p>We use this skewed distribution because it works beautifully with the bounds in the minimum-norm interpolant that are laid out in BLLT19.
Using notation from <a href="/2021/07/11/bllt19.html" target="_blank">my blog post on BLLT19</a>, we can characterize the effective dimensions \(r_k(\Sigma)\) and \(R_k(\Sigma)\), which yield clean risk bounds.</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} =
\begin{cases}
\frac{\frac{1}{81} + \frac{d-1}{d^2}}{\frac{1}{81}} = \Theta(1) & k = 0 \\
\frac{\frac{d-k}{d^2}}{\frac{1}{d^2}} = d-k & k > 0.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2} =
\begin{cases}
\frac{\left(\frac{1}{81} + \frac{d-1}{d^2}\right)^2}{\frac{1}{81^2} + \frac{d-1}{d^4}} = \Theta(1) & k = 0 \\
\frac{\left(\frac{d-k}{d^2}\right)^2}{\frac{d-k}{d^4}} = d-k & k > 0.
\end{cases}\]
<p>By taking \(k^* = 1\) and applying the bound, then with high probability:</p>
\[R(\hat{\theta}) = O\left(\|\theta^*\|^2 \lambda_1\left( \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{r_0(\Sigma)}{n}\right) + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \right)\]
\[= O\left(\|\theta^*\|^2\left(\frac{1}{\sqrt{n}} + \frac{1}{n}\right) + \frac{1}{81}\left(\frac{1}{n} + \frac{n}{d-1}\right)\right).\]
<p>If we take \(d = n^2\), then this term trends towards zero at a rate of \(\frac{1}{\sqrt{n}}\) as \(n\) approaches infinity, which validates the kind of bound we’re looking at for \(P_n\).
(Note: \(d\) does not exactly equal \(n^2\) in the paper; there are a few more technicalities here that we’re glossing over.)</p>
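<p>The effective dimensions above are easy to check numerically; this is my own sketch of the formulas, plugged in for the skewed spectrum:</p>

```python
import numpy as np

def effective_dims(lam, k):
    """r_k and R_k from BLLT19 for a spectrum lam sorted in descending order."""
    tail = lam[k:]                           # eigenvalues lambda_{k+1}, ...
    r = tail.sum() / tail[0]                 # r_k = (sum_{j>k} lam_j) / lam_{k+1}
    R = tail.sum()**2 / (tail**2).sum()      # R_k = (sum lam_j)^2 / sum lam_j^2
    return r, R

d = 10_000
lam = np.full(d, 1.0 / d**2)
lam[0] = 1.0 / 81.0

r0, R0 = effective_dims(lam, 0)
r1, R1 = effective_dims(lam, 1)
print(r0, R0)   # both Theta(1): dominated by the large first eigenvalue
print(r1, R1)   # both equal d - 1 for the flat tail
```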
<p>This gives us an example where minimum-norm interpolation does fantastically. However, it does not show why the generalization bound \(\epsilon(h, n, \delta)\) cannot be tight.
To do so, we define the actual two distributions we care about—\(Q_n\) and \(P_n\)—in terms of \(D_n\).</p>
<h3 id="q_n-poor-interpolation-from-sample-reuse">\(Q_n\): Poor interpolation from sample reuse</h3>
<p>The first confusing thing about \(Q_n\) is that it’s a random distribution.
That is, we can think of \(Q_n\) being drawn from a distribution over distributions \(\mathcal{Q}_n\), since it depends on a random sample from \(D_n\).</p>
<p>To define \(Q_n\), draw \(m = \Theta(n)\) independent samples \((x_i, y_i)_{i \in [m]}\) from \(D_n\).
\(Q_n\) will be supported on these \(m\) samples.</p>
<p><img src="/assets/images/2021-07-30-bl20/Qn1.jpeg" alt="" /></p>
<p>After fixing these samples, we can draw \((x, y)\) from \(Q_n\) by first uniformly selecting \(x\) from \(\{x_1, \dots, x_m\}\), the set of pre-selected points.
Then, we choose \(y\) using the same approach that we did for \(D_n\): \(y = \langle x, \theta\rangle + \epsilon\) for \(\epsilon \sim \mathcal{N}(0, \frac{1}{81})\).</p>
<p><img src="/assets/images/2021-07-30-bl20/Qn2.jpeg" alt="" /></p>
<p>What this means is that the training inputs \(x_i\) for \(i \in [n]\) will exactly reoccur in the expected risk, albeit with different labels \(y_i\).
This differs greatly from \(D_n\), where the continuity of the distribution over \(x_i\)’s ensures that the same exact sample would never realistically be chosen in “testing.”</p>
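<p>The sample-reuse mechanism of \(Q_n\) can be sketched as follows (my own illustration; the support points here are a stand-in for draws from \(D_n\), and the values of \(m\) and \(d\) are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Q_n is supported on a fixed set of m pre-drawn inputs: every fresh draw
# picks one of them uniformly and attaches an independently re-drawn label.
m, d = 30, 900
support = rng.normal(size=(m, d))      # stand-in for m draws from D_n
theta = np.zeros(d); theta[0] = 1.0    # illustrative choice of theta

def sample_Qn(n):
    idx = rng.integers(m, size=n)      # inputs are reused across draws
    x = support[idx]
    eps = rng.normal(scale=1.0 / 9.0, size=n)   # Var(eps) = 1/81
    return x, x @ theta + eps, idx

# Training and "test" draws can hit the exact same input with different
# labels -- something that almost surely never happens under the continuous D_n.
_, _, train_idx = sample_Qn(25)
_, _, test_idx = sample_Qn(25)
overlap = len(set(train_idx) & set(test_idx))
print(overlap)   # number of inputs shared between the two draws
```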
<p>The crux of the argument that \(Q_n\) is “bad” comes from Lemma 5, which suggests that least-norm interpolation will perform poorly on inputs \(x_i\) that show up exactly once in the training set.
When these are drawn again when computing the expected risk (with new labels), they’ll have substantially higher error than would a random input from \(D_n\).
This allows the authors to show that—for a proper choice of \(m\)—</p>
\[\mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq \Omega(1)] \geq \frac{1}{2}.\]
<p>Now, it only remains to show that \(Q_n\) is indistinguishable in the training phase from a “good” distribution that has low risk for least-norm interpolation.</p>
<p>\(D_n\) is good, but \(Q_n\) unfortunately cannot be contrasted to \(D_n\) in this manner.
Because \(D_n\) never repeats training samples, the two have somewhat different distributions over interpolators \(h\).
Instead, we define \(P_n\) in a slightly different way to have the nice interpolation properties of \(D_n\), while being identical to \(Q_n\) in the training phase.</p>
<h3 id="p_n-d_n-but-with-extra-samples">\(P_n\): \(D_n\) but with extra samples</h3>
<p>The idea behind \(P_n\) is that it draws inputs \(x_i\) from \(D_n\), but will occasionally draw more than one copy of an input and average their labels \(y_i\) together to produce a new label.</p>
<p>This provides indistinguishability from \(Q_n\) in the training phase.
Both draw a collection of samples—with some of them appearing multiple times in the training set—and both minimum-norm interpolators will take these properties into account.
This indistinguishability is proved in Lemma 7 and relies on careful choices of the number of original samples \(m\) for \(Q_n\) and the number of repeated samples in \(P_n\). This idea is put together with Lemma 5 (which shows that \(Q_n\) has poor minimum-norm interpolation behavior) to show that \(\epsilon(h, n, \delta)\) cannot be small.</p>
<p>However, \(P_n\) is <em>not</em> a random distribution and it will <em>not</em> carry that repetition over to the “evaluation phase.”
The distribution used to evaluate risk—like \(D_n\) and unlike \(Q_n\)—will not contain any of the same \(x_i\)’s that were used in the training phase.
This causes the interpolation guarantees to be roughly the same as \(D_n\).
This gives the gap we’re looking for, which is formalized in Lemma 10.</p>
<p>Put together with Lemma 5, this gives the bound we’re looking for and concludes the story that the success (or lack thereof) of minimum-norm interpolation can only be understood by considering the data distribution, and <em>not</em> just the number of samples \(n\) and properties of the interpolants \(h\).</p>
<p><em>Thanks for reading the post! As always, I’d love to hear any thoughts and feedback. Writing these is very instructive for me to make sure I actually understand the ideas in these papers, and I hope they provide some value to you too.</em></p>Clayton SanfordThis is the fifth of a sequence of blog posts that summarize papers about over-parameterized ML models.[OPML#4] HMRT19: Surprises in high-dimensional ridgeless least squares interpolation2021-07-23T00:00:00+00:002021-07-23T00:00:00+00:00http://blog.claytonsanford.com/2021/07/23/hmrt19<!-- [HMRT19](https://arxiv.org/abs/1903.08560){:target="_blank"} [[OPML#4]](/2021/07/23/hmrt19.html){:target="_blank"} -->
<p><em>This is the fourth of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>This week’s <a href="https://arxiv.org/abs/1903.08560" target="_blank">paper</a>
is one by Hastie, Montanari, Rosset, and Tibshirani, which studies the cases in over-parameterized least-squares regression where the generalization error is small.
It follows in the vein of the papers reviewed so far (<a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, and <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a> <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>), which all present circumstances where such “benign overfitting” takes place in \(\ell_2\)-norm minimizing linear regression.</p>
<p>This summary will be a bit shorter than the previous ones, since a lot of the ideas here have already been discussed.
The paper is highly mathematically involved; it covers a lot of ground and gives theorems that are very general.
However, the core message about when it’s possible for favorable interpolation to occur is similar to that of BLLT19, so I’ll mainly focus on presenting the results of this paper on a high level and explaining the similarities between the two papers.</p>
<p>The paper is also nearly seventy pages long, and there’s a lot of interesting content about non-linear models and mis-specified models (which generalizes the case of double-descent considered in BHX19) that I won’t discuss for the sake of brevity.</p>
<p>The paper differs from BLLT19 because it considers a broader range of data distributions (e.g. samples \(x_i\) need not be drawn from probability distribution with subgaussian tails) and because it lies in an asymptotic regime.
Concretely, the three other papers previously considered give bounds in terms of the number of samples \(n\) and the number of parameters \(p\), where they are taken to be large, but not infinite.
Here, we instead fix some ratio \(\gamma = \frac{p}{n} > 1\) to represent how over-parameterized the model is and ask what happens when \(n, p \to \infty\).
This means that we’ll need to consider subtly different settings than I discussed in my post about <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, because some of those have \(p = \infty\) and another has \(p = \Theta(n \log n)\).
It’s necessary here to only consider numbers of parameters \(p\) that grow linearly with the number of samples \(n\).</p>
<h2 id="data-model">Data model</h2>
<p>The data model is mostly the same as the previous papers, minus the aforementioned differences in distributional assumptions and growth of \(n\) and \(p\).</p>
<p>We draw \(n\) random samples \((x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}\) where \(x_i\) is drawn from distribution with mean \(\mathbb{E}[x_i] = 0\), covariance \(\mathbb{E}[x_i x_i^T] = \Sigma\), and bounded low-order moments.
(This moment condition is weaker than subgaussianity, which makes these results more impressive.)
For some parameter vector, \(\beta \in \mathbb{R}^p\) and random noise \(\epsilon_i\) with variance \(\sigma\), the label \(y_i\) is set by taking \(y_i = \langle x_i, \beta \rangle + \epsilon_i\).</p>
<p>For simplicity, we’ll assume (as we have before) that \(\Sigma\) is a diagonal matrix with entries \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0\).
That way, we can assume each coordinate \(x_{i,j}\) of \(x_i\) is drawn independently and we can consider how the output of least-squares regression is affected by the variances \(\lambda_1, \dots, \lambda_p\).
(The paper allows \(\Sigma\) to be any symmetric positive definite matrix, and it instead considers the output of least-squares regression in terms of the eigenvalues of \(\Sigma\), rather than the variances of each independent component.)</p>
<p>As mentioned above, for some fixed over-parameterization ratio \(\gamma > 1\), we’ll let \(p = \gamma n\) and let \(n \to \infty\).</p>
<p>Given a training sample collected in input matrix \(X\) and label vector \(y\), the solution to minimum-norm least-squares is the \(\hat{\beta} \in \mathbb{R}^p\) that minimizes \(\|\hat{\beta}\|_2\) and interpolates the training samples: \(X \hat{\beta} = y\).
The goal—like in other papers about over-parameterized least-squares regression—is to bound the expected squared risk of the prediction rule \(\hat{\beta}\) on a new sample \(x\):</p>
\[R_X(\hat{\beta}; \beta) = \mathbb{E}_{x}[(\langle x, \hat{\beta}\rangle - \langle x, \beta\rangle)^2].\]
<p>Like in BLLT19, the analysis works by decomposing this risk into a bias term \(B_X(\hat{\beta}; \beta)\) and a variance term \(V_X(\hat{\beta}; \beta)\).</p>
<h2 id="main-result">Main result</h2>
<p>Their main result is Theorem 2, which shows that as \(n\) and \(p\) become arbitrarily large, the bias \(B_X\) and variance \(V_X\) converge to the <em>predicted bias</em> \(\mathscr{B}\) and <em>predicted variance</em> \(\mathscr{V}\).
For this bound to hold, they require that for some constant \(M\) that does not depend on \(n\), the largest component variance has \(\lambda_1 \leq M\) and the smallest has \(\lambda_p \geq \frac{1}{M}\).</p>
<p>This means that the variances cannot decay to zero like BLLT19 relies on!
It will still matter that some variances be significantly smaller than others, but not in the same way.</p>
<p>Now, I’ll define the predicted bias and variance and try to explain the intuition behind them:</p>
\[\mathscr{B}(\gamma) = \left(1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\right) \sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]
<p>and</p>
\[\mathscr{V}(\gamma) = \sigma^2 \gamma \mathbf{c_0} \frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}},\]
<p>where \(c_0\) depends on \(\gamma\) and satisfies</p>
\[1 - \frac{1}{\gamma} = \frac{1}{p} \sum_{j=1}^p \frac{1}{1 + \gamma c_0 \lambda_j}.\]
<p><em>Note: The predicted variance differs from the version presented in the paper.
I additionally include the bolded \(c_0\) term, which I suspect was left out as a typo.
In its current version, the bound in Theorem 2 is inconsistent with the specialized bound in Theorem 1, so I suspect that it was just an omission of a variable in the variance statement.</em></p>
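<p>To get a feel for these formulas, here is a numerical sketch (my own code, which includes the extra \(c_0\) factor discussed in the note above): solve the fixed-point equation for \(c_0\) by bisection, then evaluate \(\mathscr{B}\) and \(\mathscr{V}\).</p>

```python
import numpy as np

def solve_c0(gamma, lam, iters=200):
    """Numerically solve 1 - 1/gamma = (1/p) sum_j 1/(1 + gamma c0 lam_j)."""
    f = lambda c0: np.mean(1.0 / (1.0 + gamma * c0 * lam)) - (1.0 - 1.0 / gamma)
    lo, hi = 1e-12, 1e12   # f is decreasing in c0, so bisect in log scale
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return np.sqrt(lo * hi)

def predicted_BV(gamma, lam, beta, sigma):
    """Predicted bias and variance from Theorem 2 (with the extra c0 factor)."""
    c0 = solve_c0(gamma, lam)
    a = 1.0 / (1.0 + gamma * c0 * lam) ** 2
    ratio = (lam**2 * a).sum() / (lam * a).sum()
    B = (1.0 + gamma * c0 * ratio) * (beta**2 * lam * a).sum()
    V = sigma**2 * gamma * c0 * ratio
    return B, V

# Sanity check against the isotropic case: lam = 1, gamma = 2 gives
# c0 = 1/(gamma(gamma-1)) = 0.5, B = ||beta||^2 / 2, V = sigma^2.
p = 100
beta = np.zeros(p); beta[0] = 1.0
print(solve_c0(2.0, np.ones(p)))
print(predicted_BV(2.0, np.ones(p), beta, 1.0))
```

<p>The closed-form isotropic expressions derived below fall out of the same code, which is a useful check that the general formulas were transcribed correctly.</p>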
<p>If you’re anything like me, you find these expressions a little terrifying and hard to understand.
Let’s break them down into pieces to try to grasp how the value of \(\gamma\) affects the risk as \(n\) and \(p\) become large.</p>
<p>The rough intuition for the impact of over-parameterization on these two terms is that growth of \(\gamma\) hurts bias and helps variance.
However, this doesn’t seem immediately obvious; indeed, the variance appears to <em>grow</em> as \(\gamma\) increases.
It’s necessary to understand the product \(\gamma c_0\) in order to get why this is the case.
We’ll first consider a simple isotropic case to understand what happens to that term, and then hand-wavily revisit the general case.</p>
<h3 id="isotropic-data">Isotropic data</h3>
<p>For simplicity, consider the <em>isotropic</em> or <em>spherical</em> case, where \(\Sigma = I_p\) and \(\lambda_1 = \dots = \lambda_p = 1\).
(“Isotropic” roughly means “the same in every direction.”)
Then, taking \(c_0 := \frac{1}{\gamma(\gamma - 1)}\) satisfies the condition on \(c_0\).
Now, we can plug in \(c_0 \gamma = \frac{1}{\gamma - 1}\) into the expressions for predicted bias and predicted variance:</p>
\[\mathscr{B}(\gamma) = \left(1 + \frac{1}{\gamma - 1} \right) \frac{1}{(1 + \frac{1}{\gamma-1})^2} \sum_{j=1}^p \beta_j^2 = \frac{\|\beta\|_2^2}{1 + \frac{1}{\gamma -1}} = \frac{\|\beta\|_2^2( \gamma-1)}{\gamma}.\]
\[\mathscr{V}(\gamma) = \frac{\sigma^2}{\gamma - 1}.\]
<p>Thus, as \(\gamma\) becomes larger (and the learning model becomes more over-parameterized), the bias will approach \(\|\beta\|^2\) and the variance will approach zero.
This isn’t really good news for the isotropic case…
The bias rapidly approaches \(\|\beta\|^2\) as \(\gamma\) grows, which will make it impossible for the risk to be small.</p>
<p>It’s possible for the excess risk to decrease as \(\gamma\) grows in the case where the signal to noise ratio \(\frac{\|\beta\|^2}{\sigma^2}\) is large, but the excess risk will still be worse than it would be in parts of the classical regime where \(\gamma < 1\).</p>
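<p>A quick Monte-Carlo sanity check of the isotropic predictions (my own experiment; the specific \(n\), \(\gamma\), and \(\sigma\) are arbitrary choices). Since \(\Sigma = I_p\), the excess risk of \(\hat{\beta}\) is just the parameter error \(\|\hat{\beta} - \beta\|_2^2\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Check the isotropic limit: risk -> ||beta||^2 (g-1)/g + sigma^2/(g-1).
n, gamma, sigma = 400, 2.0, 0.5
p = int(gamma * n)
beta = np.zeros(p); beta[0] = 1.0   # ||beta||^2 = 1, an assumed choice

risks = []
for _ in range(20):
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    bhat = np.linalg.pinv(X) @ y    # minimum-norm interpolator
    # Isotropic inputs: excess risk equals the parameter error.
    risks.append(np.sum((bhat - beta) ** 2))

predicted = (gamma - 1) / gamma + sigma**2 / (gamma - 1)
print(np.mean(risks), predicted)    # these should be close for large n
```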
<p>As with BLLT19, to see the benefits of overfitting, we need to look at how the variances decay in the anisotropic setting.</p>
<h3 id="intuition-for-the-general-case">Intuition for the general case</h3>
<p>We’ll continue to think of \(c_0 \gamma\) as something that decays to zero as \(\gamma\) becomes large.
If that weren’t the case and \(c_0 \gamma\) were large, then each term \(\frac{1}{1 + \gamma c_0 \lambda_j}\) would be small for \(1 \leq j \leq p\), and it’s impossible for their average to be \(1 - \frac{1}{\gamma}\), since that’s close to one.</p>
<p>Now, we’ll talk through each of the components of the predicted bias and variance to speculate in a hand-wavy way about how this result applies.</p>
<p>Let’s start with the variance term.</p>
<ul>
<li>First, if we think of \(\gamma c_0\) as something like \(\frac{1}{\gamma - 1}\) (or at least something that decays as \(\gamma\) increases), then the variance goes to zero as the model becomes more over-parameterized.
This checks out with our intuition from BLLT19 and BHX19.</li>
<li>Also intuitively, the variance drops if the noise \(\sigma\) drops.
If there’s no noise, then all of the model’s error will come from the bias.</li>
<li>
<p>Now, the hard part.</p>
\[\frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]
<p>will be thought of roughly as corresponding to the rate of component variance decay.
Since \(\gamma c_0\) is small and the variances \(\lambda_j\) are bounded above, most of the \((1 + \gamma c_0 \lambda_j)^2\) terms should be close to 1.
Making that sketchy simplification, we instead have</p>
\[\frac{\sum_{j=1}^p \lambda_j^2}{\sum_{j=1}^p \lambda_j}.\]
<p>This looks somewhat similar to the \(R_0(\Sigma)\) term from BLLT19; in fact, it equals \(\sum_{j=1}^p \lambda_j / R_0(\Sigma)\).
The term (and hence, the variance) is small when there’s a gap between the high-variance components and the low-variance components, or when some \(\lambda_j\)’s are much larger than other \(\lambda_j\)’s.
This corresponds to the requirement from BLLT19 that the decay must be sufficiently fast.</p>
</li>
</ul>
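<p>The simplified ratio \(\sum_j \lambda_j^2 / \sum_j \lambda_j\) can be evaluated for a couple of toy spectra (my own illustration, with arbitrary parameter choices):</p>

```python
import numpy as np

def ratio(lam):
    # Simplified variance factor, treating (1 + gamma*c0*lam_j)^2 ~ 1.
    return (lam**2).sum() / lam.sum()

p = 100_000
flat = np.ones(p)                             # every direction equally important
gapped = np.full(p, 1e-3); gapped[0] = 1.0    # one dominant direction + long low tail

print(ratio(flat))     # 1.0: no gap between components
print(ratio(gapped))   # ~ 0.011: a gap plus a heavy aggregate tail shrinks it
```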
<p>Thus, you get small variance if there’s some combination of heavy over-parameterization, low noise, and rapid decay of variances.
Now, we look at bias.</p>
<ul>
<li>
<p>The first term</p>
\[1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]
<p>will be roughly 1 when the model is over-parameterized (because of \(\gamma c_0\)) or when the variances \(\lambda_i\) drop sufficiently fast.</p>
</li>
<li>
<p>The final term</p>
\[\sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]
<p>looks at the correlations between “important” directions in the true parameters \(\beta\) and variances \(\lambda_j\).
If we again treat \((1 + \gamma c_0 \lambda_j)^2 \approx 1\), then this term is \(\sum_{j=1}^p \beta_j^2 \lambda_j\).
This is approximately \(\|\beta\|^2\) (and thus large) if most of the weight of \(\beta\) lies in high-variance directions.
It will be small if the weight of \(\beta\) is instead spread across many medium-variance components.
This seems analogous to the BLLT19 requirement that the decay of weights not be too rapid.</p>
</li>
</ul>
<!-- ### One other case
To try to make the intuition for the bias term make sense, I'll go over one more specific case, where different distributions of weight over the parameter vector $$\beta$$ will lead to different levels of acceptable over-parameterization.
-->
<p>Thanks for reading this blog post! As always, let me know if you have thoughts or feedback.
(As of now, there’s no way to comment on the blog. My original attempt with Disqus led to the introduction of a bunch of terrible ads to this blog. I’ll be back with something soon, which will hopefully be less toxic.)</p>Clayton Sanford[OPML#3] MVSS19: Harmless interpolation of noisy data in regression2021-07-16T00:00:00+00:002021-07-16T00:00:00+00:00http://blog.claytonsanford.com/2021/07/16/mvss19<p><em>This is the third of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>This is a <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">2019 paper</a> by Muthukumar, Vodrahalli, Subramanian, and Sahai, which will be known as <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a>.
Like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, it considers the question of when least-squares linear regression performs well in the over-parameterized regime.
One of the great things about this paper is that it goes beyond giving mathematical conditions needed for a low expected risk of interpolation.
It additionally suggests intuitive mechanisms for how it works, which helps motivate the conditions that BLLT19 impose.</p>
<h2 id="overview">Overview</h2>
<p>To recap, we’ve so far studied two settings where double-descent occurs in linear regression:</p>
<ul>
<li>The <em>misspecified setting</em>, where the under-parameterized model lacks access to features of the data that are essential for predicting the label \(y\). BHX19 studies this setting.
Success in the over-parameterized setting depends on the increased access to data components causing decreased variance for the predictors.</li>
<li>The setting where the variances of the components of input \(x\) decay at a rate that is neither too slow nor too fast. This is explored in BLLT19.</li>
</ul>
<p>MVSS19 studies a similar setting to BLLT19 with decreasing variances.
They do so by treating the over-parameterized learning model as the process of choosing between <em>aliases</em>—hypotheses that perfectly interpolate (or fit) the training samples and minimize empirical risk.
As the complexity of a model increases beyond the point of overfitting, the number of aliases increases rapidly, which means that an empirical-risk-minimizing algorithm (like least-squares regression) has many choices of learning rules to choose from, some of which might have good generalization properties.</p>
<p><img src="/assets/images/2021-07-16-mvss19/alias.jpeg" alt="" /></p>
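<p>The alias picture is easy to demonstrate numerically (my own sketch; the planted \(\beta\) is an illustrative choice): every null-space direction of the data matrix generates another perfect interpolator, and different aliases can have wildly different population error.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 200
beta = np.zeros(d); beta[0] = 1.0   # planted signal (illustrative)
X = rng.normal(size=(n, d))
y = X @ beta                        # noiseless labels, for clarity

# The min-norm alias, plus a second alias built by adding a null-space vector.
min_norm = np.linalg.pinv(X) @ y
null_proj = np.eye(d) - np.linalg.pinv(X) @ X   # projector onto null(X)
bad_alias = min_norm + 5.0 * null_proj @ rng.normal(size=d)

for alias in (min_norm, bad_alias):
    assert np.allclose(X @ alias, y)   # both interpolate the training set

# For isotropic test inputs, population error is the parameter error:
print(np.sum((min_norm - beta) ** 2), np.sum((bad_alias - beta) ** 2))
```

<p>Both aliases have zero empirical risk, so empirical risk minimization alone cannot distinguish them; which alias the algorithm selects is what determines generalization.</p>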
<p>This paper answers two questions for a broad category of data models:</p>
<ol>
<li><strong>What is the population error of the best \(d\)-parameter linear learning rule \(f: \mathbb{R}^d \to \mathbb{R}\) that interpolates all \(n\) training samples (that is, \(f(x_i) = y_i\) for all \(i \in [n]\))?</strong> They answer this question in Section 3 by characterizing the “fundamental price of interpolation.” In doing so, they show that it is essential that \(d \gg n\) for an interpolating solution to perform well. That is, dramatic over-parameterization is necessary for any learning algorithm to obtain a rule that fits the training samples and has a low expected risk.</li>
<li>
<p><strong>When does the over-parameterized least-squares algorithm choose a good interpolating classifier?</strong> While (1) tells us that there exists some alias with low risk when \(d \gg n\), it doesn’t tell us whether this particular learning algorithm will find it. They introduce a framework in Section 4 for analyzing <em>signal bleed</em> (when the true signal present in the training samples is distributed among many aliases, making all of them bad) and <em>signal contamination</em> (when the noise from the training samples corrupts the chosen alias). This framework justifies the “not too fast/not too slow” conditions from BLLT19 and argues that a gradual decay of variances is necessary to ensure that least-squares obtains a learning rule that neither ignores the signal nor is corrupted by noise.</p>
<p><em>Note: The paper actually considers a general covariance matrix \(\Sigma\) for the inputs \(x_i\) and does not require that each of the \(d\) components be uncorrelated with all others.
Thus, instead of considering the rate of decay of the variances of each independent component, this paper (and BLLT19) instead consider the rate of decay of the eigenvalues of \(\Sigma\).
It’s then possible for favorable interpolation to occur when in cases where every component of \(x_i\) has the same variance, but the eigenvalues of \(\Sigma\) decay at a gradual rate because of correlations between components.</em></p>
</li>
</ol>
<p>They have plenty of other interesting stuff too.
The end of Section 4 discusses Tikhonov (ridge) regression, which adds a regularization term and therefore does not overfit, yet outperforms least-squares interpolation for a proper choice of regularization parameter.
Section 5 focuses on a broader range of interpolating regression algorithms (such as <em>basis pursuit</em>, which minimizes the \(\ell_1\) norm of the parameters rather than the \(\ell_2\) norm minimized by least-squares) and proposes a hybrid method between the \(\ell_1\) and \(\ell_2\) approaches that obtains the best of both worlds.
However, for the sake of simplicity, we’ll keep this summary to the two questions above.</p>
<h2 id="what-can-go-wrong-with-interpolation">What can go wrong with interpolation?</h2>
<p>Towards answering these questions, the authors identify three broad cases when interpolation approaches fail.</p>
<h3 id="failure-1-too-few-aliases">Failure #1: Too few aliases</h3>
<p>If \(d\) is not much larger than \(n\), then the model is over-parameterized, but only just.
As a result, there are relatively few aliases that interpolate all of the samples \((x_i, y_i)\). (This roughly corresponds to the second and third panels of the above graphic.)
Frequently, none of these will be any good, since they might all fall into the typical pitfalls of overfitting: in order to perfectly fit the samples, the underlying trend in the data is missed.</p>
<p>Noisy labels (\(y_i = \langle x_i, \beta\rangle + \epsilon_i\) for random \(\epsilon_i\) with variance \(\sigma^2\)) exacerbate these issues.
If few aliases are available, most of them will be heavily affected by the noisy samples.
Indeed, the authors of this paper argue that the only way to ensure the existence of an interpolating learning rule that is not knocked askew by the noise is to have many aliases.
Thus, interpolation will not work without over-parameterization; we must require that \(d \gg n\).
More on this later.</p>
<h3 id="failure-2-signal-bleed">Failure #2: Signal bleed</h3>
<p>In this case, we have plenty of aliases, but they’re all different:</p>
<p><img src="/assets/images/2021-07-16-mvss19/fail2.jpeg" alt="" /></p>
<p>The above image shows that there are three different interpolating solutions that fit the orange points, but they are uncorrelated with one another.</p>
<p>(<em>Sidebar: These aliases don’t look like linear functions, but that’s because they’re being applied to the Fourier features of the input. This will be discussed later.</em>)</p>
<p>Suppose the true learning rule is represented by the cyan constant-one alias.
We’re doomed if the learning algorithm chooses the purple or red aliases because those are uncorrelated with the cyan alias and will label the data with no better accuracy than chance.
The least-squares algorithm will produce a learning rule that averages all three together, which will also poorly approximate the true curve.
This phenomenon is known as <em>signal bleed</em>, because the helpful signal provided by the data is diluted by being distributed among several uncorrelated aliases.</p>
<p>To avoid signal bleed, the learning algorithm needs to somehow be biased in favor of lower-frequency or simpler features.
This is why the BLLT19 paper requires that the variances of each component decay at a sufficiently fast rate.
If they don’t, then there is no way to break ties among uncorrelated aliases, which dooms them to a bad solution.</p>
<h3 id="failure-3-signal-contamination">Failure #3: Signal contamination</h3>
<p>Suppose once again, we’re in a setting with many different aliases, some of which are uncorrelated with one another.
If we consider the noise \(\epsilon_i\) added to each label, then every one of the aliases will somehow be corrupted when the noise is added.
Ideally, we want to show that as the number of samples and number of parameters become large, the impact of the noise on the chosen interpolating alias will be minor.</p>
<p>For this to be possible, we have to ensure that the noise is diluted among the different aliases.
This is the opposite of what we want for the signal!
We know that the noise will corrupt the aliases, but if there are many uncorrelated aliases, the corruption can either be relatively evenly distributed among the different aliases (<em>noise dissipation</em>) or concentrated in a few (<em>signal contamination</em>).
The first case can then be used to argue that any alias chosen by the learning algorithm will be minimally affected by noise, which is great!</p>
<p>One way to ensure that noise is diluted among aliases is to impose some degree of similar weight on aliases under consideration.
In the land of BLLT19, this means guaranteeing that the rate of decay of variances is not <em>too</em> fast.
This poses the trade-off explored in BLLT19 and here: There’s a sweet spot in the relative importance of different features from the perspective of the learning algorithm that must be found in order to avoid either signal bleed or signal contamination.</p>
<p><img src="/assets/images/2021-07-16-mvss19/fail3.jpeg" alt="" /></p>
<p>Before jumping in to these results more formally, we introduce two data models that we’ll refer back to.</p>
<h2 id="data-models">Data models</h2>
<p>In both cases, inputs \(x\) are chosen from some procedure and label \(y = \langle x, \beta\rangle + \epsilon\), where \(\beta\) is the unknown true signal and \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) is independent Gaussian noise.
We let \(X \in \mathbb{R}^{n \times d}\), \(Y \in \mathbb{R}^n\), and \(W \in \mathbb{R}^n\) contain all of the training inputs, labels, and noise respectively.</p>
<p>The least-squares algorithm returns the minimum-norm \(\hat{\beta} \in \mathbb{R}^d\) that interpolates the training data: \(X \hat{\beta} = Y\).</p>
<p><em>Note: This notation is slightly different than the notation used in their paper. I modified to make it line up more closely with BHX19 and BLLT19.</em></p>
<h3 id="model-1-gaussian-features">Model #1: Gaussian features</h3>
<p>Every input \(x \in \mathbb{R}^d\) is drawn independently from a multivariate Gaussian \(\mathcal{N}(0, \Sigma)\), where \(\Sigma \in \mathbb{R}^{d\times d}\) is a covariance matrix.</p>
<h3 id="model-2-fourier-features">Model #2: Fourier features</h3>
<p>For any \(j \in [d]\), we define the \(j\)th <em>Fourier feature</em> to be a function \(\phi_j: [0, 1] \to \mathbb{C}\) with \(\phi_j(t) = e^{2\pi (j-1) i t}\).
Because \(e^{iz} = \cos(z) + i \sin(z)\), \(\phi_j(t)\) can be thought of as a sinusoidal function with frequency increasing with \(j\).
For any \(t \in [0, 1]\), its Fourier features are \(\phi(t) = (\phi_1(t), \dots, \phi_d(t)) \in \mathbb{C}^d\).</p>
<p>Notably, these features are orthonormal and uncorrelated.
That is,</p>
\[\langle \phi_j, \phi_k \rangle = \mathbb{E}_{t \sim \text{Unif}[0, 1]}[\phi_j(t) \overline{\phi_k(t)}] = \begin{cases}
1 & \text{if } j=k, \\
0 & \text{otherwise.}
\end{cases}\]
<p>To learn more about orthonormality and why its a desirable trait in vectors and functions, check out <a href="/2021/07/16/orthogonality.html" target="_blank">my post</a> on the subject.</p>
<p>We generate the training samples by choosing \(n\) evenly spaced points on the interval \([0, 1]\): \(t_j = \frac{j-1}{n}\) for all \(j \in [n]\).
The features of the \(j\)th sample are \(x_j = \phi(t_j) = (1, e^{2 \pi i t_j}, e^{2\pi(2i) t_j}, \dots, e^{2\pi((d-1)i) t_j}) \in \mathbb{C}^d\).
The feature vectors of distinct samples are also orthogonal (when \(d\) is a multiple of \(n\)): \(\langle x_j, x_k \rangle = 0\) whenever \(j \neq k\), and each has norm \(\sqrt{d}\).</p>
<p>The below image gives a visual of the sinusoidal interpretation of Fourier features and the training samples:</p>
<p><img src="/assets/images/2021-07-16-mvss19/fourier.jpeg" alt="" /></p>
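<p>A quick sketch (my own check, not the paper's code) verifies this orthogonality numerically, assuming \(d\) is a multiple of \(n\):</p>

```python
import numpy as np

# Build the Fourier feature matrix for n evenly spaced points and d
# frequencies, and check that distinct samples' feature vectors are
# orthogonal (each with squared norm d).
n, d = 8, 64
t = np.arange(n) / n                                  # t_j = (j-1)/n
freqs = np.arange(d)                                  # frequencies 0, ..., d-1
X = np.exp(2j * np.pi * np.outer(t, freqs))           # X[j, m] = e^{2 pi i m t_j}

G = X @ X.conj().T                                    # Gram matrix of the samples
assert np.allclose(G, d * np.eye(n))                  # orthogonal rows, norm sqrt(d)
```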
<h2 id="the-necessity-of-over-paramterization">The necessity of over-parameterization</h2>
<p>Section 3 of the paper studies the “fundamental price of interpolation” by asking how good the best interpolating classifier can be.
Specifically, they characterize the ideal test risk of any interpolating classifier:</p>
\[\mathcal{E}^* = \min_{\beta \in \mathbb{R}^d: X \beta = Y} \mathbb{E}_{(x, y)}[(y - \langle x, \beta\rangle)^2] - \sigma^2.\]
<p>The condition \(X \beta = Y\) ensures that \(\beta\) does indeed fit all of the training samples.
The variance of the noise \(\sigma^2\) is subtracted because no classifier can ever hope to have risk better than the noise, since every label will be corrupted.</p>
<p>They prove upper- and lower-bounds on \(\mathcal{E}^*\) that hold with high probability. In particular, by Corollaries 1 and 2, with probability 0.9:</p>
<ul>
<li>Under the Gaussian features model, \(\mathcal{E}^* = \Theta(\frac{\sigma^2 n}{d})\).</li>
<li>Under the Fourier features model, \(\mathcal{E}^* = \Omega(\frac{\sigma^2 n}{d \log n})\).</li>
</ul>
<p>Therefore, in order to guarantee that the risk approaches the best possible as \(n\) and \(d\) grow, it must be the case that \(d \gg \sigma^2 n\).
That is, it’s essential for the model to be over-parameterized for the interpolation to be favorable.
This formalizes Failure #1 by highlighting that without enough aliases (which are provided by having a highly over-parameterized model), even the best alias will have poor performance.</p>
<p>These proofs first use linear algebra to exactly represent \(\mathcal{E}^*\) in terms of inputs \(X\), covariance \(\Sigma\), and noise \(\epsilon\).
Then, they apply concentration bounds to show that the risk is close to its expectation with high probability over the input data and the noise.</p>
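<p>A small numerical sanity check (my own sketch, not the paper's code) recovers the \(\Theta(\frac{\sigma^2 n}{d})\) rate under the Gaussian model with identity covariance. In that case the best interpolator is \(\beta = \beta^* + X^+ W\), so \(\mathcal{E}^* = \|X^+ W\|^2\), where \(X^+\) is the pseudoinverse and \(W\) the noise vector:</p>

```python
import numpy as np

# Sketch: estimate E* = ||X^+ W||^2 for isotropic Gaussian features and
# compare it to the predicted scale sigma^2 * n / d.
rng = np.random.default_rng(1)
n, d, sigma = 50, 2000, 1.0
vals = []
for _ in range(50):                            # average over independent draws
    X = rng.standard_normal((n, d))
    W = sigma * rng.standard_normal(n)
    vals.append(np.linalg.norm(np.linalg.pinv(X) @ W) ** 2)

ratio = np.mean(vals) / (sigma**2 * n / d)
assert 0.5 < ratio < 2.0    # matches Theta(sigma^2 n / d) up to constants
```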
<h2 id="not-too-fast-not-too-slow">Not too fast; not too slow</h2>
<p>Here, we recap Section 4 of the paper while studying the Fourier features setting.
In doing so, we explain how Failures #2 and #3 can occur.
We focus on Fourier features because their orthogonality properties make the concepts of signal bleed and signal contamination much cleaner.</p>
<h3 id="signal-bleed">Signal bleed</h3>
<p>Consider a simple learning problem where each \(x\) is a Fourier feature and \(y = 1\) no matter what. (There is no noise here.)
In this case, our samples will be of the form \((\phi(t_1), 1), \dots, (\phi(t_n), 1)\) for \(t_1, \dots, t_n\) evenly spaced in \([0, 1]\).</p>
<p>First, we ask ourselves which solutions will interpolate between the samples.
Since the \(j\)th Fourier feature is the function \(\phi_j(t) = e^{2\pi (j-1) i t}\), the first Fourier feature \(\phi_1(t) = 1\) is an interpolating alias.
(It’s also the correct alias.)
However, so is \(\phi_j\) whenever \(j-1\) is a multiple of \(n\); each such feature is orthogonal (uncorrelated) to the first feature (and to every other Fourier feature).
If there are \(d\) Fourier features and \(n\) samples for \(d \gg n\), there are \(\frac{d}{n}\) interpolating aliases, all of which are orthogonal.</p>
<p>This is a problem.
This forces our algorithm to choose between \(\frac{d}{n}\) different candidate learning rules, all of which are completely uncorrelated with one another, without having any additional information about which one is best.
Indeed, the interpolating learning rule can be any function of the form \(\sum_{j = 0}^{d/n} a_j \phi_{nj+1}(t)\) for \(\sum_{j = 0}^{d/n} a_j = 1\).</p>
<p>How does the least-squares algorithm choose a parameter vector \(\beta\) from all of these interpolating solutions?
It chooses the one with the smallest \(\ell_2\) norm. By properties of orthogonality, this is equivalent to choosing the function minimizing \(\sum_{j = 0}^{d/n} a_j^2\) subject to \(\sum_{j = 0}^{d/n} a_j = 1\), which is satisfied by taking \(a_j = \frac{n}{d}\) for every \(j\).
This means that \(\beta_1 = \frac{n}{d}\).
Equivalently, the true feature \(\phi_1\) contributes only an \(\frac{n}{d}\) fraction of influence on the learning rule, which diminishes as \(d\) grows and the model becomes further over-parameterized.
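<p>This computation is easy to check numerically (a sketch of my own, not the paper's code): the minimum-norm interpolator of the all-ones labels spreads its weight evenly over the \(d/n\) aliases, so the true constant feature survives with coefficient only \(n/d\):</p>

```python
import numpy as np

# Fit the all-ones labels with minimum-norm least squares on the
# (unscaled) Fourier features; the weight on the true feature phi_1
# is only n/d, with the rest bleeding into the aliases.
n, d = 8, 64
t = np.arange(n) / n
X = np.exp(2j * np.pi * np.outer(t, np.arange(d)))
Y = np.ones(n)

beta = np.linalg.pinv(X) @ Y
assert np.isclose(beta[0].real, n / d)            # survival of the true signal
assert np.isclose(beta[n].real, n / d)            # each alias gets equal weight
assert np.isclose(abs(beta[1]), 0.0, atol=1e-10)  # non-alias frequencies unused
```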
<p>This is why we refer to this failure mode (Failure #2) as <em>signal bleed</em>: the signal conveyed in \(\phi_1\) bleeds into all other \(\phi_{jn + 1}\) until the true signal has almost no bearing on the outcome.</p>
<p><strong>How can this be fixed?</strong> By giving a higher weight to “simpler” features in order to indicate some kind of preference for these features.
The higher weight permits the \(\ell_2\) norm of a classifier to contain a large amount of influence \(\phi_1\) without incurring a high cost.</p>
<p>To make this concrete, let’s rescale each \(\phi_j\) such that \(\phi_j(t) = \sqrt{\lambda_j} e^{2\pi (j-1) i t}\).
Now, the interpolating aliases are \(\frac{1}{\sqrt{\lambda_j}} \phi_j\) whenever \(j\) is one more than a multiple of \(n\), which means that the higher-frequency features will be more costly to employ.
This time, we can express any interpolating learning rule in the form \(\sum_{j = 0}^{d/n} \frac{a_j}{\sqrt{\lambda_{nj+1}}} \phi_{nj+1}(t)\) for \(\sum_{j = 0}^{d/n} a_j = 1\).
Least-squares will then choose the learning rule whose \(a_j\) values minimize \(\sum_{j = 0}^{d/n} \frac{a_j^2}{\lambda_{nj+1}}\).
This will be done by taking:</p>
\[a_j = \frac{\lambda_{nj+1}}{\sum_{k=0}^{d/n} \lambda_{kn +1}}.\]
<p>Going back to our Fourier setting where \(\phi_1\) is the only true signal, our classifier will perform best if \(a_0 \approx 1\), which occurs if \(\frac{\lambda_1}{\sum_{k=0}^{d/n} \lambda_{kn +1}} \to 1\) as \(n\) and \(d\) become large.
(The quantity that must approach 1 is known as the <em>survival factor</em> in this paper.)
For this to be possible, there must be a rapid drop-off in \(\lambda_j\) as \(j\) grows.</p>
<p>Interestingly, this coincides with BLLT19’s requirements for “benign overfitting.”
The survival factor is the inverse of their \(r_0(\Sigma)\) term, which captures the gap between the largest variance and the sum of the other variances.
As was discussed in <a href="/2021/07/11/bllt19.html" target="_blank">that blog post</a>, that quantity must be much smaller than \(n\) for their bound to be non-trivial.</p>
<p>Figure 5 of their paper provides a nice visualization of how dropping the weights on high-frequency features can lead to better interpolating solutions that avoid signal bleed.
The top plot has a large gap between the weights on the low-frequency features and the high-frequency features, which prevents least-squares from giving too much preference to the high-frequency features that just happen to interpolate the training data.
The bottom plot is spiky and inconsistent because it fails to do so.</p>
<p><img src="/assets/images/2021-07-16-mvss19/bleed.jpeg" alt="" /></p>
<p>This logic seems circular somehow: in order to have good interpolation, we must be able to select for the good features and weight them strongly enough so that their aliases override orthogonal aliases.
However, if we know the good features, why include the bad features in the first place?
The next section discusses why it’s important in the interpolation regime to not let the importance of features (represented by \(\lambda_j\)) drop too rapidly.</p>
<h3 id="signal-contamination">Signal contamination</h3>
<p>In the previous section, we were concerned about the “true signal” of \(\phi_1\) being diluted by the preference of least-squares for higher-frequency Fourier features.
To combat that, it was necessary to drop the variances of the high-frequency features by some sequence \(\lambda_j\) that decreases sufficiently quickly.</p>
<p>Here, we’re concerned with the opposite issue: the incorrect influence of orthonormal high-frequency aliases and noise on the learning rule inferred by least-squares.
In this Fourier features setting, all contributions from other aliases will necessarily increase the risk because the other aliases are all orthogonal to the signal \(\phi_1\).
As before, we can quantify the minimum error caused by the inclusion of other aliases in the prediction, which we’ll call the <em>contamination</em>:</p>
\[C = \sqrt{\sum_{k = 1}^{d/n} \hat{\beta}_{kn + 1}^2}.\]
<p>In the case of least-squares regression, we have:</p>
\[C = \frac{\sqrt{\sum_{k=1}^{d/n} \lambda_{kn+1}}}{\sum_{k = 0}^{d/n} \lambda_{kn +1}}.\]
<p>We’re interested in finding weights \(\lambda_j\) which ensure that the contamination \(C\) becomes very small in a regime where \(d\) and \(n\) are very large.
One way to do so is to choose \(\lambda_j\) such that \(\sqrt{\sum_{k=1}^{d/n} \lambda_{kn+1}} \ll \sum_{k = 1}^{d/n} \lambda_{kn +1}\), which occurs when the sum of the tail weights is large and the decay of \(\lambda\) is heavy-tailed.
That is, to avoid having spurious features have a lot of bearing on the final learning rule, one can require that \(\lambda\) decays very slowly, so that the lower-frequency spurious features are not given much more weight than the higher-frequency features.</p>
<p>Taken together, this section and the previous section impose a trade-off in how features should be weighted.</p>
<ul>
<li>To avoid signal bleeding, it’s necessary for a relatively small number of features to have much more weight than the rest of them.</li>
<li>To avoid signal contamination, the remaining features need to jointly have a large amount of weight and the weights cannot decay too quickly.</li>
</ul>
<p>This is the same trade-off presented by BLLT19 with their \(r_k(\Sigma)\) and \(R_k(\Sigma)\) terms.
For their bounds to be effective, it’s necessary to have that \(r_0(\Sigma) \ll n\) (prevent signal bleed by mandating decay of feature variances) and \(R_{k^*}(\Sigma) \gg n\) where \(k^*\) is a parameter that divides high-variance and low-variance features (prevent signal contamination by requiring that the variances decay sufficiently slowly).</p>
<h2 id="conclusion-and-next-steps">Conclusion and next steps</h2>
<p>Like the other papers discussed so far, the results of this paper apply to a very clean setting.
The Fourier features examples illustrate the contamination-vs-bleed trade-off especially crisply because the orthogonality of the features means that all features other than the signal are strictly detrimental.
Still, this paper is nice because it motivates the mathematical conditions specified in BLLT19 and gives more intuition into when one should expect least-squares interpolation to succeed.</p>
<p>The paper suggests that further works focus on the powers of approximation of more complex models and how they relate to success in the interpolation regime.
This is where there’s a key difference between BHX19 and BLLT19/MVSS19.
The over-parameterized models in the former explicitly have more information in comparison to their under-parameterized counterparts, so they have a clear advantage in the kinds of functions they can approximate.
On the other hand, the success of over-parameterized models in BLLT19 and MVSS19 is solely dependent on the relative variances of many features; they don’t say anything about the fact that most over-parameterized models can express more kinds of functions.
The authors hope that future work continues to study interpolation through the lens of signal bleed and signal contamination, but that they also find a way to work in the real approximation theoretic advantages that over-parameterized models maintain over other models.</p>
<p>I personally enjoyed reading this paper a lot, because I found it very intuitive and well-written. I’d recommend checking it out directly if you find this interesting!</p>
Clayton Sanford
This is the third of a sequence of blog posts that summarize papers about over-parameterized ML models.
Orthonormal function bases: what they are and why we care
2021-07-16T00:00:00+00:00
http://blog.claytonsanford.com/2021/07/16/orthogonality
<p>When writing <a href="/2021/07/04/candidacy-overview.html" target="_blank">posts on over-parameterized ML models</a> in preparation for my candidacy exam, I realized that many of the theoretical results I discuss rely heavily on <em>orthonormal functions</em>, and that they’ll be difficult for readers to understand without having some background.
This post introduces orthonormal families of functions and explains some of the properties that make them convenient mathematical tools.
If you want a more thorough (or just plain better) introduction, check out Ryan O’Donnell’s textbook (available for free on <a href="http://www.cs.cmu.edu/~odonnell/" target="_blank">his website</a>).</p>
<h2 id="orthonormality-of-vectors">Orthonormality of vectors</h2>
<p>For now, forget that I ever said anything about functions being orthonormal.
We’ll instead focus on vectors.
We define some terms:</p>
<ul>
<li>If \(x\) and \(y\) are vectors in \(\mathbb{R}^n\), then \(x\) and \(y\) are <em>orthogonal</em> if they are perpendicular.
Mathematically, they’re defined to be orthogonal if \(\langle x, y \rangle = 0\), where \(\langle x, y\rangle = \sum_{i=1}^n x_i y_i\) is the <em>inner product</em>.</li>
<li>They are <em>orthonormal</em> if they additionally have unit norm: \(\| x \|_2 = \| y \|_2 = 1\), where \(\|x \|_2 = \sqrt{\langle x , x\rangle}\) is the \(\ell_2\) norm.</li>
<li>
<p>\(u_1, \dots, u_n\) is an <em>orthonormal basis</em> for \(\mathbb{R}^n\) if they are a basis for \(\mathbb{R}^n\) (that is, \(\text{span}(u_1, \dots, u_n) = \mathbb{R}^n\) and they are linearly independent) and if all pairs of vectors are orthonormal.
Equivalently, for all \(i, j \in \{1, \dots, n\}\):</p>
\[\langle u_i, u_j \rangle = \delta_{i, j} := \begin{cases}
1 & \text{if } i = j \\
0 & \text{otherwise.}
\end{cases}\]
</li>
</ul>
<p>This basis can be thought of as a rotation of the coordinate axes, since each basis element is perpendicular to every other element.</p>
<p><img src="/assets/images/2021-07-16-orthogonality/vector.jpeg" alt="" /></p>
<p>For example, the above image has an orthonormal basis \(u_1, u_2\) of \(\mathbb{R}^2\). The point \(x\) can be equivalently written as \((x_1, x_2)\) using the standard coordinates axes and as \(\langle x, u_1\rangle u_1 + \langle x, u_2\rangle u_2\) using the rotated axes.</p>
<p>An orthonormal basis \(u_1, \dots, u_n\) of \(\mathbb{R}^n\) is an extremely useful thing to have because it’s easy to express any vector \(x \in \mathbb{R}^n\) as a linear combination of basis vectors.
The fact that \(u_1, \dots, u_n\) is a basis alone guarantees that there exist coefficients \(a_1, \dots, a_n \in \mathbb{R}\) such that \(x = \sum_{i=1}^n a_i u_i\); their orthonormality makes those coefficients easy to compute.
Indeed, it simply holds that \(a_i = \langle x, u_i \rangle\) for all \(i\); this can be verified by considering the inner product and applying the orthonormality of the basis elements:</p>
\[\langle x, u_i \rangle = \sum_{j=1}^n a_j \langle u_j, u_i\rangle = \sum_{j=1}^n a_j \delta_{i, j} = a_i.\]
<p>This gives rise to some nice properties:</p>
<ul>
<li>If we let \(a = (a_1, \dots, a_n) \in \mathbb{R}^n\), then \(\| a\|_2 = \|x \|_2\).</li>
<li>For some other \(x' \in \mathbb{R}^n\) with \(x'= \sum_{i=1}^n a_i' u_i\), then \(\langle x, x'\rangle = \langle a, a'\rangle\).</li>
<li>If \(x\) and \(y\) are orthogonal, then \(\|x\|_2^2 + \|y\|_2^2 = \|x + y\|_2^2\). (This is the Pythagorean theorem!)</li>
</ul>
<h2 id="generalizing-orthonormality-to-function-spaces">Generalizing orthonormality to function spaces</h2>
<p>These concepts can be generalized beyond simple vector spaces to consider other spaces defined with inner products.
If \(\mathcal{X}\) is a <a href="https://en.wikipedia.org/wiki/Hilbert_space" target="_blank">Hilbert space</a> with inner product \(\langle \cdot, \cdot \rangle_{\mathcal{X}}\), then we can define \(x, y \in \mathcal{X}\) as orthonormal if \(\langle x, y \rangle_{\mathcal{X}} = 0\) and \(\langle x, x\rangle_{\mathcal{X}} = \langle y, y\rangle_{\mathcal{X}} = 1\).</p>
<p>One important category of Hilbert spaces are \(L_2\) function spaces with distribution \(\mathcal{D}\) over \(\mathcal{X}\).
Let \(L_2(\mathcal{D}) = \{f: \mathcal{X} \to \mathbb{R}: \|f\|_{\mathcal{D}} < \infty\}\), where \(\|f\|_{\mathcal{D}} = \sqrt{\mathbb{E}_{x \sim \mathcal{D}} [f(x)^2]}\).
This is a Hilbert space with inner-product \(\langle f, g\rangle_{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}[f(x) g(x)]\) which contains all functions with bounded \(L_2\) norm over this distribution \(\mathcal{D}\).
One way to think about this is to think of each function \(f\) as a vector \((f(x))_{x \in \mathcal{X}}\) with infinitely many coordinates and of the inner product as a vector inner product that is weighted by the distribution.</p>
<p>This is a really nice thing to have, because it permits the easy definition of an orthonormal basis for function spaces.
This in turn enables functions to be easily represented in terms of other simpler functions, which is useful for all kinds of analysis.
We say that \(\mathcal{U} \subseteq L_2(\mathcal{D})\) is an <em>orthonormal basis</em> for \(L_2(\mathcal{D})\) if the following hold:</p>
<ol>
<li>\(\mathcal{U}\) spans \(L_2(\mathcal{D})\). That is, for all functions \(f \in L_2(\mathcal{D})\), there exist coefficients \(a_{u} \in \mathbb{R}\) for all \(u \in \mathcal{U}\) such that \(f(x) = \sum_{u \in \mathcal{U}} a_u u(x)\) for all \(x \in \mathcal{X}\).</li>
<li>The functions in \(\mathcal{U}\) are orthonormal with respect to \(\mathcal{D}\). That is, \(\langle u, u'\rangle_{\mathcal{D}} = \delta_{u, u'}\) for all \(u, u' \in \mathcal{U}\).</li>
</ol>
<p>These conditions are the same as the conditions for orthonormal bases for vectors, and the properties transition over too!</p>
<ul>
<li>For all \(u \in \mathcal{U}\), \(a_u = \langle f, u\rangle_{\mathcal{D}}\).</li>
<li>\(\|a\|_2 = \sqrt{\sum_{u \in \mathcal{U}} a_u^2} = \|f\|_{\mathcal{D}}\). (This is called the <em>Plancherel theorem</em>.)</li>
<li>For \(f': \mathcal{X} \to \mathbb{R}\) with \(f' = \sum_{u \in \mathcal{U}} a_u' u\), \(\langle a, a'\rangle = \langle f, f'\rangle_{\mathcal{D}}\). (This is called <em>Parseval’s theorem</em>.)</li>
<li>If \(\langle f, f'\rangle_{\mathcal{D}} = 0\), then \(\|f\|_{\mathcal{D}}^2 + \|f'\|_{\mathcal{D}}^2 = \|f + f'\|_{\mathcal{D}}^2\).</li>
</ul>
<p>To explain why this is useful, I introduce several examples of orthonormal bases, which typically come in handy.</p>
<h3 id="example-1-parities-over-the-boolean-cube">Example #1: Parities over the Boolean cube</h3>
<p>Let \(\mathcal{X}\) be the \(n\)-dimensional Boolean cube \(\{-1, 1\}^n\) and let \(\mathcal{D}\) be the uniform distribution over the cube.
Then, we can write \(\langle f, g\rangle_{\mathcal{D}} = \frac{1}{2^n} \sum_{x \in \{-1, 1\}^n} f(x) g(x)\).</p>
<p>For some \(S\subseteq [n]:= \{1, \dots n\}\), we define a <em>parity function</em> \(\chi_{S}: \{-1, 1\}^n \to \{-1, 1\}\) to be \(\chi_S(x) = \prod_{i\in S} x_i\).
That is, it returns \(1\) if the number of negative coordinates \(x_i\) for \(i \in S\) is even and \(-1\) if it is odd.
A parity function is <em>high-frequency</em> if \(|S|\) is large (because flipping a single bit of \(x\) is likely to change the value of \(\chi_S\)) and <em>low-frequency</em> if \(|S|\) is small.</p>
<p><img src="/assets/images/2021-07-16-orthogonality/parity.jpeg" alt="" /></p>
<p>The figure shows two parities defined on \(\{-1, 1\}^4\), one low-frequency and one high-frequency.
Note that high-frequency parities change their value much more frequently when moving between adjacent vertices.</p>
<p>The set of all \(2^n\) parity functions \(\{\chi_S: S \subseteq [n]\}\) is an orthonormal basis of \(L_2(\mathcal{D})\), which means that every function \(f\) taking input over the Boolean cube can be expressed as a linear combination of parity functions: \(f = \sum_{S \subseteq [n]} a_S \chi_S\), for \(a_S = \langle f, \chi_S\rangle_{\mathcal{D}}\).</p>
<p>When talking about Fourier expansions (which will be briefly discussed in the next example), functions are thought of as having two equivalent representations:</p>
<ul>
<li>The traditional representation, where \(f\) is thought of as a collection of input/output pairs \((x, f(x))\).</li>
<li>The frequency representation, where \(f\) is thought of as a linear combination of basis elements, which can be parameterized by \((a_S)_{S \subseteq [n]}\).</li>
</ul>
<p>Numerous strands of Boolean function analysis rely on dividing a function into high-frequency and low-frequency features, and these equivalent representations are an essential tool towards doing so.</p>
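<p>The parity expansion is small enough to compute exhaustively for modest \(n\). Here is a sketch of my own that expands an arbitrary function on \(\{-1,1\}^3\) in the parity basis and checks the expansion and Plancherel's theorem:</p>

```python
import numpy as np
from itertools import product, combinations

# Expand a function on the Boolean cube {-1,1}^n in the parity basis.
n = 3
cube = np.array(list(product([-1, 1], repeat=n)))        # all 2^n points
subsets = [S for r in range(n + 1) for S in combinations(range(n), r)]
chis = np.array([cube[:, list(S)].prod(axis=1) for S in subsets])
# chis[S, x] = chi_S(x); the empty product gives the constant parity chi_{} = 1

rng = np.random.default_rng(3)
f = rng.standard_normal(len(cube))                       # an arbitrary function

a = chis @ f / len(cube)                                 # a_S = <f, chi_S>_D
assert np.allclose(a @ chis, f)                          # f = sum_S a_S chi_S
assert np.isclose((a**2).sum(), (f**2).mean())           # Plancherel
```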
<h3 id="example-2-fourier-series-over-the-interval">Example #2: Fourier series over the interval</h3>
<p>Let \(\mathcal{X} = [-1, 1]\) and let \(\mathcal{D} = \text{Unif}([-1, 1])\).
Then, any \(f: [-1, 1] \to \mathbb{R}\) with finite \(\mathbb{E}_{x \sim \text{Unif}([-1, 1])}[f(x)^2]\) can be expressed as a <em>Fourier series</em> by making use of the following orthonormal basis: \(\mathcal{U} = \{u_j: j \in \mathbb{Z}\}\), for \(u_j(x) = e^{i \pi j x}\).
(For this example, \(i = \sqrt{-1}\).)
These functions are complex-valued, but they still satisfy the conditions necessary for orthonormal bases, which allow functions to be decomposed into high-frequency and low-frequency components.</p>
<p>For people who like trigonometric functions more than complex-valued functions, this basis can be re-written by applying Euler’s formula, \(e^{ix} = \cos(x) + i \sin(x)\):</p>
\[\mathcal{U'} = \{x \mapsto 1\} \cup \{x \mapsto \sqrt{2} \cos(\pi j x): j \in \mathbb{Z}_+\} \cup\{x \mapsto \sqrt{2} \sin(\pi j x): j \in \mathbb{Z}_+\}.\]
<p>Thus, we can write \(f\) as:</p>
\[f(x) = a_0 + \sum_{j=1}^{\infty}\left(\sqrt{2} a_j \cos(\pi j x) + \sqrt{2} b_j\sin(\pi j x) \right)\]
<p>for \(a_0 = \langle f, 1\rangle_{\mathcal{D}} = \mathbb{E}_{x}[f(x)]\), \(a_j = \langle f, \sqrt{2} \cos(\pi j \cdot)\rangle_{\mathcal{D}}\) for \(j \geq 1\), and \(b_j = \langle f, \sqrt{2} \sin(\pi j \cdot)\rangle_{\mathcal{D}}\).</p>
<p>Again, this gives us a nice decomposition of \(f\) into high- and low-frequency terms.
If \(|a_j|\) is large for large values of \(j\), then \(f\) is likely to be “highly bumpy.”
Conversely, rapidly decaying values of \(|a_j|\) as \(j\) grows implies that \(f\) will be smooth and closely approximable by low-frequency sines and cosines.
Moreover, Plancherel gives us a nice relationship between the norm of the function \(f\) and the size of its coefficients \(a\) and \(b\):</p>
\[\|f\|_{\mathcal{D}}^2 = a_0^2 + \sum_{j=1}^{\infty}(a_j^2 + b_j^2).\]
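<p>As a sanity check, the Plancherel identity is easy to verify numerically. The sketch below (the test function \(f(x) = x\) and the truncation level are arbitrary choices of mine, not from the post) estimates the trigonometric coefficients by averaging over a grid, using frequencies that are integer multiples of \(\pi\), which form an orthonormal family on this length-2 interval; the truncated coefficient sum should approach \(\|f\|_{\mathcal{D}}^2 = \mathbb{E}[x^2] = 1/3\) from below:</p>

```python
import math

def avg(f, m=4000):
    # E_{x ~ Unif([-1, 1])}[f(x)], approximated by a midpoint rule on m points
    return sum(f(-1 + 2 * (i + 0.5) / m) for i in range(m)) / m

f = lambda x: x  # test function; by symmetry, only the sine coefficients survive

J = 50  # truncation level for the demo (an arbitrary choice)
a0 = avg(f)
a = [avg(lambda x, j=j: f(x) * math.sqrt(2) * math.cos(math.pi * j * x))
     for j in range(1, J + 1)]
b = [avg(lambda x, j=j: f(x) * math.sqrt(2) * math.sin(math.pi * j * x))
     for j in range(1, J + 1)]

norm_sq = avg(lambda x: f(x) ** 2)  # ||f||^2 = E[x^2] = 1/3
coeff_sq = a0 ** 2 + sum(c ** 2 for c in a) + sum(c ** 2 for c in b)
```

<p>Truncating at \(J\) terms leaves a Bessel-type deficit of order \(1/J\), so <code>coeff_sq</code> sits slightly below <code>norm_sq</code>, as the identity predicts.</p>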
<p>This toolset is really useful for proving facts about functions that satisfy some notion of “smoothness.”
My collaborators and I use a generalization of this orthonormal basis in our paper <a href="https://arxiv.org/abs/2102.02336" target="_blank">HSSV21</a> to show that smooth functions (which have bounded Lipschitz constant) can be closely approximated by shallow neural networks with random bottom-layer weights and that some “bumpy” functions with large Lipschitz constants cannot be approximated.</p>
<h3 id="example-3-legendre-polynomials-over-the-interval">Example #3: Legendre polynomials over the interval</h3>
<p>For the same setting as Example #2, there’s another popular orthonormal basis: the <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre polynomials</a>.
Roughly, this is a family of polynomials \(p_0, p_1, \dots\) that form an orthonormal basis for \(L_2(\mathcal{D})\), where \(p_i\) is a polynomial of degree \(i\).
Instead of decomposing a function over the interval into high- and low-frequency terms, we can now think of the function as a combination of high- and low-degree polynomials.
It’s like a Taylor expansion, except that each Legendre polynomial is uncorrelated with every other Legendre polynomial.</p>
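<p>One way to see this concretely: Bonnet’s recursion \((k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)\) generates the (unnormalized) Legendre polynomials, and rescaling by \(\sqrt{2i+1}\) makes them unit-norm under \(\text{Unif}([-1,1])\). A minimal sketch that builds them and checks orthonormality numerically:</p>

```python
import math

def legendre(i, x):
    # Bonnet's recurrence: (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
    if i == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, i):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def p_norm(i, x):
    # Rescaled so that E_{x ~ Unif([-1,1])}[p_i(x)^2] = 1
    return math.sqrt(2 * i + 1) * legendre(i, x)

def inner(f, g, m=4000):
    # <f, g>_D under Unif([-1,1]), approximated by a midpoint rule
    xs = (-1 + 2 * (t + 0.5) / m for t in range(m))
    return sum(f(x) * g(x) for x in xs) / m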
<p><img src="/assets/images/2021-07-16-orthogonality/legendre.png" alt="" /></p>
<h3 id="example-4-hermite-polynomials-over-gaussian-space">Example #4: Hermite polynomials over Gaussian space</h3>
<p>If we instead let \(\mathcal{X} = \mathbb{R}\) and let \(\mathcal{D}\) be the standard normal distribution \(\mathcal{N}(0, 1)\), then the normalized probabilist’s <a href="https://en.wikipedia.org/wiki/Hermite_polynomials">Hermite polynomials</a> are an orthonormal basis for \(L_2(\mathcal{D})\).
These again have nice properties.
In particular, each Hermite polynomial \(h_i\) can be defined recursively in terms of \(h_{i-2}\) and \(h_{i-1}\).</p>
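<p>Concretely, the probabilist’s Hermite polynomials satisfy \(He_i(x) = x \, He_{i-1}(x) - (i-1) He_{i-2}(x)\), and dividing by \(\sqrt{i!}\) normalizes them under \(\mathcal{N}(0, 1)\). A small sketch that builds them from the recurrence and checks orthonormality by integrating against the Gaussian density:</p>

```python
import math

def hermite(i, x):
    # Probabilist's Hermite: He_i(x) = x He_{i-1}(x) - (i-1) He_{i-2}(x)
    if i == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, i):
        h_prev, h = h, x * h - k * h_prev
    return h

def h_norm(i, x):
    # Normalized: E_{x ~ N(0,1)}[h_i(x)^2] = 1, since ||He_i||^2 = i!
    return hermite(i, x) / math.sqrt(math.factorial(i))

def gauss_inner(f, g, m=8000, lim=10.0):
    # Midpoint-rule approximation of E_{x ~ N(0,1)}[f(x) g(x)]
    total, h = 0.0, 2 * lim / m
    for t in range(m):
        x = -lim + (t + 0.5) * h
        total += f(x) * g(x) * math.exp(-x * x / 2)
    return total * h / math.sqrt(2 * math.pi)
```

<p>For instance, \(He_3(x) = x^3 - 3x\) pops out of the recurrence, and the numerical inner products of distinct normalized polynomials vanish.</p>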
<p><img src="/assets/images/2021-07-16-orthogonality/hermite.png" alt="" /></p>
<p>(This and the previous image were shamelessly stolen from the respective Wikipedia articles.)</p>Clayton SanfordWhen writing posts on over-parameterized ML models in preparation for my candidacy exam, I realized that many of the theoretical results I discuss rely heavily on orthonormal functions, and that they’ll be difficult for readers to understand without having some background. This post introduces orthonormal families of functions and explains some of the properties that make them convenient mathematical tools. If you want a more thorough (or just plain better) introduction, check out Ryan O’Donnell’s textbook (available for free on his website).[OPML#2] BLLT19: Benign overfitting in linear regression2021-07-11T00:00:00+00:002021-07-11T00:00:00+00:00http://blog.claytonsanford.com/2021/07/11/bllt19<!-- [BLLT19](https://arxiv.org/abs/1906.11300){:target="_blank"} [[OPML#2]](/2021/07/11/bllt19.html){:target="_blank"} -->
<p><em>This is the second of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>This week’s paper is known as <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>.
Written by Peter Bartlett, Philip Long, Gabor Lugosi, and Alexander Tsigler, this paper is similar to <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> in that both give examples of situations where linear regression models perform better when they have more parameters than samples.
However, the two papers study different situations in which over-parameterized models perform well:</p>
<ul>
<li>BHX19 considers a data model where the learner is given only \(p\) out of \(D\) total features and demonstrates that a second descent occurs for expected risk as \(p\) increases beyond \(n\) (the number of samples).</li>
<li>BLLT19 instead gives the learner access to all of the features and proves bounds on the population error when the model undergoes “benign overfitting.”
While they do not strictly give a double-descent curve, they analyze how this benign overfitting can occur when the training data are drawn from a distribution which satisfies certain covariance properties.</li>
</ul>
<h2 id="data-model">Data model</h2>
<p>This section introduces a simplified version of their data model and learning algorithm.
These simplifications are noted below and make it easier to explain their theoretical results.</p>
<ul>
<li>A labeled sample \((x, y) \in \mathbb{R}^p \times \mathbb{R}\) (where \(p\) may be infinite) is drawn as follows:
<ul>
<li>
<p>\(x\) is sampled from the multivariate Gaussian distribution \(\mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a diagonal covariance matrix.</p>
<p><em>Simplification #1: \(\Sigma\) need not be diagonal and \(x\) can instead be drawn from a distribution with subgaussian tails. The paper analyzes the eigenvalues of \(\Sigma\), but since we assume that \(\Sigma\) is diagonal, the eigenvalues are exactly the diagonal entries \(\Sigma_{i,i} = \mathbb{E}[x_i^2] > 0\).</em></p>
</li>
<li>
<p>There is some true parameter vector \(\theta^* \in \mathbb{R}^p\) with finite norm \(\| \theta^* \|_2 = \sqrt{\sum_{j=1}^p \theta_j^{*2}}\). (This was \(\beta\) in BHX19.)</p>
</li>
<li>
<p>\(y\) is drawn by sampling \(\epsilon\) from a normal distribution \(\mathcal{N}(0, \sigma^2)\) for some \(\sigma > 0\) and letting \(y = x^T \theta^* + \epsilon\).</p>
<p><em>Simplification #2: \(\epsilon\) is not necessarily Gaussian. Instead, it’s a subgaussian random variable that can depend on \(x\) with a lower-bound on the expectation of its square.</em></p>
</li>
</ul>
</li>
<li>
<p>The learner is provided with \(n\) samples \((x_1, y_1), \dots, (x_n, y_n)\), whose inputs and labels are collected into \(X \in \mathbb{R}^{n \times p}\) and \(y \in \mathbb{R}^n\) respectively.</p>
</li>
<li>
<p>The learner uses the <em>minimum norm estimator</em> (least-squares) \(\hat{\theta}\) to predict \(\theta^*\).
This is the same estimator used in BHX19: \(\hat{\theta} = X^T (X X^T)^{-1} y\), which is the vector \(\theta\) minimizing \(\| \theta\|^2\) such that \(X \theta = y\).</p>
<p><em>Technicality: We can assume that \(X X^T\) is invertible as long as \(p \geq n\).
Because they’re drawn from a multivariate Gaussian distribution, \(x_1, \dots, x_n\) will span an \(n\)-dimensional subspace almost surely, which makes \(X\) full-rank.</em></p>
</li>
<li>
<p>The learner’s prediction \(\hat{\theta}\) is evaluated by the <em>excess risk</em>:</p>
<p>\(R(\hat{\theta}) = \mathbb{E}[(y - x^T \hat{\theta})^2 - (y - x^T \theta^*)^2] = \mathbb{E}[(y - x^T \hat{\theta})^2] - \sigma^2\).</p>
</li>
</ul>
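<p>To make the setup concrete, here is a small simulation sketch (the sizes \(n = 20\), \(p = 100\), the decay profile of \(\Sigma\), and the choice of \(\theta^*\) are all arbitrary choices of mine, not from the paper) that draws data from this model and computes the minimum norm estimator, verifying that it interpolates the training labels when \(p > n\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 20, 100, 0.1                  # hypothetical sizes for the demo

idx = np.arange(1, p + 1)
lam = 1.0 / (idx * np.log(idx + 1) ** 2)    # diagonal of Sigma (slow decay)
theta_star = np.zeros(p)
theta_star[0] = 1.0                          # some fixed theta* with finite norm

X = rng.normal(size=(n, p)) * np.sqrt(lam)   # rows x_i ~ N(0, Sigma), Sigma diagonal
y = X @ theta_star + sigma * rng.normal(size=n)

# Minimum norm estimator: theta_hat = X^T (X X^T)^{-1} y, which interpolates
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)
```

<p>Since \(p > n\), the Gram matrix \(X X^T\) is invertible almost surely, and \(\hat{\theta}\) agrees with the pseudoinverse solution \(X^\dagger y\).</p>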
<p>The main object that the authors study is the choice of covariance \(\Sigma\).
For simplicity, assume that \(\lambda_i := \Sigma_{i,i}\) for \(1 \leq i \leq p\) and \(\lambda_1 \geq \lambda_2 \geq \dots\).
The results depend on how the diagonal entries of the matrix decay.
If the diagonals decay slowly or not at all, then all of the features have a similar impact on the resulting label \(y\).
Otherwise, if they decay rapidly, a few features will have an outsized impact on the label.
To quantify this decay, they introduce two measurements of the <em>effective rank</em> of \(\Sigma\).
For some \(k \in [0, p-1]\), let \(r_{k}(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}}\) and \(R_{k}(\Sigma) = \frac{(\sum_{i > k} \lambda_i)^2}{\sum_{i > k} \lambda_i^2}\).
Both terms will be small if the variances decrease rapidly beyond the \((k+1)\)th component of \(x\); they’ll be large if the variances decay slowly or not at all.</p>
<p>Like in last week’s summary, we introduce several settings with different choices of \(\Sigma\), which we’ll refer back to later when discussing the main result.</p>
<ul>
<li><strong>Setting A: Finite features, no decay.</strong>
Let \(p\) be finite and \(\Sigma = I_p\). Then, \(x\) is <em>isotropic</em>, meaning that all of its components have equal variance. We have \(r_k(\Sigma) = R_k(\Sigma) = p - k\).</li>
<li><strong>Setting B: Infinite features, rapid decay.</strong>
Define \(\Sigma\) with \(\lambda_i = \frac{1}{2^i}\). Then, \(r_k(\Sigma) = \frac{1 / 2^k}{1 / 2^{k+1}} = 2\) and \(R_k(\Sigma) = \frac{1 / 4^{k}}{1 / (3 \cdot 4^k)} = 3\).</li>
<li><strong>Setting C: Infinite features, less rapid decay.</strong>
Define \(\Sigma\) with \(\lambda_i = \frac{1}{i^2}\). Then, \(r_k(\Sigma) = \frac{\sum_{i=k+1}^\infty i^{-2}}{(k+1)^{-2}} = \frac{\Theta(1/k)}{(k+1)^{-2}} =\Theta(k)\) and \(R_k(\Sigma) = \frac{(\sum_{i=k+1}^\infty i^{-2})^2}{\sum_{i=k+1}^\infty i^{-4}} = \frac{\Theta(1/k^2)}{\Theta(1/k^3)}= \Theta(k)\).</li>
<li>
<p><strong>Setting D: Infinite features, slow decay.</strong>
Define \(\Sigma\) with \(\lambda_i = \frac{1}{i \log^2(i+1)}\). By approximating series with integrals, we can roughly compute the sums needed for the effective ranks:</p>
\[\sum_{i > k} \lambda_{i} \approx \int_{k}^{\infty} \frac{1}{x \log^2(x)} dx = -\frac{1}{\log x} \bigg\lvert_{k}^{\infty} = \frac{1}{\log k}.\]
\[\sum_{i > k} \lambda_{i}^2 \approx \int_{k}^{\infty} \frac{1}{x^2 \log^4(x)} dx = \Theta\left(\frac{1}{k \log^4 k}\right).\]
<p>Thus, \(r_{k}(\Sigma) = \Theta(k \log k)\) and \(R_{k}(\Sigma) = \Theta(k \log^2 k)\).
This is the first case that will give us the kind of “benign overfitting” that we’re looking for.</p>
</li>
<li>
<p><strong>Setting E: Finite features, two tiers of importance.</strong>
Let \(p = n \log n\) and</p>
\[\lambda_i = \begin{cases}
1 & \text{if } i \leq \frac{n}{\log n}, \\
\frac{1}{\log^2(n)} & \text{if } i > \frac{n}{\log n}.
\end{cases}\]
<p>Then,</p>
\[r_k(\Sigma) = \begin{cases}
\Theta\left(\frac{n}{\log n}\right) & \text{if } k < \frac{n}{\log n}, \\
p - k & \text{if } k \geq \frac{n}{\log n}, \\
\end{cases}.\]
<p>For \(k \geq \frac{n}{\log n}\), \(R_k(\Sigma) = p-k\) as well.</p>
</li>
</ul>
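<p>Both effective ranks translate directly into code. The following sketch truncates the infinite sequences at a large index as a stand-in for \(p = \infty\) (a numerical convenience of mine, not part of the paper’s setup) and reproduces the values claimed for Settings B and C:</p>

```python
def r_eff(lam, k):
    # r_k(Sigma) = (sum_{i>k} lambda_i) / lambda_{k+1}; lam[0] holds lambda_1
    tail = lam[k:]
    return sum(tail) / tail[0]

def R_eff(lam, k):
    # R_k(Sigma) = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    tail = lam[k:]
    return sum(tail) ** 2 / sum(l * l for l in tail)

lam_B = [2.0 ** -i for i in range(1, 200)]        # Setting B: lambda_i = 2^{-i}
lam_C = [1.0 / i ** 2 for i in range(1, 100001)]  # Setting C: lambda_i = 1/i^2
```

<p>For Setting B, \(r_k(\Sigma) \approx 2\) and \(R_k(\Sigma) \approx 3\) regardless of \(k\); for Setting C, both quantities grow linearly in \(k\), matching the calculations above.</p>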
<p><img src="/assets/images/2021-07-11-bllt19/setting.jpeg" alt="" /></p>
<h2 id="the-main-result">The main result</h2>
<p>This section includes the main upper-bound on risk given by Theorem 4.
From this theorem, one can derive sufficient conditions for benign overfitting and non-vacuous error bounds.</p>
<p>First, we define \(k^* = \min\{k \geq 0: r_k(\Sigma) \geq bn\}\) for some constant \(b\).
If there is no such \(k\), let \(k^* = \infty\).
You can think of this as denoting a separation between “high-impact” and “low-impact” coordinates of \(x\).
As stated before, \(r_{k}(\Sigma)\) is small when the variances of coordinates following \(x_{k+1}\) decay rapidly, which means that \(x_{k+1}\) has much larger impact on \(y\) than the following coordinates.
Thus, we won’t meet the condition \(r_k(\Sigma) \geq bn\) until there are roughly \(n\) coordinates following \(x_{k+1}\) that have a similar impact on \(y\) to \(x_{k+1}\).</p>
<p>Then, the following bound on risk holds with probability 0.99 over the sample \((x_1, y_1), \dots, (x_n, y_n)\):</p>
\[R(\hat{\theta}) \leq O\left(\|\theta^*\|^2 \lambda_1\left( \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{r_0(\Sigma)}{n}\right) + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \right).\]
<p><em>Simplification #3: The bound is actually shown to hold with probability \(1-\delta\), and hence the right-hand-side also includes \(\log \frac{1}{\delta}\) terms.</em></p>
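<p>The definition of \(k^*\) also translates directly into code. In this sketch, <code>None</code> plays the role of \(k^* = \infty\), and a truncated sequence again stands in for infinitely many features:</p>

```python
def k_star(lam, n, b=1.0):
    # k* = min{k >= 0 : r_k(Sigma) >= b*n}; returns None when no such k exists
    tail = sum(lam)                   # maintains sum_{i>k} lambda_i as k grows
    for k in range(len(lam)):
        if tail / lam[k] >= b * n:    # r_k(Sigma) = tail / lambda_{k+1}
            return k
        tail -= lam[k]
    return None
```

<p>With \(b = 1\): Setting A (for \(p \geq n\)) gives \(k^* = 0\), Setting B gives \(k^* = \infty\), and Setting C gives \(k^* = \Theta(n)\), matching the discussion below.</p>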
<p>To get a sense for what this result means and when it provides a non-vacuous bound, we evaluate it on our five settings.
We assume for simplicity that \(\|\theta^*\| = O(1)\) and \(\sigma = O(1)\).</p>
<ul>
<li>
<p><strong>Setting A.</strong> As long as \(p \geq bn\), \(k^* = 0\). Then, we have that</p>
\[R(\hat{\theta}) \leq O\left(\sqrt{\frac{p}{n}} + \frac{p}{n} + 0 + \frac{n}{p} \right) = O\left(\frac{p}{n}\right).\]
<p>Because \(p \geq bn\), this bound is no good, since it does not become small as \(n\) grows.
Thus, we cannot guarantee benign overfitting when the components all have equal influence on \(y\).
Having equally impactful features leads to a very large effective rank of the entire matrix \(\Sigma\), which blows up the first two terms.</p>
</li>
<li>
<p><strong>Setting B.</strong> Because \(r_{k}(\Sigma) = 2\) for all \(k\), there is no choice of \(k\) with \(r_k(\Sigma) \geq bn\) (when \(n\) is large, which it should be) and \(k^* = \infty\). The \(\frac{k^*}{n}\) term renders the bound completely useless.</p>
</li>
<li>
<p><strong>Setting C.</strong> In this case, \(k^* = \Theta(n)\) and \(R_{k^*}(\Sigma) = \Theta(n)\). This yields the following risk bound:</p>
\[R(\hat{\theta}) \leq O\left( \sqrt{\frac{1}{n}} + \frac{1}{n} + 1 + 1 \right) = O(1).\]
<p>Again, this bound is vacuous, since it does not approach zero as \(n\) increases.
Too sharp of a decay in the variances of the coordinates of \(x\) leads to too large a choice of \(k^*\), which prevents the final two terms from decaying to zero.
There must be fewer than \(n\) “significant features” to keep the third term from staying large.</p>
<p><em>Note: Clearly, a rate of \(O(\frac{1}{\log n})\) (which we’ll obtain in Settings D and E) is not the greatest thing in the world, since it will decay very slowly as \(n\) grows. However, we’re primarily interested in the asymptotic case right now, asking whether the model trends towards zero excess risk as \(n\) becomes arbitrarily large, so this is okay for this context.</em></p>
</li>
<li>
<p><strong>Setting D.</strong> \(k^* = \Theta(n / \log n)\), so \(R_{k^*}(\Sigma) = \Theta(n \log n)\). Plugging this in gives the first non-trivial bound on risk:</p>
\[R(\hat{\theta}) \leq O\left(\sqrt{\frac{1}{n}} + \frac{1}{n} + \frac{1}{\log n} + \frac{1}{\log n} \right) = O\left(\frac{1}{\log n}\right).\]
<p>It’s apparent that the bound <em>can</em> guarantee a risk that approaches zero as \(n\) approaches infinity in the infinite-dimensional regime, as long as the variances decay slowly, but still just quickly enough that their sum converges. (If \(\lambda_i = \frac{1}{i}\), then \(r_0(\Sigma) = \infty\) because the sum of the diagonals of \(\Sigma\) diverges.)</p>
</li>
<li>
<p><strong>Setting E.</strong> \(k^* = \frac{n}{\log n}\) and \(R_{k^*} = \Theta(n \log n)\). Then,</p>
\[R(\hat{\theta}) \leq O\left(\sqrt{\frac{1}{\log n}} + \frac{1}{\log n} + \frac{1}{\log n} + \frac{1}{\log n} \right) = O\left(\frac{1}{\log n}\right).\]
<p>This gets a similar error bound to Setting D, without requiring infinitely many features.</p>
</li>
</ul>
<p>As illustrated by these examples, this bound imposes several conditions that need to be met for benign overfitting to occur:</p>
<ol>
<li>Some of the components of \(x\) must have higher influence than the others, which is necessary to bound the effective rank of the entire matrix \(\Sigma\). That is, we need \(r_0(\Sigma) = \frac{1}{\lambda_1}\sum_{i=1}^p \lambda_i = o(n)\). Setting A fails this condition.</li>
<li>There must be some separation between “high-impact” and “low-impact” coordinates at \(k^*\), where the low-impact coordinates have high effective rank relative to \(k^*\). For the bound to not be vacuous, we need \(k^* = o(n)\).
In other words, there must be a small number of high-impact coordinates followed by a large number of low-impact coordinates of similar importance. Settings B and C have too low an effective rank \(r_{k^*}(\Sigma)\) relative to \(k^*\).</li>
<li>The other metric of effective rank must be strictly larger than \(n\): \(R_{k^*}(\Sigma) = \omega(n)\). Settings B and C also fail this condition.</li>
</ol>
<p>The next section discusses how this result is proved and how the above conditions come to be.</p>
<h2 id="proof-techniques-for-the-main-result">Proof techniques for the main result</h2>
<p>As was seen in BHX19 last week, the bound on the risk is proved by first decomposing \(R(\hat{\theta})\) into several terms and then bounding each of those terms.
Unlike BHX19, this paper’s main result is a bound that holds with high probability, rather than in expectation.
As a result, most of the building blocks of this proof will be <em>concentration bounds</em>, which show that certain random variables are very close to their expectations with high probability.</p>
<p>They give the risk decomposition in Lemma 7: With probability \(0.997\) over the training data,</p>
\[R(\hat{\theta}) \leq 2\|\Sigma^{1/2}(I - X^T(XX^T)^{-1}X)\theta^*\|^2 + O(\sigma^2 \text{tr}(C)),\]
<p>where \(C = (XX^T)^{-1} X \Sigma X^T (X X^T)^{-1}\).
If the term “bias-variance decomposition” means anything to you, that’s what’s happening here:
The first term represents the <em>bias</em> of the best-possible classifier given the data, while the second term corresponds to the variance of the classifier, given the fact that the labels are affected by noise \(\epsilon\).</p>
<p>The proof of this statement occurs in the appendix and is a fairly standard argument, not unlike what was seen in BHX19 last week.</p>
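<p>Both terms of the decomposition can be computed directly on simulated data. In the sketch below (the sizes and the illustrative diagonal \(\Sigma\) are arbitrary choices of mine, not taken from the paper), the bias term uses the projection onto the row span of \(X\), and \(C\) is formed with two symmetric solves rather than explicit inverses:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 20, 60, 0.1                        # hypothetical sizes
lam = 1.0 / np.arange(1, p + 1) ** 2             # an illustrative diagonal Sigma
X = rng.normal(size=(n, p)) * np.sqrt(lam)       # rows x_i ~ N(0, Sigma)
theta_star = np.ones(p) / np.sqrt(p)

A = X @ X.T
P = X.T @ np.linalg.solve(A, X)                  # projection onto span{x_1, ..., x_n}
bias = np.sum(lam * ((np.eye(p) - P) @ theta_star) ** 2)

M = (X * lam) @ X.T                              # X Sigma X^T
C = np.linalg.solve(A, np.linalg.solve(A, M).T)  # (XX^T)^{-1} X Sigma X^T (XX^T)^{-1}
variance = sigma ** 2 * np.trace(C)
```

<p>The two solves use the symmetry of \(X X^T\) and \(X \Sigma X^T\), which avoids explicitly inverting the Gram matrix.</p>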
<h3 id="bias-term">Bias term</h3>
<p>Before bounding the bias term, we break it down into more manageable pieces to understand why this corresponds to the bias of the model.</p>
<ul>
<li>\(X^T (X X^T)^{-1} X \theta^*\) is the projection of the true parameter vector \(\theta^* \in \mathbb{R}^p\) onto the span of \(x_1, \dots, x_n \in \mathbb{R}^p\).
That is, it’s the linear combination \(v = \sum_{i=1}^n a_i x_i\) minimizing \(\|v - \theta^*\|\).</li>
<li>\((I - X^T(XX^T)^{-1}X)\theta^*\) then corresponds to \(v - \theta^*\), so this vector has higher magnitude if the projection of \(\theta^*\) onto the span of the rows of \(X\) is far from \(\theta^*\).
In general, as \(p\) becomes proportionally larger than \(n\), this vector will become larger because an \(n\)-dimensional subspace will make up a very small subset of \(\mathbb{R}^p\).</li>
<li>\(\Sigma^{1/2}(I - X^T(XX^T)^{-1}X)\theta^*\) rescales this vector based on the variances of the coordinates.
In other words, this transform down-weights the components of the parameter vector that correspond to “low-impact” coordinates that will be small anyways.</li>
<li>\(\|\Sigma^{1/2}(I - X^T(XX^T)^{-1}X)\theta^*\|^2\) obtains the squared magnitude of the preceding vector.</li>
</ul>
<p>Since \(\hat{\theta} = X^T (X X^T)^{-1} y\), it must be the case that \(\hat{\theta}\) lies in the span of the rows of \(X\) as well.
Thus, the above quantity is an upper bound on how closely \(\hat{\theta}\) can correspond to \(\theta^*\) given these restrictions on the space where \(\hat{\theta}\) can lie.</p>
<p>Lemma 35 bounds the quantity with probability 0.997 by applying standard concentration bounds with the bounds on the effective rank of \(\Sigma\) captured by \(r_0(\Sigma)\):</p>
\[\|\Sigma^{1/2}(I - X^T(XX^T)^{-1}X)\theta^*\|^2 = O\left( \|\theta^*\|^2 \lambda_1 \left(\sqrt{\frac{r_0(\Sigma)}{n}} + \frac{r_0(\Sigma)}{n} \right) \right).\]
<h3 id="variance-term">Variance term</h3>
<p>The matrix \(C = (XX^T)^{-1} X \Sigma X^T (X X^T)^{-1}\) is a bit difficult to make sense of.
Roughly, the trace of this matrix will be small when conditions (2) and (3) are met: the existence of many coordinates of \(x\) of comparable variances.
In that case, we don’t expect to be hurt much by the noise \(\epsilon\) because it will distribute relatively easily among the comparable coordinates.
If they are <em>not</em> comparable, then the effect of the noise cannot “average out” by being dispersed over a lot of similar coordinates.
Instead, the noise will dominate the coordinates with low variance, while the coordinates with high variance will not be numerous enough to prevent the noise from corrupting the population of high variance coordinates.</p>
<p>This intuition is encoded by Lemma 11, which bounds its trace with probability 0.997 based on the diagonals of \(\Sigma\):</p>
\[\text{tr}(C) = O\left( \frac{k^*}{n} + \frac{n \sum_{i > k^*} \lambda_i^2}{(\sum_{i > k^*} \lambda_i)^2} \right) = O\left( \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right).\]
<p>This is proved by a collection of lemmas that apply concentration bounds to matrix products and facts about eigenvalues.</p>
<h3 id="proof-conclusion">Proof conclusion</h3>
<p>We can put these pieces together by using a union bound.
Each of the three inequalities holds with probability 0.997, which means the probability of any of them failing is at most 0.009.
Thus, the theorem statement, which comes from combining them, must hold with probability 0.991.</p>
<h2 id="last-thoughts">Last thoughts</h2>
<p>On a high-level, this paper proves that “benign overfitting” occurs for a narrow sliver of covariance matrices \(\Sigma\), whose variances decay slowly.
This complements BHX19, which demonstrates a similar notion of benign overfitting, but instead considers models that exclude components of the data from the learner.</p>
<p>Both models suggest the existence of a large number of weak features is necessary for this phenomenon to occur.
BHX19 highlights that “scientific feature selection”—where the highest-impact features are chosen by the learner—negates the need for over-parameterization to have bounded risk.
That appears to be true here as well. A large number of weak features are needed for these bounds to hold; however, a “scientific” approach could allow the model to perform well using only the strong features, if their impacts are known a priori.</p>
<p>Next time, I’ll write about <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a>, another paper that proves statements similar to BLLT19’s, but from a somewhat different angle that focuses on describing an intuition for why the variances must decay in a particular way.
Put together, the three papers characterize a range of peculiar instances (e.g. misspecified data, slowly decaying components) where an over-parameterized approach does better than the classical literature suggests.</p>Clayton Sanford[OPML#1] BHX19: Two models of double descent for weak features2021-07-05T00:00:00+00:002021-07-05T00:00:00+00:00http://blog.claytonsanford.com/2021/07/05/bhx19<!-- [BHX19](https://arxiv.org/abs/1903.07571){:target="_blank"} [[OPML#1]](/2021/07/05/bhx19.html){:target="_blank"} -->
<p><em>This is the first of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading and some notation that I’ll often refer back to.</em></p>
<p>This week’s summary will cover <a href="https://arxiv.org/abs/1903.07571">“Two models of double descent for weak features”</a> by Mikhail Belkin, Daniel Hsu (my advisor!), and Ji Xu (a recently graduated student of Daniel’s), which gives clean examples of when the double-descent phenomenon occurs for linear regression problems.
For a high-level overview of double descent, check out the <a href="/2021/07/04/candidacy-overview.html" target="_blank">introductory post</a> for this series, which gives a brief summary of the intuition for this phenomenon with some visuals.
This paper was released concurrently with <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>, <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a>, and <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a>, which prove the occurrence of a similar phenomenon with slightly different setups.
I’ll write about two of those papers, BLLT19 and MVSS19, in the next two posts, which will cover the more subtle differences between the papers.</p>
<p>Under certain circumstances, linear regression models with more parameters than samples (“over-parameterized models”) outperform models with fewer parameters.
As discussed in the overview post, this flies in the face of classical statistical intuition, where the prevailing idea is that a model can only perform well on never-before-seen samples if the model is simple enough to not overfit the data.
They demonstrate this phenomenon for two different regression problems, which they refer to as the <em>Gaussian model</em> and the <em>Fourier model</em>.
We’ll focus on the former in this summary.</p>
<p>The contributions of their paper are roughly summarized by the following plot (taken from their Figure 1), which gives the “double-descent” curve that they prove in their setting.
<img src="/assets/images/2021-06-29-bhx19/double-descent.jpeg" alt="" />
In this example, they consider a setting where \(n = 40\) samples \((x_1, y_1), \dots (x_n, y_n) \in \mathbb{R}^{D} \times \mathbb{R}\) for \(D = 100\) are drawn.
The <em>least-squares</em> linear regression algorithm is used to learn the best linear learning rule using only \(p\) out of \(D\) components of each sample.
As \(p\) increases from 0 to \(n = 40\), the performance of the linear learning rule worsens as the model overfits to the data more dramatically.
This corresponds to the “classical regime” of double-descent, albeit one whose “sweet spot” is at \(p = 0\) and hence never actually experiences the first of the two descents.</p>
<p>The strong performance is because of a peculiarity of the data models used by this paper.
In more realistic settings, we’d expect that there <em>should</em> be a proper descent in the classical regime.
The next section discusses why this model behaves like that.</p>
<p>The interesting behavior occurs when \(p > n\) and the expected risk (which is the mean squared error in this problem setting) of the learning rule improves as \(p\) continues to grow.
Here, all of the training samples can be perfectly fit by the linear learning rule, and the addition of more features as \(p\) grows beyond \(n\) allows the learning rule to become less volatile and reap the benefits of over-parameterization without suffering from the consequences.</p>
<p>This all is very high-level—it’s not clear from the above description how “risk” is defined, how the samples are drawn, and how “volatility” can be quantified.
The next section discusses the Gaussian setting in detail and proves that this phenomenon holds in that case.</p>
<h2 id="the-gaussian-model">The Gaussian model</h2>
<p>Their model draws labeled samples \((x, y) \in \mathbb{R}^D \times \mathbb{R}\) using the following procedure, for some fixed true parameter vector \(\beta \in \mathbb{R}^D\) and noise parameter \(\sigma > 0\).</p>
<ul>
<li>Every component of \(x\) is drawn independently from a standard Gaussian distribution; equivalently, we say that \(x \sim \mathcal{N}(0, I_D)\).</li>
<li>Noise \(\epsilon\) is drawn from a Gaussian distribution: \(\epsilon \sim \mathcal{N}(0, 1)\).</li>
<li>The label \(y\) is determined by combining a “ground truth label” \(x^T \beta\) and noise \(\sigma \epsilon\): \(y = x^T \beta + \sigma \epsilon\).</li>
</ul>
<p>The goal for the learner is to choose some hypothesis parameter vector \(\hat{\beta} \in \mathbb{R}^D\) that has a small expected squared error on unknown data:
\(\mathbb{E}_{(x,y)}[( y - x^T \hat{\beta})^2]\).
(This is the <em>population loss</em> in this setting.)</p>
<p>So far, we have not discussed the role of \(p\), the number of parameters the learner uses to express \(\hat{\beta}\).
This model incorporates \(p\) by giving the learner access to only \(p\) out of \(D\) components of each sample.
That is, for some subset \(T \subseteq [D] := \{1, \dots, D\}\) with \(|T| = p\), the learner is given access to samples \(((x_{1, T}, y_1), \dots, (x_{n, T}, y_n)) \in \mathbb{R}^p \times \mathbb{R}\), where \(x_{i,T}\) is a vector of length \(p\) consisting of elements \(x_{i, j}\) where \(j \in T\).</p>
<p>Because the learner does not have access to the remaining elements \(x_{i, T_c}\), it must come up with the best possible learning rule \(\hat{\beta}\) on the training data that incorporates only those elements.
The learning algorithm does so by choosing \(\hat{\beta}_T\) to best fit the training inputs to their labels according to the squared error and letting \(\hat{\beta}_{T_c} = 0\).
Specifically, the algorithm chooses \(\hat{\beta}_T = X_T^{\dagger} y\) (where \(X_T = [x_{1, T}, \dots, x_{n, T}] \in \mathbb{R}^{n \times p}\), \(y = (y_1, \dots, y_n) \in \mathbb{R}^n\) and \(A^{\dagger}\) is the <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse">pseudoinverse</a> of \(A\)), which finds the lowest-norm parameter vector that best fits the samples:</p>
<ol>
<li>If the training data cannot be perfectly fit (which happens almost surely when \(p < n\)), \(\hat{\beta}_T\) is the parameter vector that best fits the samples: \(\hat{\beta}_T = \arg \min_{\beta_T'} \sum_{i=1}^n (x_{i, T}^T \beta_T' - y_i)^2\).</li>
<li>Otherwise, if there is some \(\beta_T'\) such that \(X_T \beta_T' = y\) (or \(x_{i, T}^T \beta_T' = y_i\) for all \(i\)), then \(\hat{\beta}_T\) is the parameter vector with minimum norm \(\|\beta_T'\|_2\) of all vectors with that property.</li>
</ol>
<p>As we’ll see, case (1) corresponds to the “classical regime” (or the left side of the above curve) and case (2) corresponds to the “interpolation regime” (right side).</p>
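<p>Both cases are handled uniformly by the pseudoinverse. A quick simulation sketch (with the hypothetical sizes \(n = 40\), \(D = 100\), \(\sigma = 0.5\) and the equal-weight \(\beta\) of Setting A below, none of which are prescribed by the paper) shows a large residual in case (1) and essentially zero residual in case (2):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 40, 100, 0.5
beta = np.full(D, D ** -0.5)                 # equal weights, ||beta|| = 1
X = rng.normal(size=(n, D))                  # x_i ~ N(0, I_D)
y = X @ beta + sigma * rng.normal(size=n)

resid = {}
for p in (10, 80):                           # under- vs. over-parameterized
    X_T = X[:, :p]                           # learner sees features T = [p]
    beta_hat_T = np.linalg.pinv(X_T) @ y     # least-squares / min-norm fit
    resid[p] = np.linalg.norm(X_T @ beta_hat_T - y)
```

<p>With \(p = 10 < n\), the pseudoinverse returns the least-squares fit and leaves a substantial residual; with \(p = 80 > n\), it returns the minimum-norm interpolator and the residual vanishes (up to floating point).</p>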
<p>Based on the definition of the Gaussian model, we describe three settings which we’ll return to later to illustrate how the results work and understand what their limitations are.</p>
<ul>
<li><strong>Setting A: All features are equally important.</strong> \(\beta = (D^{-1/2}, \dots, D^{-1/2})\) and \(T = [p]\).
Here, all of the features have equal weight, so we can expect to only have a \(\frac{p}{D}\) proportion of the “useful information” by seeing \(x_{i, T}\), rather than \(x_i\).</li>
<li><strong>Setting B: Some features are more important than others, and we have access to the most informative features.</strong> \(\beta_j = \frac{c}{j}\) for bounded \(c =O(1)\) such that \(\|\beta\|_2 = 1\) and \(T = [p]\).
Features with low indices are much more valuable for the learner to have access to than the others, so small values of \(p\) will still provide the learner with useful information.</li>
<li><strong>Setting C: Some features are more important than others, but we only get a random selection of the features.</strong> \(\beta_j = \frac{c}{j}\) such that \(\|\beta\|_2 = 1\) and \(T \subset [D]\) is uniformly drawn from all subsets of size \(p\).
While low-index features are more valuable, we can’t guarantee that we’ll have those, so the theoretical guarantees will more closely resemble setting A than setting B.</li>
</ul>
<p>One of the strange parts of this setting—which we’ll observe when the theoretical results are applied to setting A—is that the “under-parameterized” classical regime never performs well because very little information is provided at all when \(p\) is small.
This means that in the eye of the learner, the data is <em>very</em> noisy, because the learner knows nothing about the \(x_{T_c}^T \beta_{T_c} + \sigma \epsilon\) components of \(y\). (\(T_c = [D] \setminus T\) is the complement of \(T\).)
As a result, it’s a little bit of an “unfair” setting, where no one can reasonably expect good performance when \(p \ll D\).
As is the case with many results in this space, the double-descent phenomenon requires certain peculiarities of the learning models to be studied in order to be cleanly demonstrated.</p>
<h3 id="the-main-result">The main result</h3>
<p>Theorem 1 gives exactly the expected risk for the learning rule obtained by using least-squares linear regression on the Gaussian model:</p>
\[\mathbb{E}_{(x, y)}[(y - x^T \hat{\beta})^2] = \begin{cases}
(\|\beta_{T_c}\|^2 + \sigma^2) (1 + \frac{p}{n - p - 1}) & \text{if } p \leq n -2;\\
\infty & \text{if } p \in [n-1, n+1]; \\
\|\beta_T\|^2 (1 - \frac{n}{p}) + (\|\beta_{T_c}\|^2 + \sigma^2) (1 + \frac{n}{p - n - 1}) & \text{if } p \geq n+2.
\end{cases}\]
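<p>Theorem 1’s formula is easy to evaluate directly. A sketch that plugs in Setting A’s norms (\(\|\beta_T\|^2 = p/D\), \(\|\beta_{T_c}\|^2 = 1 - p/D\)) with the hypothetical values \(D = 100\), \(n = 40\), \(\sigma = 0.5\), exhibiting both the blow-up near \(p = n\) and the second descent beyond it:</p>

```python
def gaussian_model_risk(bT_sq, bTc_sq, sigma, n, p):
    # Expected squared error of least squares in the Gaussian model (Theorem 1)
    if p <= n - 2:
        return (bTc_sq + sigma ** 2) * (1 + p / (n - p - 1))
    if p >= n + 2:
        return bT_sq * (1 - n / p) + (bTc_sq + sigma ** 2) * (1 + n / (p - n - 1))
    return float("inf")  # the risk diverges for p in [n-1, n+1]

D, n, sigma = 100, 40, 0.5  # hypothetical problem sizes for the demo
curve = [gaussian_model_risk(p / D, 1 - p / D, sigma, n, p) for p in range(D + 1)]
```

<p>Scanning <code>curve</code> reproduces the shape of Figure 1: the risk worsens as \(p\) approaches \(n\) from below, is infinite at \(p = n\), and descends again in the interpolation regime.</p>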
<p>To make this concrete, we compute the risk for each of the three example settings given above.</p>
<ul>
<li>
<p><strong>Setting A:</strong> Because the coordinates of \(\beta\) are identical, we can simplify the bound by noting that \(\|\beta_T\|^2 = \frac{p}{D}\) and \(\|\beta_{T_c}\|^2 = 1-\frac{p}{D}\).</p>
<p>\(\mathbb{E}_{(x, y)}[(y - x^T \hat{\beta})^2] = \begin{cases}
(1 - \frac{p}{D} + \sigma^2) (1 + \frac{p}{n - p - 1}) & \text{if } p \leq n -2;\\
\infty & \text{if } p \in [n-1, n+1]; \\
(1 - \frac{n}{D} (2 - \frac{D - n - 1}{p - n -1})) + \sigma^2 (1 + \frac{n}{p - n - 1}) & \text{if } p \geq n+2.
\end{cases}\)
This follows the same pattern as the above plot.
As \(p\) increases toward \(n - 2\), the second factor blows up faster than the first factor can decrease, so performance worsens even as the model fits the training data better.
When \(p \geq n+2\), increasing \(p\) decreases both terms, which gives the descent in the interpolation regime.</p>
</li>
<li>
<p><strong>Setting B:</strong>
We analyze this setting very roughly, sacrificing precision to explain why a different kind of double-descent curve occurs here.
\(\|\beta_{T_c}\|^2\) can be roughly approximated as follows:</p>
\[\|\beta_{T_c}\|^2 = \sum_{j=p+1}^D \frac{c^2}{j^2} \approx \int_{p}^D \frac{c^2}{z^2} dz = -\frac{c^2}{z} \bigg\lvert_{p}^D = \frac{c^2}{p} - \frac{c^2}{D} = \Theta\left(\frac{1}{p}\right).\]
<p>Now, we instead get the following risk, expressed in asymptotic notation:</p>
\[\mathbb{E}_{(x, y)}[(y - x^T \hat{\beta})^2] = \begin{cases}
\Theta((\frac{1}{p} + \sigma^2) (1 + \frac{p}{n - p - 1})) & \text{if } p \leq n -2;\\
\infty & \text{if } p \in [n-1, n+1]; \\
\Theta((1 - \frac{n}{p}) + (\frac{1}{p} + \sigma^2) (1 + \frac{n}{p - n - 1})) & \text{if } p \geq n+2.
\end{cases}\]
<p>The authors plot the resulting risk curve:
<img src="/assets/images/2021-06-29-bhx19/double-descent-choice.jpeg" alt="" /></p>
<p>This tells a slightly different story.
Because the risk also approaches \(\infty\) as \(p\) approaches zero, there is now a “sweet spot” where the risk is minimized on the left side of the curve.
This resembles a more “traditional” descent curve, where double descent occurs, but where the risk is higher in the interpolation regime.
The authors explain that this difference is accounted for by a “scientific” feature selection model, which means that the benefits of interpolation are only fully reaped when the algorithm designer does not have the ability to cherry-pick the most informative features.
In other words, it’s possible to obtain a good model with few features in the classical regime if we can ensure that the chosen features have more bearing on labels \(y\) than the other features.</p>
</li>
<li>
<p><strong>Setting C:</strong>
This setting has an identical expected risk to that of Setting A due to the random feature selection, since \(\mathbb{E}[\|\beta_T\|^2] = \frac{p}{D}\).
Therefore, the extreme double-descent case detailed in that setting can still occur even when different components of \(x\) have impacts on \(y\) that differ by orders of magnitude, as long as the most informative features cannot be deliberately chosen.
This illustrates that the benefits of Setting B can only be reaped when the algorithm designer can “scientifically” choose the best features.</p>
</li>
</ul>
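<p>To see the contrast between the settings concretely, we can plug Settings A and B into the closed-form risk from Theorem 1. The snippet below is my own illustration (not the authors’ code); it checks that taking all \(D\) features beats every classical choice of \(p\) in Setting A, while the classical sweet spot wins in Setting B.</p>

```python
import numpy as np

def bhx_risk(n, p, D, sigma, beta):
    """Piecewise expected risk from Theorem 1 (infinite for p near n)."""
    bT2 = beta[:p] @ beta[:p]
    bTc2 = beta[p:] @ beta[p:]
    if p <= n - 2:
        return (bTc2 + sigma**2) * (1 + p / (n - p - 1))
    if p >= n + 2:
        return bT2 * (1 - n / p) + (bTc2 + sigma**2) * (1 + n / (p - n - 1))
    return np.inf

D, n, sigma = 100, 40, 0.1
beta_A = np.ones(D) / np.sqrt(D)        # Setting A: equal weights
beta_B = 1.0 / np.arange(1, D + 1)      # Setting B: beta_j proportional to 1/j
beta_B /= np.linalg.norm(beta_B)

for name, beta in [("A", beta_A), ("B", beta_B)]:
    classical = min(bhx_risk(n, p, D, sigma, beta) for p in range(0, n - 1))
    interpolating = bhx_risk(n, D, D, sigma, beta)  # take all D features
    print(name, classical, interpolating)
# Setting A: no classical p beats taking every feature (p = D).
# Setting B: the sweet spot in the classical regime beats p = D.
```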
<p>A notable weakness of Theorem 1 is that the results are about the expected squared loss, rather than a high-probability guarantee about what the risk will actually be.
Theorem 2 offers an improvement by giving concentration bounds on \(\|\beta - \hat{\beta}\|^2\).
We won’t go into that in this blog post, but these kinds of bounds will be seen in other papers discussed in future posts.</p>
<h3 id="proof-techniques">Proof techniques</h3>
<p>The proof of Theorem 1 can be broken down into several manageable steps, which this section will summarize at a high level.
This part will be somewhat more jargon-y than the rest of the blog post, so feel free to skim it if it’s not of interest.</p>
<p>Unlike other proofs we’ll see later on, this proof primarily relies on linear algebraic tricks related to orthogonality to exactly compute the expected value of various norms.
There is no need for much in the way of probabilistic trickery, because this bound holds in expectation rather than with high probability.</p>
<p>To prove the bound, the expected risk can be partitioned into three distinct terms by expanding the square, plugging in \(y = x^T \beta + \sigma \epsilon\), and noting that \(\hat{\beta}_{T_c} = 0\):</p>
\[\mathbb{E}[(y - x^T \hat{\beta})^2] = \sigma^2 + \|\beta_{T_c}\|^2 + \mathbb{E}[\| \beta_T - \hat{\beta}_T\|^2].\]
<p>This tells us that all error for this problem must come from one of three sources, each corresponding to a term: (1) the noisy component of \(y\), \(\sigma \epsilon\); (2) the components of the parameter vector that cannot be determined due to the learner’s ignorance of \(x_{T_c}\), \(\|\beta_{T_c}\|^2\); and (3) the gap between the true parameters and the estimated parameters on the components that the learner is provided, \(\mathbb{E}[\| \beta_T - \hat{\beta}_T\|^2]\).</p>
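<p>This decomposition is easy to verify empirically. In the sketch below (my own code, with my own variable names), we fit the minimum-norm interpolator once, estimate its risk directly on fresh samples, and compare that estimate to the three-term sum.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, D, sigma = 10, 20, 40, 0.5
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)

# Train min-norm least squares on the first p of the D features.
X = rng.standard_normal((n, D))
y = X @ beta + sigma * rng.standard_normal(n)
bhat = np.linalg.pinv(X[:, :p]) @ y

# Direct Monte Carlo estimate of the risk on fresh samples.
m = 100_000
Xf = rng.standard_normal((m, D))
yf = Xf @ beta + sigma * rng.standard_normal(m)
mc_risk = np.mean((yf - Xf[:, :p] @ bhat) ** 2)

# Three-term decomposition: noise + hidden features + estimation gap.
decomposed = sigma**2 + beta[p:] @ beta[p:] + np.sum((beta[:p] - bhat) ** 2)
print(mc_risk, decomposed)  # the two agree up to Monte Carlo error
```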
<p>It suffices to analyze the third term, which can also be written as \(\mathbb{E}[\| \beta_T - X_T^\dagger y\|^2]\).
The analysis then splits into two cases, one where \(p \leq n\) and the other where \(p > n\), because the pseudoinverse \(X_T^\dagger\) takes a different form in each: \(X_T^T(X_T X_T^T)^{-1}\) if \(X_T \in \mathbb{R}^{n \times p}\) is a wide “hot dog” matrix with \(n \leq p\), and \((X_T^T X_T)^{-1} X_T^T\) if \(X_T\) is a tall “hamburger” matrix with \(n \geq p\).</p>
<p><em>Note: Because we’re dealing with Gaussian data, we don’t need to worry about issues related to the matrix \(X\) not being full-rank.
The \(n\) samples will almost surely span a space of dimension \(n\) if \(n \leq p\) and \(p\) otherwise.
If we were drawing samples from a discrete distribution (e.g. uniform over \(\{-1, 1\}^D\)), then we’d need to consider the event where the samples are linearly dependent.</em></p>
<p>We only consider the interpolation case with \(p > n\) here, because the classical case has been well-understood for decades, and the authors refer readers to older works.
Given the definition of the pseudoinverse, the difference between the two weight vectors can be decomposed into two terms:</p>
\[\beta_T - \hat{\beta}_T = (I - X_T^T(X_T X_T^T)^{-1}X_T)\beta_T - X_T^T(X_T X_T^T)^{-1} \eta,\]
<p>where \(\eta = y - X_T \beta_T\).
Note that the two terms must be orthogonal to one another:</p>
<ul>
<li>The first term can be written as \(\beta_T - \Pi_T \beta_T\), where \(\Pi_T\) is an orthogonal projection operator onto the rowspace of \(X_T\). Thus, this vector must lie in the null space of \(X_T\).</li>
<li>The second must lie in the row space of \(X_T\), since it includes a multiplication by \(X_T\).</li>
</ul>
<p><img src="/assets/images/2021-06-29-bhx19/orth.jpeg" alt="" /></p>
<p>Therefore, the two terms are orthogonal, which means that the Pythagorean theorem can be used to break down the squared norm into two terms:</p>
\[\|\beta_T - \hat{\beta}_T\|^2 = \|\beta_T - \Pi_T \beta_T\|^2 + \|X_T^T(X_T X_T^T)^{-1} \eta\|^2.\]
<p>The first term can then be broken up into \(\|\beta_T\|^2 - \|\Pi_T \beta_T\|^2\), again by the Pythagorean Theorem.
Applying an expectation, we get \(\mathbb{E}[\|\beta_T\|^2 - \|\Pi_T \beta_T\|^2] = (1 - \frac{n}{p}) \|\beta_T\|^2\).</p>
<p>The second term can be shown to have an expectation of \((\|\beta_{T_c}\|^2 + \sigma^2) \frac{n}{p-n-1}\) by using properties of the <a href="https://en.wikipedia.org/wiki/Inverse-Wishart_distribution">Inverse-Wishart distribution</a>.</p>
<p>Plugging these pieces into the initial decomposition of the expected risk gives the theorem.</p>
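<p>The algebraic steps above can be checked in a few lines. The following sketch (my own, not from the paper) confirms the decomposition of \(\beta_T - \hat{\beta}_T\), the orthogonality of its two pieces, and the resulting Pythagorean identity for one random draw of Gaussian data.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, D, sigma = 10, 20, 40, 0.5
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, D))
y = X @ beta + sigma * rng.standard_normal(n)

XT, bT = X[:, :p], beta[:p]
eta = y - XT @ bT                            # hidden features + noise
bhat = XT.T @ np.linalg.solve(XT @ XT.T, y)  # min-norm solution (p > n)

# The two pieces of beta_T - bhat_T:
term1 = bT - XT.T @ np.linalg.solve(XT @ XT.T, XT @ bT)  # (I - Pi_T) beta_T
term2 = XT.T @ np.linalg.solve(XT @ XT.T, eta)           # in the row space

print(np.allclose(bT - bhat, term1 - term2))  # the claimed decomposition
print(abs(term1 @ term2))                     # ~0: the pieces are orthogonal
lhs = (bT - bhat) @ (bT - bhat)
rhs = term1 @ term1 + term2 @ term2
print(np.isclose(lhs, rhs))                   # Pythagorean theorem
```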
<h2 id="fourier-series-model">Fourier series model</h2>
<p>The second part of their results in Section 3 focus on a model where the samples \(x\) are rows of a Fourier transform matrix, which are orthogonal to one another.
As before, the authors give the learner only \(p\) of the \(D\) dimensions of each row to make the model under-specified.
They’re similarly able to show a sharp difference between the classical and interpolation regimes, with a risk curve resembling the plot at the beginning of this post.
Unlike the Gaussian case, these results hold in the limit, as \(n\), \(p\), and \(D\) all go to infinity, but the ratios \(\rho_n = \frac{n}{D}\) and \(\rho_p = \frac{p}{D}\) are kept fixed.</p>
<h2 id="future-directions--unanswered-questions">Future directions / unanswered questions</h2>
<p>The key contribution of this paper was to show the existence of a simple setting where the least-squares linear regression algorithm exhibits double-descent and performs best when the number of model parameters \(p\) is much larger than the number of samples \(n\).
The simplicity of this paper’s setting leaves open questions about how broadly this phenomenon extends beyond these toy examples.
The following questions about the generality of the results can be posed:</p>
<ul>
<li>Do interpolating models only succeed in “misspecified” settings like this one, where the learner is only given access to a small fraction of the relevant features?</li>
<li>How do these results extend to data distributions that are not Gaussian? (e.g. what if we instead assume that components of \(x\) have subgaussian tails, or if there can be some dependence between components? What if \(\epsilon\) is not necessarily drawn from a Gaussian distribution?)</li>
<li>Is there something special about mean squared error, or does this phenomenon also occur when different loss functions are used?</li>
<li>Will the best results always be found in the classical regime when “scientific feature selection” is used?</li>
</ul>
<p><em>Thanks for reading this blog post! If you have any feedback, feel free to comment it below or email me! All feedback is appreciated. Stick around for more on over-parameterization and when it provably works.</em></p>
<p>Clayton Sanford. <a href="http://blog.claytonsanford.com/2021/07/04/candidacy-overview" target="_blank">[OPML#0] A series of posts on over-parameterized machine learning models</a>. 2021-07-04.</p>
<p><em>Hello, and welcome to the blog! I’ve been wanting to start this for a while, and I’ve finally jumped in. This is the introduction to a series of blog posts I’ll be writing over the course of the summer and early in the fall. I hope these posts are informative, and I welcome any feedback on their technical content and writing quality.</em></p>
<p>I’ve recently finished the second year of my computer science PhD program at Columbia, in which I study the overlap between theoretical computer science and machine learning.
Over the next few months, I’m going to read a lot of papers about <em>over-parameterized</em> machine learning models—which have a much larger number of parameters than the number of samples.
I’ll write summaries of them on the blog, with the goal of making this line of work accessible to people in and out of my research community.</p>
<h2 id="why-am-i-doing-this">Why am I doing this?</h2>
<p>While reading papers is a lot of fun (sometimes), I’m kind of required to read this set.
CS PhD students at Columbia must take a <a href="https://www.cs.columbia.edu/education/phd/requirements/candidacy/" target="_blank">candidacy exam</a> sometime in their third (or occasionally fourth) year, which requires the student to read 20-25 papers in their subfield in order to better understand what is known and what questions are asked in their research landscape.
It culminates with an oral examination, where several faculty members question the student on the papers and future research directions in the subfield.</p>
<p>I’m starting the process of reading the papers now, and I figured that it wouldn’t be such a bad idea to write about what I learn, so that’s what this is going to be.</p>
<h2 id="why-this-research-area">Why this research area?</h2>
<blockquote>
<p>Also, what even is an “over-parameterized machine learning model?”</p>
</blockquote>
<p>The core motivation of all of my graduate research is to understand why deep learning works so well in practice.
For the uninitiated, deep learning is a family of machine learning models that uses complex hierarchical neural networks to represent complicated functions.
(If you’re more familiar with theoretical CS than machine learning, you can think of a neural network as a circuit with continuous inputs and outputs.)
In the past decade, deep learning has been applied to tasks like object recognition, language translation, and game playing with wild degrees of success.
However, the theory of deep learning has lagged far behind these practical successes.
This means that we can’t answer simple questions like the following in a mathematically precise way:</p>
<ul>
<li>“How well do we expect this trained model to perform on new data?”</li>
<li>“What are the most important factors in determining how a sample is classified?”</li>
<li>“How will changing the size of the model affect model performance?”</li>
</ul>
<p>ML theory researchers have formulated those questions mathematically and produced impressive results about performance guarantees for broad categories of ML models.
However, these don’t apply very well to deep learning.
In order to get a mathematical understanding of the kinds of questions these researchers ask, I define some terminology about ML models, which I refer back to when describing why theoretical approaches tend to fall short for deep neural networks.</p>
<p>As a motivating example, consider a prototypical toy machine learning problem: training a classifier that distinguishes images of cats from images of dogs.</p>
<ul>
<li>One does so by “learning” a function \(f_\theta: \mathbb{R}^d \to \mathbb{R}\) that takes as input the values of pixels of a photo \(x\) (which we can think of as a \(d\)-dimensional vector) and has the goal of returning \(f_\theta(x) = 1\) if the photo contains a dog and \(f_\theta(x) = -1\) if the photo contains a cat.</li>
<li>
<p>In particular, if we assume that the pixels \(x\) and labels \(y\) are drawn from some probability distribution \(\mathcal{D}\), then our goal is to find some parameter vector \(\theta \in \mathbb{R}^p\) such that the <em>population error</em> \(\mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f_\theta(x), y)]\) is small, where \(\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+\) is some <em>loss function</em> that we want to minimize.
(For instance, the <em>squared loss</em> is \(\ell(\hat{y}, y) = (\hat{y} - y)^2\).)</p>
<p><em>Notationally, we let \(\mathbb{R}_+\) represent all non-negative real numbers, and \(\mathbb{E}\) be the expectation, where \(\mathbb{E}_{x \sim \mathcal{D}}[g(x)] = \sum_{x} \text{Pr}_{x'\sim \mathcal{D}}[x' = x] g(x)\) when \(\mathcal{D}\) is discrete; for continuous distributions, the sum becomes an integral.</em></p>
</li>
<li>\(f_\theta\) is parameterized by the \(p\)-dimensional vector \(\theta\), and <em>training</em> the model is the process of choosing a value of \(\theta\) that we expect to perform well and have small population error.
In the case of deep learning, \(f_\theta\) is the function produced by computing the output of a neural network with connection weights determined by \(\theta\).
In simpler models like linear regression, \(\theta\) directly represents the weights of a linear combination of the inputs: \(f_\theta(x) = \theta^T x\).</li>
<li>This training process occurs by observing \(n\) training samples \((x_1, y_1), \dots, (x_n, y_n)\) and choosing a set of parameters \(\theta\) such that \(f_\theta(x_i) \approx y_i\) for all \(i= 1, \dots, n\).
That is, we find a vector \(\theta\) that yields a small <em>training error</em> \(\frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i)\).
The hope is that the learning rule will <em>generalize</em>, meaning that a small training error will lead to a small population error.</li>
</ul>
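<p>To make this setup concrete, here is a minimal toy instance of it: least-squares linear regression with the squared loss, where the population error is estimated on fresh samples. All distributional choices and variable names here are my own illustration, not part of any paper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 5, 200, 0.1
theta_star = rng.standard_normal(d)  # ground-truth parameters (my modeling choice)

# n labeled training samples (x_i, y_i) with Gaussian features and label noise.
X = rng.standard_normal((n, d))
y = X @ theta_star + noise * rng.standard_normal(n)

# "Training": pick theta minimizing the training error for f_theta(x) = theta^T x.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

sq_loss = lambda yhat, ytrue: (yhat - ytrue) ** 2
train_err = np.mean(sq_loss(X @ theta, y))

# Estimate the population error with fresh samples from the same distribution.
Xf = rng.standard_normal((100_000, d))
yf = Xf @ theta_star + noise * rng.standard_normal(100_000)
pop_err = np.mean(sq_loss(Xf @ theta, yf))
print(train_err, pop_err)  # both near the noise floor noise**2, since n >> d
```

Here \(n \gg p\), so the learned rule generalizes: both errors land near the irreducible noise level.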
<p>This is an over-simplified model that excludes broad categories of ML.
It pertains to batch supervised regression problems, where the data provided are labeled, all data is given at once, and the labels are real numbers.
While there’s a broad array of topics that we could discuss, we focus on this simple setting in order to motivate the line of research without introducing too much complexity.</p>
<h3 id="classical-learning-theory">Classical learning theory</h3>
<p>Over the years, statisticians and ML theorists have studied the conditions necessary for a trained model to perform well on new data.
They developed an elegant set of theories to explain when we should expect good performance to occur.
The core idea is that more complex models require more data; without enough data, the model will pick up only spurious correlations and noise, learning nothing of value.</p>
<p>To think about this mathematically, we decompose the population error term into two terms—the training error and the generalization error—and analyze how they change with the model complexity \(p\) and the number of samples \(n\).</p>
\[\underbrace{\mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f_\theta(x), y)]}_{\text{population error}} = \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i)}_{\text{training error}} + \underbrace{\left(\mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f_\theta(x), y)] - \frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i) \right).}_{\text{generalization error}}\]
<p>This classical model gives rigorous mathematical bounds that specify when the two components should be small.
The below image represents the core trade-off that this principle implies; you can find something like it in most ML textbooks.
<img src="/assets/images/2021-06-15-candidacy-overview/classical-err.jpeg" alt="" /></p>
<ul>
<li>
<p>If I choose a very small number of parameters \(p\) relative to the number of samples \(n\), then my model will perform poorly because it’s too simplistic.
There will be no function \(f_\theta\) that can classify most of the training data.
The training error will be large for any choice of \(\theta\), even though the generalization error is small.
<img src="/assets/images/2021-06-15-candidacy-overview/samples1.jpeg" alt="" /></p>
</li>
<li>
<p>There’s then a “sweet spot” for \(p\), where it’s large enough to capture the complexity of the data distribution, but not too large to overfit. Here, we have a small training error <em>and</em> a small generalization error.
<img src="/assets/images/2021-06-15-candidacy-overview/samples2.jpeg" alt="" /></p>
</li>
<li>
<p>If I choose \(p\) to be large, then I can expect <em>overfitting</em> to occur, where the model has a training error near zero, but the generalization error is very large.
In this setting, the model performs poorly because it only memorizes the data, without actually learning the underlying trend.
<img src="/assets/images/2021-06-15-candidacy-overview/samples3.jpeg" alt="" /></p>
</li>
</ul>
<p>Classical learning theory offers several tools (like <a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension" target="_blank">VC-dimension</a> and <a href="https://en.wikipedia.org/wiki/Rademacher_complexity" target="_blank">Rademacher complexity</a>) to quantify the complexity of a model and provide guarantees about how well we expect a model to perform.
Based on this theory, <em>over-parameterized models</em> (which have \(p \gg n\)) are expected to land in the Very Bad third regime above, with lots of overfitting and a very large generalization error.
However, models of this form often perform much better than this theory anticipates.
As a result, there’s a push to develop new theory that better captures what happens when we have more parameters than samples.</p>
<h3 id="the-gap-between-theory-and-practice">The gap between theory and practice</h3>
<p>The most prominent case where the classical model fails to explain good performance is for deep learning.
Deep neural networks are typically trained with large quantities of data, but they also have more than enough parameters to perfectly fit that data; often, there are more parameters than samples.
For instance, the <a href="https://paperswithcode.com/sota/image-classification-on-imagenet" target="_blank">state-of-the-art image classifier</a> (as of July 2021) for the standard ImageNet classification task uses roughly 1.8 billion parameters when trained on roughly 14 million training samples, an over-parameterization by a factor of 100.
Then, they’re typically trained to obtain zero training error—and purposefully overfit the data—using gradient descent.</p>
<p>Standard approaches to proving generalization bounds provide a bleak picture. For instance:</p>
<ul>
<li>The VC-dimension of a neural network with \(p\) different weights, \(L\) layers, and binary output is <a href="https://arxiv.org/pdf/1703.02930.pdf" target="_blank">known</a> to be \(O(p L \log p)\), which implies that it can be learned with \(O(pL \log p)\) samples. This only guarantees success when there are more samples than parameters.</li>
<li><a href="https://arxiv.org/abs/1706.08498" target="_blank">BFT17</a> and <a href="https://arxiv.org/abs/1712.06541" target="_blank">GRS17</a> give generalization error bounds on deep neural networks based on the magnitudes of the weights. However, these bounds grow exponentially with the depth of the network unless the weights are forced to be much smaller than they are in neural networks that succeed in practice.</li>
</ul>
<p>These approaches give no reason to suspect that the generalization error will be small at all in realistic neural networks with a large number of parameters and unrestricted weights.
But neural networks <em>do</em> generalize to new data in practice, which leaves open the question of why that works.</p>
<p>This gap between application and theory indicates that there are more phenomena that are not currently accounted for in our theoretical understanding of deep learning.
This also means that much of the practical work in deep learning is not informed at all by theoretical principles.
Training neural networks has been dubbed “<a href="https://www.youtube.com/watch?v=ORHFOnaEzPc" target="_blank">alchemy</a>” rather than science because the field is built on best practices that were learned by tinkering and is not understood on a first-principles level.
Ideally, an explanatory theory of how neural networks work could lead to more informed practice and eventually, more interpretable models.</p>
<h3 id="over-parameterization-and-double-descent">Over-parameterization and double-descent</h3>
<p>This series of posts is (mostly) not about deep learning.
Deep neural networks are notoriously difficult to understand mathematically, so most researchers who attempt to do so (including yours truly) must instead study highly simplified variants.
For instance, <a href="https://arxiv.org/abs/2102.02336" target="_blank">my paper</a> about the approximation capabilities of neural networks with random weights only applies to networks of depth two, because anything else is too complex for our methodology to characterize.
So instead of studying deep neural networks, we’ll consider a broader family of ML methods (in particular, linear methods like <em>least-squares regression</em> and <em>support vector machines</em>) when they’re in over-parameterized settings with \(p \gg n\).
My hope is that similar principles will explain both the successes of simple over-parameterized models and more complicated deep neural networks.</p>
<p>Broadly, this line of work challenges the famous curve above about the tradeoffs between model complexity and generalization.
It does so by suggesting that increasing the complexity of a model (or the number of parameters) can lead to situations where the generalization error decreases once more.
This idea is referred to <em>double descent</em> and was popularized by <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">my advisor</a> and his collaborators in papers like <a href="https://arxiv.org/abs/1806.05161">BHM18</a> and <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a>.
This augments what is referred to as the <em>classical regime</em>—where \(p \leq n\) and choosing the right model is equivalent to choosing the “sweet spot” for \(p\)—with the <em>interpolation regime</em>.
Interpolation means roughly the same thing as overfitting; it describes methods that bring the training loss to zero.</p>
<p><img src="/assets/images/2021-06-15-candidacy-overview/double-err.jpeg" alt="" /></p>
<p>In the interpolation regime, \(p \gg n\) and the learning algorithm selects a hypothesis that perfectly fits the training data (i.e. the training error is zero) and is somehow “smooth,” which leads to some other kind of “simplicity” that then yields a good generalization error.
Here’s a way to think about it:</p>
<ul>
<li>When \(p \approx n\), it’s likely that there’s exactly one or very few candidate functions \(f_\theta\) that perfectly fit the data, and we have no reason to expect that this function won’t be overly “bumpy” and fail to learn any underlying pattern. (See image (3) above.)</li>
<li>Instead, if \(p \gg n\), then there will be many hypotheses to choose from that have a training error of zero.
If the algorithm is somehow biased in favor of “smooth” hypotheses, then it’s more likely to pick up on the underlying structure of the data.
<img src="/assets/images/2021-06-15-candidacy-overview/samples4.jpeg" alt="" /></li>
</ul>
<p>Of course, this is a very hand-wavy way to describe what’s going on.
Also, it’s not always the case that having \(p \gg n\) leads to a nice situation like the one in image (4).
This kind of success only occurs when the learning algorithm and training distribution meet certain properties.
Through this series of posts, I’ll make this more precise and describe the specific mechanisms that enable good generalization in these settings.</p>
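<p>Here is a minimal sketch of that intuition (my own construction, not from any of the papers below): in noiseless over-parameterized linear regression, the minimum-norm interpolator (which is also what gradient descent initialized at zero converges to for linear least squares) and a perturbed interpolator both fit the training data exactly, but the minimum-norm one is far closer to the true parameters, and hence has far smaller population error.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100
theta_star = np.zeros(p)
theta_star[:5] = 1.0                 # a simple underlying signal (my choice)
X = rng.standard_normal((n, p))
y = X @ theta_star                   # noiseless labels for clarity

# The minimum-norm interpolator.
theta_mn = np.linalg.pinv(X) @ y

# Another perfect interpolator: perturb along a null-space direction of X.
z = rng.standard_normal(p)
z -= np.linalg.pinv(X) @ (X @ z)     # project z onto the null space of X
theta_other = theta_mn + 3.0 * z / np.linalg.norm(z)

# Both have (numerically) zero training error...
print(np.max(np.abs(X @ theta_mn - y)), np.max(np.abs(X @ theta_other - y)))

# ...but for isotropic Gaussian x, the excess population error is
# ||theta - theta_star||^2, and the minimum-norm ("smoothest") choice
# is far closer to the truth.
risk_mn = np.sum((theta_mn - theta_star) ** 2)
risk_other = np.sum((theta_other - theta_star) ** 2)
print(risk_mn, risk_other)
```

Both hypotheses interpolate, so training error alone cannot distinguish them; the algorithm’s bias toward the smaller-norm solution is what does the work.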
<h2 id="what-will-be-covered">What will be covered?</h2>
<p>This series aims to have both breadth and depth.
I’ll explore a wide range of settings where over-parameterized models perform well and then hone in on linear regression to understand the literature on a very granular level.
The following are topics and papers that I’ll read and write about.
I’ll add links to the corresponding blog posts once they exist.</p>
<p>This list is subject to change, especially in the next few weeks.
If you, the reader, think there’s anything important missing, please let me know!</p>
<ul>
<li><strong>Double-descent in linear models:</strong>
These papers — which study over-parameterization in one of the simplest domains possible — will be the main focus of this survey.
<ul>
<li><em>Least-squares regression:</em>
<ul>
<li><strong><a href="https://link.springer.com/article/10.1007/s10208-006-0196-8" target="_blank">CD07</a>.</strong> Caponnetto and De Vito. “Optimal rates for the regularized least-squares algorithm.” 2007.</li>
<li><strong><a href="https://arxiv.org/abs/1010.0072" target="_blank">AC10</a>.</strong> Audibert and Catoni. “Linear regression through PAC-Bayesian truncation.” 2010.</li>
<li><strong><a href="https://projecteuclid.org/journals/annals-of-statistics/volume-45/issue-3/Asymptotics-of-empirical-eigenstructure-for-high-dimensional-spiked-covariance/10.1214/16-AOS1487.full" target="_blank">WF17</a>.</strong> Wang and Fan. “Asymptotics of empirical eigenstructure for high dimensional spiked covariance.” 2017.</li>
<li><strong><a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a>. <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>.</strong> Belkin, Hsu, and Xu. “Two models of double descent for weak features.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>. <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>.</strong> Bartlett, Long, Lugosi, and Tsigler. “Benign overfitting in linear regression.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a>. <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>.</strong> Hastie, Montanari, Rosset, and Tibshirani. “Surprises in high-dimensional ridgeless least squares interpolation.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/1906.03667" target="_blank">Mit19</a>.</strong> Mitra. “Understanding overfitting peaks in generalization error: Analytical risk curves for \(\ell_2\) and \(\ell_1\) penalized interpolation.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/1912.13421" target="_blank">MN19</a>.</strong> Mahdaviyeh and Naulet. “Risk of the least squares minimum norm estimator under the spike covariance model.” 2019.</li>
<li><strong><a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a>. <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>.</strong> Muthukumar, Vodrahalli, Subramanian, and Sahai. “Harmless interpolation of noisy data in regression.” 2019.</li>
<li><strong><a href="https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf" target="_blank">XH19</a>. <a href="/2021/09/11/xh19.html" target="_blank">[OPML#6]</a>.</strong> Xu and Hsu. “On the number of variables to use in principal component regression.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/2010.08479" target="_blank">BL20</a>. <a href="/2021/07/30/bl20.html" target="_blank">[OPML#5]</a></strong> Bartlett and Long. “Failures of model-dependent generalization bounds for least-norm interpolation.” 2020.</li>
<li><strong><a href="https://arxiv.org/abs/2011.11477" target="_blank">HHV20</a>.</strong> Huang, Hogg, and Villar. “Dimensionality reduction, regularization, and generalization in overparameterized regressions.” 2020.</li>
</ul>
</li>
<li><em>Ridge regression:</em>
<ul>
<li><strong><a href="https://arxiv.org/abs/1507.03003" target="_blank">DW15</a>.</strong> Dobriban and Wagner. “High-dimensional asymptotics of prediction: ridge regression and classification.” 2015.</li>
<li><strong><a href="https://arxiv.org/abs/2009.14286" target="_blank">TB20</a>.</strong> Tsigler and Bartlett. “Benign overfitting in ridge regression.” 2020.</li>
</ul>
</li>
<li><em>Kernel regression:</em>
<ul>
<li><strong><a href="https://direct.mit.edu/neco/article/17/9/2077/7007/Learning-Bounds-for-Kernel-Regression-Using" target="_blank">Zha05</a>.</strong> Zhang. “Learning bounds for kernel regression using effective data dimensionality.” 2005.</li>
<li><strong><a href="http://proceedings.mlr.press/v99/rakhlin19a.html" target="_blank">RZ19</a>.</strong> Rakhlin and Zhai. “Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon.” 2019.</li>
<li><strong><a href="http://proceedings.mlr.press/v125/liang20a.html" target="_blank">LRZ20</a>.</strong> Liang, Rakhlin, and Zhai. “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels.” 2020.</li>
</ul>
</li>
<li><em>Support Vector Machines:</em>
<ul>
<li><strong><a href="https://arxiv.org/abs/2005.11818" target="_blank">BHMZ20</a>.</strong> Bousquet, Hanneke, Moran, and Zhivotovskiy. “Proper learning, Helly number, and an optimal SVM bound.” 2020.</li>
<li><strong><a href="https://arxiv.org/abs/2005.08054" target="_blank">MNSBHS20</a>.</strong> Muthukumar, Narang, Subramanian, Belkin, Hsu, and Sahai. “Classification vs regression in overparameterized regimes: Does the loss function matter?” 2020.
<!-- * **[WT20](https://arxiv.org/abs/2011.09148){:target="_blank"}.** Wang and Thrampoulidis. "Binary classification of Gaussian mixtures: abundance of support vectors, benign overfitting and regularization." 2020. -->
<!-- * **[CGB21](https://arxiv.org/abs/2104.13628){:target="_blank"}.** Cao, Gu, and Belkin. "Risk bounds for over-parameterized maximum margin classification on sub-Gaussian mixtures." 2021. --></li>
<li><strong><a href="https://arxiv.org/abs/2004.12019" target="_blank">CL20</a>.</strong> Chatterji, Long. “Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime.” 2020.</li>
</ul>
</li>
<li><em>Random features models:</em>
<ul>
<li><strong><a href="https://arxiv.org/abs/1908.05355" target="_blank">MM19</a>.</strong> Mei and Montanari. “The generalization error of random features regression: Precise asymptotics and double descent curve.” 2019.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Training beyond zero training error in boosting:</strong>
While not exactly an over-parameterized model, one well-known example of when training a model past the point of perfectly fitting the data can produce better population errors is with the AdaBoost algorithm.
I’ll discuss the original AdaBoost paper and how arguments about the margins of the resulting classifiers suggest that there’s more to training ML models than finding the “sweet spot” discussed above and avoiding overfitting.
<ul>
<li><strong><a href="https://www.sciencedirect.com/science/article/pii/S002200009791504X" target="_blank">FS97</a>.</strong> Freund and Schapire. “A decision-theoretic generalization of online learning and an application to boosting.” 1997.</li>
<li><strong><a href="https://projecteuclid.org/journals/annals-of-statistics/volume-26/issue-5/Boosting-the-margin--a-new-explanation-for-the-effectiveness/10.1214/aos/1024691352.full" target="_blank">BFLS98</a>.</strong> Bartlett, Freund, Lee, and Schapire. “Boosting the margin: a new explanation for the effectiveness of voting methods.” 1998.</li>
</ul>
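The boosting phenomenon is easy to reproduce on a toy problem. The following from-scratch AdaBoost sketch (my own illustration, not code from FS97 or BFLS98) boosts decision stumps on six one-dimensional points well past the round where training error hits zero; in line with the margin story of BFLS98, the minimum normalized margin of the voting classifier can be tracked as rounds continue.

```python
import math

# Toy 1-D dataset: no single threshold stump classifies it perfectly,
# but a weighted vote of stumps can.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]
n = len(X)

THRESHOLDS = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]  # all distinct splits of X

def stump(theta, s, x):
    """Decision stump: predict s if x > theta, else -s."""
    return s if x > theta else -s

def best_stump(w):
    """Exhaustive search for the stump with minimum weighted error."""
    best = (float("inf"), None, None)
    for theta in THRESHOLDS:
        for s in (-1, 1):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump(theta, s, xi) != yi)
            if err < best[0]:
                best = (err, theta, s)
    return best

w = [1.0 / n] * n
ensemble = []  # (alpha, theta, s) triples

def min_margin():
    """Minimum normalized margin of the current voting classifier."""
    total = sum(abs(a) for a, _, _ in ensemble)
    return min(
        yi * sum(a * stump(t, s, xi) for a, t, s in ensemble) / total
        for xi, yi in zip(X, y)
    )

# Run far past the round where training error first hits zero.
for rnd in range(120):
    err, theta, s = best_stump(w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, theta, s))
    # Upweight misclassified points, downweight correct ones, renormalize.
    w = [wi * math.exp(-alpha * yi * stump(theta, s, xi))
         for wi, xi, yi in zip(w, X, y)]
    Z = sum(w)
    w = [wi / Z for wi in w]

train_error = sum(
    1 for xi, yi in zip(X, y)
    if yi * sum(a * stump(t, s, xi) for a, t, s in ensemble) <= 0
) / n
print(f"train error: {train_error:.2f}, min margin: {min_margin():.3f}")
```

Printing `min_margin()` at intermediate rounds (rather than only at the end) shows the quantity BFLS98 argue keeps improving even after the training error is already zero.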
</li>
<li><strong>Interpolation of arbitrary data in neural networks and kernel machines:</strong>
These papers show that both neural networks and kernel machines can interpolate data with arbitrary labels and still generalize, even when some fraction of the data are noisy.
The former challenges the narrative that classical learning theory, oriented around avoiding overfitting, can explain generalization in deep learning.
The latter suggests that these phenomena are not unique to deep neural networks and that simpler linear models are ripe for study as well.
<ul>
<li><strong><a href="https://arxiv.org/abs/1611.03530" target="_blank">ZBHRV17</a>.</strong> Zhang, Bengio, Hardt, Recht, and Vinyals. “Understanding deep learning requires rethinking generalization.” 2017.</li>
<li><strong><a href="https://arxiv.org/abs/1802.01396" target="_blank">BMM18</a>.</strong> Belkin, Ma, and Mandal. “To understand deep learning we need to understand kernel learning.” 2018.</li>
<li><strong><a href="https://arxiv.org/abs/2002.01523" target="_blank">AAK20</a>.</strong> Agarwal, Awasthi, and Kale. “A Deep Conditioning Treatment of Neural Networks.” 2020.</li>
</ul>
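For intuition, the fitting-arbitrary-labels experiment has a one-line linear analogue (my own sketch, not from any of the above papers): whenever the number of features exceeds the number of samples, the minimum-norm least-squares solution interpolates any labeling, including a completely random one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # over-parameterized: more features than samples

X = rng.normal(size=(n, d))                 # random design matrix
y_random = rng.choice([-1.0, 1.0], size=n)  # arbitrary (random) labels

# Minimum-norm interpolator: w = X^T (X X^T)^{-1} y.
# With d > n and generic X, the Gram matrix X X^T is invertible,
# so any labeling whatsoever can be fit exactly.
w = X.T @ np.linalg.solve(X @ X.T, y_random)

print("max residual:", np.max(np.abs(X @ w - y_random)))
```

The deep-network and kernel versions of this experiment in ZBHRV17 and BMM18 are more involved, but this is the basic reason expressivity alone cannot distinguish generalizing models from memorizing ones.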
</li>
<li><strong>Empirical evidence of double descent in deep neural networks:</strong>
I’ll survey a variety of papers that relate the above theoretical ideas to experimental results in deep learning.
<ul>
<li><strong><a href="https://www.pnas.org/content/116/32/15849" target="_blank">BHMM19</a>.</strong> Belkin, Hsu, Ma, and Mandal. “Reconciling modern machine-learning practice and the classical bias–variance trade-off.” 2019.</li>
<li><strong><a href="https://arxiv.org/abs/1912.02292" target="_blank">NKBYBS19</a>.</strong> Nakkiran, Kaplun, Bansal, Yang, Barak, and Sutskever. “Deep double descent: Where bigger models and more data hurt.” 2019.</li>
<li><strong><a href="https://iopscience.iop.org/article/10.1088/1751-8121/ab4c8b" target="_blank">SGDSBW19</a>.</strong> Spigler, Geiger, d’Ascoli, Sagun, Biroli, and Wyart. “A jamming transition from under- to over-parametrization affects generalization in deep learning.” 2019.
<!-- * **[NVKM20](https://arxiv.org/abs/2003.01897){:target="_blank"}.** Nakkiran, Venkat, Kakade, and Ma. "Optimal regularization can mitigate double descent." 2020. --></li>
</ul>
</li>
<li><strong>Smoothness of interpolating neural networks as a function of width:</strong>
Since one of the core benefits of over-parameterized interpolation models is obtaining very smooth functions \(f_\theta\), there’s interest in understanding how the number of parameters of a neural network relates to the smoothness of \(f_\theta\).
These papers attempt to establish that relationship.
<ul>
<li><strong><a href="https://arxiv.org/abs/2009.14444" target="_blank">BLN20</a>. <a href="/2021/09/22/bubeck.html" target="_blank">[OPML#7]</a>.</strong> Bubeck, Li, and Nagaraj. “A law of robustness for two-layers neural networks.” 2020.</li>
<li><strong><a href="https://arxiv.org/abs/2105.12806" target="_blank">BS21</a>. <a href="/2021/09/22/bubeck.html" target="_blank">[OPML#7]</a>.</strong> Bubeck and Sellke. “A universal law of robustness via isoperimetry.” 2021.</li>
</ul>
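To make one such smoothness measure concrete: for a two-layer ReLU network \(f(x) = \sum_j u_j \sigma(w_j^T x + b_j)\), the 1-Lipschitzness of \(\sigma\) gives the elementary bound \(\mathrm{Lip}(f) \le \sum_j |u_j| \|w_j\|_2\). The numpy sketch below (my own illustration, not code from BLN20 or BS21) computes that bound for a random network and checks it against empirical slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 50  # input dimension and number of neurons (illustrative sizes)

# Random parameters of f(x) = sum_j u_j * relu(w_j . x + b_j)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
u = rng.normal(size=k) / k

def f(x):
    """Evaluate the two-layer ReLU network at x."""
    return float(u @ np.maximum(W @ x + b, 0.0))

# ReLU is 1-Lipschitz, so each neuron x -> u_j * relu(w_j . x + b_j)
# is |u_j| * ||w_j||-Lipschitz; summing gives an upper bound on Lip(f).
lip_bound = float(np.sum(np.abs(u) * np.linalg.norm(W, axis=1)))

# Sanity check: slopes between random pairs never exceed the bound.
for _ in range(200):
    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    slope = abs(f(x1) - f(x2)) / np.linalg.norm(x1 - x2)
    assert slope <= lip_bound + 1e-9

print(f"Lipschitz upper bound: {lip_bound:.3f}")
```

The interesting question in these papers is the reverse direction: how small this kind of quantity can be made while still (approximately) interpolating \(n\) noisy samples with \(k\) neurons.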
</li>
</ul>
<h3 id="posts-so-far">Posts so far</h3>
<p>You can also track these under the <a href="/tag/candidacy"><code class="highligher-rouge"><nobr>candidacy</nobr></code></a> tag page.</p>
<ul>
<li><a href="/2021/07/04/candidacy-overview.html" target="_blank">[OPML#0] A series of posts on over-parameterized machine learning models</a> (04 Jul 2021)<br />
</li>
<li><a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1] BHX19: Two models of double descent for weak features</a> (05 Jul 2021)<br />
</li>
<li><a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2] BLLT19: Benign overfitting in linear regression</a> (11 Jul 2021)<br />
</li>
<li><a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3] MVSS19: Harmless interpolation of noisy data in regression</a> (16 Jul 2021)<br />
</li>
<li><a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4] HMRT19: Surprises in high-dimensional ridgeless least squares interpolation</a> (23 Jul 2021)<br />
</li>
<li><a href="/2021/07/30/bl20.html" target="_blank">[OPML#5] BL20: Failures of model-dependent generalization bounds for least-norm interpolation</a> (30 Jul 2021)<br />
</li>
<li><a href="/2021/09/11/xh19.html" target="_blank">[OPML#6] XH19: On the number of variables to use in principal component regression</a> (11 Sep 2021)<br />
</li>
<li><a href="/2021/09/22/bubeck.html" target="_blank">[OPML#7] BLN20 & BS21: Smoothness and robustness of neural net interpolators</a> (22 Sep 2021)<br />
</li>
</ul>
<h2 id="whats-next">What’s next?</h2>
<p>This overview post is published alongside the first paper summary about <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a>, which is <a href="/2021/07/05/bhx19.html" target="_blank">here</a>.
This paper nicely explains how linear regression can perform well in an over-parameterized regime by interpolating the data.
It relies on a fairly straightforward mathematical argument that focuses on a toy model with samples drawn from a nice data distribution, and it’s helpful for seeing the kinds of results that we can expect to prove about models with more parameters than samples.
<!-- This paper provides a detailed understanding of the performance of linear regression models under a broad set of distributional assumptions.
Notably, this paper does not actually handle the over-paramaterized regime; they have a fixed number of parameters $$p$$ and consider the limiting case where $$n \to \infty$$.
The purpose of starting with this one is to understand how researchers typically thought about why linear models worked before the last few years.
I'll shift to focusing on papers about over-parameterization in the weeks to come.
-->I’ll aim to write a new blog post about a different paper each week until I’m done.</p>
<p>Because research papers are by nature highly technical and because I’m trying to understand them in their full depth, most of these posts will only be accessible to readers with some background in my field.
However, I also don’t want to write posts that’ll be useless to anyone who isn’t pursuing a PhD in machine learning theory.
My intention is that readers with some amount of background in ML will be able to read them to understand what kinds of questions learning theorists ask; if someone who’s taken an undergraduate-level ML and algorithms course can’t understand what I’m writing, then that’s on me.
I’ll periodically give more technical asides into proof techniques that’ll only make sense to people who work directly on research in this area, but I’ll flag them so they’ll be easily skippable.</p>
<p>Maybe this blog will become something more with time… I’m trying to get my feet wet by talking mostly about technical topics that will be primarily of interest to people in my research field, but I may end up branching out to broader subjects that will interest people who aren’t theoretical CS weirdos.</p>
<p>Thanks for making it this far! Hope to see you next time. If I ever get the comments section working, let me know what you think!</p>