Clayton’s Blog: Random thoughts about machine learning, algorithms, math, running, living in New York, and more. By Clayton Sanford.
<h1>My candidacy exam is done! (2021-11-17)</h1>
<p>After working my way through all thirty papers from my <a href="/2021/07/04/candidacy-overview.html" target="_blank">list</a>, I finally took (and passed) my candidacy exam yesterday! I suppose this means that I can finally call myself a PhD candidate, rather than a PhD student.</p>
<p>Anyways, <a href="/assets/files/candidacy-slides.pdf" target="_blank">here</a> are the slides I made for the presentation. At the end, there are thirty appendix slides, each of which gives a one-slide summary of a paper from the list.</p>
<p>I’m planning on continuing to blog from here, but not in such a structured fashion as I did with my OPML series.
I’m considering having some kind of weekly newsletter that gives quick recaps of papers I’ve read and things I find interesting, along with periodic longer posts about particularly neat papers, my own work, or personal stuff.
Thanks for reading, and stay tuned!</p>
<h1>[OPML#10] MNSBHS20: Classification vs regression in overparameterized regimes: Does the loss function matter? (2021-11-04)</h1>
<p><em>This is the tenth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Once again, we discuss a paper that shows how hard-margin support vector machines (SVMs) (or maximum-margin linear classifiers) can experience benign overfitting when the learning problem is over-parameterized.
The paper, <a href="https://arxiv.org/abs/2005.08054" target="_blank">“Classification vs regression in overparameterized regimes: Does the loss function matter?”</a>, was written by Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu (my advisor!), and Anant Sahai.</p>
<p>While the kinds of results are similar to the ones discussed in <a href="/2021/10/20/boosting.html" target="_blank">last week’s post</a>, the methodology is quite different. Rather than studying the properties of the iterates of gradient descent, this paper shows that minimum-norm linear regression and SVMs coincide in the over-parameterized regime, so the two models behave identically in those cases; this phenomenon is known as <em>support vector proliferation</em> and is discussed in depth by <a href="https://arxiv.org/abs/2009.10670" target="_blank">a follow-up paper by Daniel, Vidya, and Ji (Mark) Xu</a> and by <a href="https://arxiv.org/abs/2105.14084" target="_blank">my NeurIPS paper with Navid Ardeshir and Daniel</a>.</p>
<p>To make the point, the paper considers a narrow regime of data distributions and categorizes those distributions to determine (1) when the outputs of OLS regression and SVM classification coincide and (2) when each of those have favorable generalization error as the number of samples \(n\) and dimension \(d\) trend towards infinity.
We introduce their <em>bilevel ensemble</em> input distribution and their <em>1-sparse linear model</em> for determining labels.
Their results show that under similar conditions to those explored in BLLT19, benign overfitting is possible for classification algorithms like SVMs.
Indeed, for their distributional assumptions, benign overfitting is more common for classification than regression.</p>
<h2 id="ols-and-svm">OLS and SVM</h2>
<p>A key part of this paper’s story relies on the coincidence of support vector machines for classification and ordinary least squares for regression.
We introduce the two models and clarify why one might expect them to have similar solutions for the high-dimensional setting.</p>
<p>From last week, we define the hard-margin SVM classifier to be \(x \mapsto \text{sign}(\langle w_{SVM}, x\rangle)\) where</p>
\[w_{SVM} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}\).
This classifier maximizes the margins of linearly separable training data.
Notably, a training sample \((x_i, y_i)\) is a <em>support vector</em> if \(\langle w_{SVM}, x_i\rangle = y_i\), which means that \(x_i\) lies exactly on the margin and is as close as possible to the linear separator.
The hypothesis \(w_{SVM}\) can be alternatively represented as a linear combination of support vectors, which means that all samples not on the margin are irrelevant to the SVM classifier vector.
Traditionally, favorable generalization properties for SVMs are shown for the cases where the number of support vectors is small, which implies some degree of “simplicity” in the model.</p>
<p>If the model is over-parameterized (i.e. \(d > n\)), we define the <em>minimum-norm ordinary least squares (OLS) regression</em> predictor to be \(x \mapsto \langle w_{OLS}, x\rangle\) where</p>
\[w_{OLS} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } \langle w, x_i\rangle = y_i, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}\).
The two programs are identical, except that the labels are \(\{-1, 1\}\) for the SVM and real-valued for OLS, and the SVM’s inequality constraints become equality (interpolation) constraints for OLS.</p>
<p>Sufficient conditions for benign overfitting for OLS have been explored in past blog posts, like the ones on <a href="/2021/07/05/bhx19.html" target="_blank">BHX19</a>, <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, and <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a>.
Conditions for SVMs were explored in <a href="/2021/10/28/cl20.html" target="_blank">CL20</a>.
This paper unifies the two by showing cases where \(w_{OLS} = w_{SVM}\) and transfers the benign overfitting results from OLS to SVMs.</p>
<p>If we assume that both problems (regression and classification) have \(\{-1, 1\}\) labels, then \(w_{OLS} = w_{SVM}\) is implied by having \(\langle w_{SVM}, x_i\rangle = y_i\) for all \(i\), which means that every sample is a support vector.
This is the support vector proliferation phenomenon briefly discussed before.</p>
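Support vector proliferation is easy to observe numerically. By the KKT conditions, the minimum-norm interpolator \(w_{OLS} = X^\top (XX^\top)^{-1} y\) (which satisfies the SVM constraints with equality) is also the hard-margin SVM solution exactly when every dual coefficient \(y_i [(XX^\top)^{-1} y]_i\) is nonnegative, i.e. when every sample is a support vector. Here's a minimal numpy sketch; the sample sizes and the isotropic Gaussian distribution are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 2000  # heavily over-parameterized: d >> n log n

X = rng.standard_normal((n, d))      # isotropic Gaussian features
y = np.sign(X[:, 0])                 # 1-sparse labels: the sign of one coordinate

# Minimum-norm interpolator: w = X^T (X X^T)^{-1} y.
G = X @ X.T
w_ols = X.T @ np.linalg.solve(G, y)
assert np.allclose(X @ w_ols, y)     # interpolates, so y_i <w, x_i> = 1 for all i

# w_ols is also the hard-margin SVM solution iff every dual
# coefficient alpha_i = y_i [(X X^T)^{-1} y]_i is nonnegative.
alpha = y * np.linalg.solve(G, y)
print("support vector fraction:", np.mean(alpha > 0))  # 1.0 with high probability here
```

In this regime every \(\alpha_i\) concentrates near \(1/d > 0\), so all samples are support vectors and the two solutions coincide.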
<h2 id="data-model">Data model</h2>
<p>They prove their results over a simple data distribution, which is a special case of the distributions explored by BLLT19.
Specifically, they consider <em>bilevel Gaussian ensembles</em>: the features \(x_i\) are drawn independently from a Gaussian distribution with diagonal covariance matrix \(\Sigma\), whose diagonal entries \(\lambda_1, \dots, \lambda_d\) for \(d = n^p\) satisfy</p>
\[\lambda_j = \begin{cases}
n^{p - r - q} & j \leq n^r \\
\frac{1 - n^{-q}}{1 - n^{r - p}} & j > n^r
\end{cases}\]
<p>for \(p > 1\), \(r \in (0, 1)\), and \(q \in (0, p-r)\).
It’s called a bilevel ensemble because the first \(n^r\) coordinates are drawn from higher variance normal distributions than the remaining \(n^p - n^r\) coordinates. A few notes on this model:</p>
<ul>
<li>Because \(p > 1\), \(d = \omega(n)\) and the model is always over-parameterized.</li>
<li>\(r\) governs the number of high-importance features. Because \(r < 1\), there must always be a sublinear number of high-importance features.</li>
<li>If \(q\) were permitted to be \(p - r\), then the model would be spherical or isotropic and have \(\lambda_j = 1\) for all \(j\). On the other hand, if \(q = 0\), \(\lambda_j = 0\) for \(j \geq n^r\) and all of the variance would be on the first \(n^r\) features. Thus, \(q\) modulates how much more variance the high-importance features have than the low-importance features.</li>
<li>
<p>The variances are normalized to have their \(L_1\) norms always be \(d = n^p\):</p>
\[\|\lambda\|_1 = \sum_{j=1}^{d} \lambda_j = n^{p} \cdot n^{-q} + \frac{(n^p - n^r)(1 - n^{-q})}{1 - n^{r-p}} = n^{p} \cdot n^{-q} + n^p (1 - n^{-q}) = n^p.\]
</li>
<li>
<p>We can compute the effective dimension terms used in the BLLT19 paper:</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} = \begin{cases}
\frac{(n^r - k)n^{p-r-q} + n^p (1 - n^{-q})}{n^{p-r-q}} &= \Theta(n^{r + q}) & k < n^r \\
n^p - k & & k \geq n^r.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2}= \begin{cases}
\Theta(\min(n^p, n^{r + 2q})) & k < n^r \\
n^p - k & k \geq n^r.
\end{cases}\]
</li>
</ul>
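These normalization and effective-rank claims can be checked numerically. The sketch below builds the bilevel spectrum for one valid (and arbitrarily chosen) setting of \(n, p, r, q\) and verifies that \(\|\lambda\|_1 = n^p\), \(r_0(\Sigma) = n^{r+q}\), and \(R_0(\Sigma) = \Theta(\min(n^p, n^{r+2q}))\):

```python
import numpy as np

n, p, r, q = 100, 2.0, 0.5, 0.5          # arbitrary valid setting: p > 1, r < 1, 0 < q < p - r
d, k_high = int(n**p), int(n**r)         # ambient dimension and number of high-variance coords

lam = np.empty(d)
lam[:k_high] = n**(p - r - q)            # high-importance variances
lam[k_high:] = (1 - n**(-q)) / (1 - n**(r - p))  # low-importance variances

# Normalization: the variances sum to d = n^p.
assert np.isclose(lam.sum(), n**p)

# Effective ranks at k = 0, as used in the BLLT19 bounds.
r0 = lam.sum() / lam[0]                  # = n^{r+q}
R0 = lam.sum()**2 / (lam**2).sum()       # = Theta(min(n^p, n^{r+2q}))
print(r0, n**(r + q))                    # these agree exactly
print(R0, min(n**p, n**(r + 2*q)))       # these agree up to a constant factor
```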
<p><img src="/assets/images/2021-11-04-mnsbhs20/lambdas.jpeg" alt="" /></p>
<p>The labels \(y\) are chosen with the <em>1-sparse linear model</em>, which only considers one of the coordinates. That is, for some \(t \leq n^r\), we let \(w^* = \lambda_t^{-1/2} e_t\), where \(e_t \in \mathbb{R}^d\) is the vector that is all zeroes, except for a one at index \(t\).
Note that \(\|w^*\|^2 = \lambda_t^{-1} = n^{r+q - p}\).
That is, the labels are \(\text{sign}(\langle w^*, x\rangle) = \text{sign}(x_t)\).
<!-- We add noise by flipping the label with probability $$\sigma$$. -->
(For regression, we instead think of the labels as \(\langle w^*, x\rangle = x_t\).)</p>
<p>Their theoretical results rely on having no label noise.</p>
<p>From this data model alone, we can plug in the bounds of BLLT19 to see what they tell us. <em>Note: There actually isn’t a perfect analogue here, because BLLT includes additive label noise with variance \(\sigma^2\). The purpose of these bounds is to illustrate what is known about a similar model.</em></p>
<ul>
<li>If \(r + q > 1\), then \(r_0(\Sigma) = \Theta(n^{r+q}) = \omega(n)\), and the bias term of the bound, \(\|w^*\|^2 \lambda_1 \sqrt{r_0(\Sigma)/n} = \Theta(\sqrt{n^{r+q-1}})\), diverges, which makes the bound vacuous.</li>
<li>
<p>If \(r + q < 1\), then \(k^* = 0\). Then, the BLLT19 bounds yield an excess risk of at most</p>
\[O\left( \|w^*\|^2 \lambda_1 \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{\sigma^2 n}{R_{0}(\Sigma)} \right) = O\left( \sqrt{n^{r + q - 1}}+ \sigma^2 \max(n^{1-p}, n^{1-r-2q}) \right).\]
<p>For this bound to trend towards zero, it must be true that \(r + 2q > 1\); the other requirement, \(r + q < 1\), already holds in this case.</p>
</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/bllt.jpeg" alt="" /></p>
<p>The bound given in the paper at hand will look slightly different (e.g. it won’t have the first requirement, because label noise is handled differently).
In addition, it will distinguish between benign overfitting in the classification and regression regimes and show that it’s easier to obtain favorable generalization error bounds for regression.</p>
<h2 id="main-results">Main results</h2>
<p>They have two types of main results: Theorem 1 gives sufficient conditions for the coincidence of the SVM and OLS weights \(w_{SVM}\) and \(w_{OLS}\), and Theorem 2 analyzes the excess errors of both classification and regression.</p>
<h3 id="when-does-svm--ols">When does SVM = OLS?</h3>
<p><em><strong>Theorem 1:</strong> For sufficiently large \(n\), \(w_{SVM} = w_{OLS}\) with high probability if</em></p>
<p>\(\|\lambda\|_1 = \Omega(\|\lambda\|_2 n \sqrt{\log n} + \|\lambda\|_\infty n^{3/2} \log n)\).</p>
<p>Equivalently, it must hold that \(R_0(\Sigma) = \Omega(n^2 \log n)\) and \(r_0(\Sigma) = \Omega(n^{3/2} \log n)\).
This holds for the bilevel model when \(r + q > \frac{3}{2}\).</p>
<p>In the <a href="https://arxiv.org/abs/2009.10670" target="_blank">two</a> <a href="https://arxiv.org/abs/2105.14084" target="_blank">follow-ups</a>, this bound is changed to \(r_0(\Sigma) = \Omega(n \log n)\) and the phenomenon is shown to NOT occur when \(R_0(\Sigma) = O(n \log n)\).
Thus, this can actually be shown to occur for the bilevel model when \(r + q > 1\).</p>
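This prediction is easy to probe empirically under the bilevel model: sample Gaussian features with the bilevel covariance and count what fraction of samples are support vectors via the dual characterization \(y_i [(XX^\top)^{-1} y]_i > 0\). A rough numpy sketch (the parameter settings are mine, and at this small \(n\) the experiment only illustrates the trend, not the limit):

```python
import numpy as np

def sv_fraction(n, p, r, q, seed=0):
    """Fraction of samples that are support vectors under the bilevel Gaussian model."""
    rng = np.random.default_rng(seed)
    d, k_high = int(n**p), int(n**r)
    lam = np.empty(d)
    lam[:k_high] = n**(p - r - q)                    # high-importance variances
    lam[k_high:] = (1 - n**(-q)) / (1 - n**(r - p))  # low-importance variances
    X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows ~ N(0, Sigma)
    y = np.sign(X[:, 0])                             # 1-sparse labels
    alpha = y * np.linalg.solve(X @ X.T, y)          # dual coefficients of w_OLS
    return np.mean(alpha > 0)                        # = 1 iff w_OLS = w_SVM

# r + q well above 1: support vectors should proliferate.
print(sv_fraction(n=50, p=2.0, r=0.5, q=1.2))
# r + q well below 1: the "easy" samples far from the boundary drop out.
print(sv_fraction(n=50, p=2.0, r=0.2, q=0.1))
```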
<p>The proof of the theorem in this paper relies on applying bounds on Gaussian concentration and properties of the <a href="https://en.wikipedia.org/wiki/Inverse-Wishart_distribution" target="_blank">inverse-Wishart distribution</a>.
The follow-up results rely on tighter concentration bounds, a leave-one-out equivalence that is true when a sample is a support vector, and a trick that relates the relevant quantities to a collection of independent random variables.</p>
<h3 id="generalization-bounds">Generalization bounds</h3>
<p>Their generalization bounds apply to the OLS solutions for two cases, (1) where the labels are real-valued and (2) where the labels are Boolean \(\{-1,1\}\).
We call the minimum norm solutions of these \(w_{OLS, real}\) and \(w_{OLS, bool}\).
Thus, when \(r\) and \(q\) are large enough for Theorem 1 to guarantee that OLS = SVM, then the bounds for Boolean labels apply.</p>
<p><em><strong>Theorem 2:</strong> For a bilevel data model that is 1-sparse without label noise, the classification error \(\lim_{n \to \infty} \mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0]\) and regression excess MSE error \(\lim_{n \to \infty} \mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2]\) satisfy the following for the given settings of \(p\), \(q\), and \(r\):</em></p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{OLS, bool}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (0, 1)\)</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>\(\frac12\)</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>This table tells us several things about the differences in generalization between classification and regression.</p>
<ul>
<li>When \(\Sigma\) has a relatively even distribution of variance between the high-importance and low-importance coordinates and when there are relatively few high-importance coordinates, there tends to be favorable generalization for both classification and regression.
The reverse is true when there is a sharp cut-off between variances and when there are many high-importance features.
This fits a similar intuition to BLLT19, which forbids too sharp a decay of variances.</li>
<li>One might observe that this doesn’t have the other requirement from BLLT: that the variances do not decay too gradually, which is enforced by \(r + 2q > 1\). This is absent here because this paper does not consider label noise, so the risk of a model being corrupted by overfitting noisy labels is minimized.</li>
<li>There is also a regime in between where classification generalizes well, but regression does not.</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/ols.jpeg" alt="" /></p>
<p>By combining the improved results on support vector proliferation with Theorem 2, we can obtain the following table of results for SVM vs OLS.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{SVM}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>\(\frac12\)</td>
<td>1</td>
</tr>
</tbody>
</table>
<p><img src="/assets/images/2021-11-04-mnsbhs20/svm.jpeg" alt="" /></p>
<p>How do these generalization bounds work? They’re similar to the flavor of argument given in <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, which considers signal bleed and signal contamination.
Put roughly, an interpolating model can perform poorly if either the true signal gets split up among a bunch of orthogonal aliases that each interpolate the training data (signal bleed), or too many spurious correlations are incorporated into the chosen alias (signal contamination).
They assess and bound these notions by introducing the <em>survival</em> and <em>contamination</em> terms as</p>
\[\mathsf{SU}(w, t) = \frac{w_t}{w^*_t} = \sqrt{\lambda_t} w_t \ \text{and} \ \mathsf{CN}(w, t) = \sqrt{\mathbb{E}[(\sum_{j\neq t} w_j x_j)^2]} = \sqrt{\sum_{j \neq t} \lambda_j w_j^2}\]
<p>This formulation is simple thanks to the 1-sparse assumption on the labels.
It seems like it may be possible to write something similar without this data model, but it would probably require much uglier expressions and more complex distributional assumptions to make the proof work.</p>
<p>The proof then uses Proposition 1 to relate the classification and regression errors to the survival and contamination terms and concludes by using Lemmas 11, 12, 13, 14, and 15 to place upper and lower bounds on those terms. Prop 1 shows the following relationships:</p>
\[\mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0] = \frac12 - \frac1{\pi} \tan^{-1} \left(\frac{\mathsf{SU}(w, t)}{\mathsf{CN}(w, t)} \right)\]
\[\mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2] = (1 - \mathsf{SU}(w, t))^2 + \mathsf{CN}(w, t)^2\]
<p>From looking at these terms, it should be intuitive why classification error is more likely to go to zero than regression error: it is sufficient for \(\mathsf{CN}(w, t)\) to become arbitrarily small for the classification error to approach zero, even if \(\mathsf{SU}(w, t)\) is a constant smaller than 1. On the other hand, it must be the case that \(\mathsf{CN}(w, t)\to 0\) <em>and</em> \(\mathsf{SU}(w, t)\to 1\) for the regression error to go to zero.</p>
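For Gaussian inputs the classification identity is exact (the error depends only on the angle between \(w\) and \(w^*\) in the \(\Sigma\)-weighted geometry), so it can be checked directly against a Monte Carlo estimate. A small numpy sketch, with bilevel parameters of my own choosing in the benign \(r + q < 1\) regime:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, q, t = 50, 1.8, 0.3, 0.4, 0     # r + q < 1; t is the relevant coordinate (0-indexed)
d, k_high = int(n**p), int(n**r)

lam = np.empty(d)
lam[:k_high] = n**(p - r - q)            # bilevel variances, as in the data model above
lam[k_high:] = (1 - n**(-q)) / (1 - n**(r - p))

X = rng.standard_normal((n, d)) * np.sqrt(lam)
y = np.sign(X[:, t])                     # 1-sparse labels
w = X.T @ np.linalg.solve(X @ X.T, y)    # min-norm interpolator of the Boolean labels

# Survival and contamination of the learned direction.
SU = np.sqrt(lam[t]) * w[t]
CN = np.sqrt(np.sum(lam * w**2) - lam[t] * w[t]**2)
err_formula = 0.5 - np.arctan(SU / CN) / np.pi

# Monte Carlo estimate of the classification error on fresh samples.
X_test = rng.standard_normal((2000, d)) * np.sqrt(lam)
err_mc = np.mean(np.sign(X_test @ w) != np.sign(X_test[:, t]))
print(err_formula, err_mc)               # the two should agree up to sampling noise
```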
<p>The concentration bounds in the lemmas are gory and I don’t plan to go into them here. They rely on a slew of concentration bounds that are made possible by the Gaussianity of the inputs and the tight control of their variances.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>This was another really interesting paper for me, although I wasn’t quite brave enough to venture through all of the proofs of this one.
It’s primarily interesting as a proof of concept; the assumptions are prohibitively restrictive (only one relevant coordinate, Gaussian inputs, no label noise), but the proofs would have been sickening to the point of being unreadable if many of these assumptions were dropped. This paper was an inspiration for my collaborators and me to investigate support vector proliferation in more depth, and these are a nice complement to CL20, which proves bounds for more restricted values of \(d\) in the presence of label noise and without relying on limits.</p>
<p>Thanks for joining me once again! The next entry–and possibly the last entry of this series–will be posted next week. When the actual exam occurs in two weeks, I might have one last recap post of what’s been discussed so far.</p>
<h1>[OPML#9] CL20: Finite-sample analysis of interpolating linear classifiers in the overparameterized regime (2021-10-28)</h1>
<p><em>This is the ninth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Like <a href="/2021/10/20/boosting.html" target="_blank">last week’s post</a>, we’ll step away from linear regression and discuss how over-parameterized <em>classification</em> models can achieve good generalization performance.
Unlike last week’s post, we focus on <em>maximum-margin classifiers</em> (or <em>support vector machines</em>) that interpolate the data in high-dimensional settings.
The paper is called <a href="https://arxiv.org/abs/2004.12019" target="_blank">“Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime”</a> and was written by Niladri Chatterji and Philip Long.</p>
<h2 id="maximum-margin-classifier">Maximum-margin classifier</h2>
<p>Suppose we have some linearly separable training data.
There are many different strategies of choosing a linear separator for those data, and it’s unclear off the bat which ones will generalize best to novel samples.
To sketch the issue, the below visualization shows how two linearly separable classes have many valid hypotheses that interpolate the training data and have zero training error.</p>
<p><img src="/assets/images/2021-10-28-cl20/separators.jpeg" alt="" /></p>
<p>The <em>maximum-margin classifier</em> chooses the separating hyperplane that, well, maximizes the margins between the separator and the two classes.
In the below visualization, the yellow separator is the hyperplane orthogonal to the vector \(w\) that most decisively classifies every positive and negative sample correctly.
That is, none of the samples are close to the separator, and \(w\) is chosen to have the largest <em>margin</em>, or gap between the data and the separator.
The space between the solid separator and the two dashed lines is the margin, a sort of demilitarized zone between the two classes of samples.</p>
<p><img src="/assets/images/2021-10-28-cl20/margin.jpeg" alt="" /></p>
<p>In order to quantify the margin, we require that \(w\) is chosen to ensure that \(y_i \langle w, x_i\rangle \geq 1\) for labels \(y_i \in \{-1,1\}\).
The width of the margin can be computed to be at least \(\frac1{\|w\|}\) if we enforce this requirement.
Therefore, the maximum-margin classifier is</p>
\[\mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^p} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>where \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}\) are the training samples.</p>
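To build intuition, this optimization can be solved by hand on a tiny example. With positive points \((1,0), (2,1)\) and negative points \((-1,0), (-2,-1)\), the max-margin separator is the vertical axis, i.e. \(w = (1, 0)\) with margin 1, since the points \((1,0)\) and \((-1,0)\) are only distance 2 apart. A minimal numpy sketch that solves the (bias-free) dual, \(\max_{\alpha \geq 0} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle\), by projected gradient ascent; the hand-rolled solver and dataset are purely illustrative:

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Dual of the bias-free hard-margin SVM: max_{alpha >= 0} sum(alpha) - 1/2 alpha^T Q alpha,
# where Q_ij = y_i y_j <x_i, x_j>.
Z = y[:, None] * X
Q = Z @ Z.T

alpha = np.zeros(4)
for _ in range(5000):
    # Projected gradient ascent: step along the gradient, clip at alpha >= 0.
    alpha = np.maximum(0.0, alpha + 0.05 * (1.0 - Q @ alpha))

w = X.T @ (alpha * y)        # primal solution recovered from the dual coefficients
margins = y * (X @ w)
print(w)                     # converges to (1, 0)
print(margins)               # the support vectors (1,0) and (-1,0) sit at margin 1
```

Note that the dual coefficients for \((2,1)\) and \((-2,-1)\) are driven to zero: those samples sit strictly outside the margin, so they are not support vectors.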
<p>Last week’s blog post discussed in detail why maximum-margin classifiers can lead to good generalization. Primarily, having large margins means that the classifier is robust and will correctly classify samples that are drawn near any of the training samples.
This is great, and provided a ton of insight into why overfitting is not always a bad thing. However, those results were limited in their applicability:</p>
<ul>
<li>They only apply to voting-based classifiers with margins, and this maximum-margin classifier does not aggregate together multiple weak classifiers.</li>
<li>Their bounds only apply to perfectly clean training data; if an \(\eta\)-fraction of the samples have incorrect labels, then their bounds fall apart.</li>
</ul>
<p>This paper suggests that these kinds of bounds are possible for the maximum margin classifier when the dimension is much larger than the number of samples.</p>
<p><em>Aside: Their formulation of the maximum-margin classifier is identical to that of the</em> support vector machine (SVM)<em>. The samples that lie on the margin (in our case, two red samples and two blue samples on the dotted lines) are</em> support vectors<em>, in terms of which the separator can be written. Classical capacity-based generalization approaches for SVMs rely on having few support vectors, but <a href="https://arxiv.org/abs/2005.08054" target="_blank">some</a> <a href="https://arxiv.org/abs/2011.09148" target="_blank">recent</a> <a href="https://arxiv.org/abs/2104.13628" target="_blank">works</a> have shown that generalization bounds can be proved in a setting with many support vectors. <a href="https://arxiv.org/abs/2105.14084" target="_blank">One of my papers</a>, which will appear at NeurIPS 2021 (and which I’ll discuss in a forthcoming blog post), characterizes when</em> support vector proliferation<em>, a phenomenon where every sample is a support vector, occurs.</em></p>
<h2 id="data-model">Data model</h2>
<p>Like the linear regression papers we’ve discussed, this paper exhibits the phenomenon of benign overfitting under strict distributional assumptions. We present a simplified version of their data model below.</p>
<ul>
<li>A label \(\tilde{y} \in \{-1,1\}\) is chosen by a coin flip. With probability \(\eta\) (which can be no larger than some constant less than 1), the label is <em>corrupted</em> and \(y = - \tilde{y}\). Otherwise, \(y = \tilde{y}\).</li>
<li>For some <em>mean vector</em> \(\mu \in \mathbb{R}^p\) and some \(q\) drawn from a \(p\)-dimensional subgaussian distribution with a lower-bound on expected norm, the input \(x\) is chosen to be \(q + \tilde{y} \mu\).</li>
</ul>
<p>That is, the inputs fall into one of two clusters: around \(\mu\) if \(\tilde{y} = 1\), and around \(-\mu\) if \(\tilde{y} = -1\).
Intuitively, this means the learning problem is much easier if \(\|\mu\|\) is large, because the clusters will be more sharply separated.</p>
<p><img src="/assets/images/2021-10-28-cl20/data.jpeg" alt="" /></p>
<p>The data model is limited by the fact that they assume this kind of two-cluster structure. However, it’s intended as a proof of concept of sorts, and the setup allows one to explore how changing the number of samples \(n\), the dimension \(p\), and the distinctiveness of classes \(\|\mu\|^2\) shapes which bounds are possible.</p>
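A sampler for (a simplified Gaussian instance of) this data model is a few lines of numpy. One useful sanity check: since \(\tilde{y}\) is flipped with probability \(\eta\), the signal visible from the observed labels is \(\mathbb{E}[y \cdot x] = (1 - 2\eta)\mu\). The parameter values below are mine, for illustration:

```python
import numpy as np

def sample(n, mu, eta, rng):
    """Draw (x, y) pairs with x = q + y_tilde * mu and labels flipped w.p. eta."""
    p = mu.shape[0]
    y_tilde = rng.choice([-1.0, 1.0], size=n)                 # clean labels
    flip = rng.random(n) < eta                                # corruption mask
    y = np.where(flip, -y_tilde, y_tilde)                     # observed (noisy) labels
    x = rng.standard_normal((n, p)) + y_tilde[:, None] * mu   # Gaussian q, shifted by +/- mu
    return x, y

rng = np.random.default_rng(0)
mu = np.zeros(10)
mu[0] = 3.0                                # clusters separated along the first coordinate
x, y = sample(20000, mu, eta=0.1, rng=rng)
print((y[:, None] * x).mean(axis=0))       # close to (1 - 2*eta) * mu = (2.4, 0, ..., 0)
```

Note that label flips shrink the usable signal by a factor of \(1 - 2\eta\), which is why the \(\eta\) noise floor in the generalization bound below is unavoidable.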
<p>They give several examples of this data model, and I’ll recount their Example 3, which they call the <em>Boolean noisy rare-weak model.</em>
They sample \(y\) and \(\tilde{y}\) as above.
\(x\) is drawn from a distribution over \(\{-1,1\}^p\), where \(x_1, \dots, x_s\) independently equal \(\tilde{y}\) with probability \(\frac12 + \gamma\) and \(-\tilde{y}\) otherwise, for some \(s \leq p\) and \(\gamma \in (0, \frac12)\). \(x_{s+1}, \dots, x_p\) are the results of independent fair coin tosses.</p>
<h2 id="main-result">Main result</h2>
<p>Their main result is a generalization bound for this two-cluster data model.
The result relies on several assumptions about \(n\), \(d\), and \(\mu\).</p>
<p><em><strong>Theorem 4:</strong> Suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\|\mu\|^2n, n^2 \log n))\), (3) \(\|\mu\|^2 = \Omega(\log n)\), and (4) \(p = O(\|\mu\|^4 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>The main inequality is a bound on the generalization error of the classifier \(w\) because it deals with new samples, rather than the ones used to train the classifier.
The \(\eta\) term in the error is unavoidable, because any sample will be corrupted with probability \(\eta\).
The \(\epsilon\) term is the more interesting one, which governs the excess error.</p>
<p>The requirement that \(p = \Omega(n^2 \log n)\) means the model must be in a <em>very</em> high-dimensional regime. Recall that papers like <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a> consider a regime where \(p = \Theta(n)\); here, this paper only says anything about generalization when \(p\) is much larger than \(n\). We also require pretty specific conditions about \(\mu\).</p>
<p>To make life easier, let \(\mu = (q, 0, \dots, 0) \in \mathbb{R}^p\). The excess error can only be small if \(q \gg p^{1/4}\). Since it must also be the case that \(q \ll \sqrt{p/ n}\), this gives a relatively narrow interval that \(q\) can belong to.</p>
<p>They formulate the theorem specifically for the example we consider as well.</p>
<p><em><strong>Corollary 6:</strong> For the Boolean noisy rare-weak model, suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\gamma^2 s n, n^2 \log n))\), (3) \(\gamma^2 s = \Omega(\log n)\), and (4) \(p = O(\gamma^4 s^2 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>This means that if \(\gamma\) is some constant like \(0.25\), it must be true that \(s \gg \sqrt{p}\) and \(s \ll p/n\).
Therefore, only a small fraction of the dimensions of \(x\) can be indicative of the label \(y\), and most of the input is just noise.
Or, if \(s = p\) and every feature is significant, then \(\gamma\) must satisfy \(\gamma \ll 1/\sqrt{n}\) and \(\gamma \gg 1 / p^{1/4}\), which means that each feature will only have a minute amount of signal.
This closely resembles the kinds of settings that we showed have good generalization for linear regression in <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a> long ago.</p>
<h2 id="proof-overview">Proof overview</h2>
<p>The proof relies on a proof by <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a> that using gradient descent to optimize logistic regression for separable data gives a separating hyperplane that maximizes margins.
That is, gradient descent with a logistic loss function has an implicit bias that leads to the same solution as to that of an SVM.</p>
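This implicit-bias result can be illustrated directly: run plain gradient descent on the logistic loss for a tiny separable dataset and watch the iterate separate the data, with its direction slowly drifting toward the max-margin separator. (The convergence in direction is known to be only logarithmic in the iteration count, so the sketch below checks separation rather than exact agreement; the dataset and step size are my own choices, and the max-margin direction for this dataset is \((1, 0)\).)

```python
import numpy as np

# Tiny linearly separable dataset; the max-margin direction is (1, 0).
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def logistic_grad(w):
    """Gradient of the mean logistic loss  (1/n) sum_i log(1 + exp(-y_i <w, x_i>))."""
    m = y * (X @ w)                         # margins of the current iterate
    return -(X.T @ (y / (1.0 + np.exp(m)))) / len(y)

w = np.zeros(2)
for _ in range(5000):
    w -= 0.1 * logistic_grad(w)             # plain gradient descent

print(y * (X @ w))                          # all margins positive: the data is separated
print(w / np.linalg.norm(w))                # direction drifting toward (1, 0)
```

The norm of \(w\) grows without bound (the logistic loss has no finite minimizer on separable data); it is only the normalized direction \(w/\|w\|\) that converges, which is why the SVM comparison is stated in the limit.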
<p>In Lemma 9, they use a simple concentration bound to show that the generalization error is small if \(\langle w, \mu\rangle\) is large relative to \(\|w\|\), where \(\mu\) is the mean vector and \(w\) is the learned classifier.
They relate this to the classifiers obtained in each step of gradient descent \(v^{(t)}\) and bound \(\langle v^{(t)}, \mu\rangle\) by expanding the gradient step to write \(v^{(t)}\) in terms of all previous risks.
Taking a limit of \(t \to \infty\) relates this to the maximum-margin classifier.</p>
<p>Lemma 10 lower-bounds the target inner product. A key component of the proof of that is Lemma 14, which shows that the loss caused by any one sample cannot be much more than that of any other sample with high probability.
This is important because it means that the noisy samples (with flipped \(\mu\)) cannot have outsize impact on the result, and that the analysis is robust to those errors.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>This paper was neat, since it showed something similar to what was uncovered about minimum-norm linear regression by a variety of papers previously surveyed.
It’s neat to also see this as a strengthening of the margin work discussed last week under boosting, since these results work for samples with noisy labels and for non-voting margin-based classifiers.</p>
<p>However, they’re limited by degree of over-parameterization/the size of the dimension needed; \(p = \Omega(n^2 \log n)\) is a pretty steep requirement, especially since results like my <a href="https://arxiv.org/abs/2105.14084" target="_blank">OLS=SVM paper</a> suggest that minimum-norm regression (with samples drawn with labels in \(\{-1,1\}\)) and maximum-margin classifiers coincide when \(p = \Omega(n \log n)\).
They specifically identify the improvement on the dependence of \(p\) as motivation for future work, and I hope to see that tackled at some point.</p>
<p><em>Thanks for reading this week’s entry! The actual exam is coming up on November 16th, and you should expect at least two more posts about papers before then!</em></p>
<h1>[OPML#8] FS97 & BFLS98: Benign overfitting in boosting (2021-10-20)</h1>
<p><em>This is the eighth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p><em>In other news, there’s <a href="https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011/" target="_blank">a cool Quanta article</a> that touches on over-parameterization and the analogy between neural networks & kernel machines that just came out. Give it a read!</em></p>
<p>When conducting research on the theoretical study of neural networks, it’s common to joke that one’s work was “scooped” by a paper in the 1990s.
There’s a lot of classic ML theory work that was published well before the deep learning boom of the last decade.
As a result, it’s common for researchers to ignore it and unknowingly repackage old ideas as novel.</p>
<p>This week, I finally escape my pattern of discussing papers from the ’10s and ’20s by presenting a pair of seminal papers from the late ’90s: <a href="https://www.sciencedirect.com/science/article/pii/S002200009791504X" target="_blank">FS97</a> and <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-26/issue-5/Boosting-the-margin--a-new-explanation-for-the-effectiveness/10.1214/aos/1024691352.full" target="_blank">BFLS98</a>.
Both of these papers cover <em>boosting</em>, a learning algorithm that aggregates many <em>weak learners</em> (heuristics that perform just better than chance) into a much better prediction rule.</p>
<ul>
<li>FS97 introduces the <em>AdaBoost</em> algorithm, proves that it can combine weak learners to perfectly fit a training dataset, and gives generalization bounds based on VC-dimension.
The authors note that empirically, the algorithm performs much better than these capacity-based bounds and exhibits some form of <em>benign overfitting</em> (which has been extensively discussed in posts like <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="/2021/09/11/xh19.html" target="_blank">[OPML#6]</a>).</li>
<li>BFLS98 addresses that mystery and resolves it by giving a different type of generalization bound, a <em>margin-based bound</em>, which explains why the generalization performance of AdaBoost continues to improve after it correctly classifies the training data.</li>
</ul>
<p>These papers fit into the series because they exhibit a very similar phenomenon to the one we frequently encounter with over-parameterized linear regression and in deep neural networks:
A learning algorithm is trained to zero training error and has small generalization error, despite capacity-based generalization bounds suggesting that this should not occur.
Moreover, the generalization error continues to decrease as the model becomes “more over-parameterized” and continues to train beyond zero training error.
These papers highlight the significance of <em>margin bounds</em>, which have been studied in papers <a href="https://arxiv.org/abs/1909.12292" target="_blank">like</a> <a href="https://arxiv.org/abs/1706.08498" target="_blank">these</a> in the context of neural network generalization.</p>
<p>We’ll jump in by explaining boosting, before discussing capacity-based and margin-based generalization bounds and the connection to benign overfitting.</p>
<h2 id="boosting">Boosting</h2>
<p>We motivate and discuss the boosting algorithm presented in FS97.</p>
<h3 id="population-training-and-generalization-errors">Population, training, and generalization errors</h3>
<p>To motivate the problem, consider a setting where the goal is to learn a classifier from training data.
That is, you (the learner) have \(m\) samples \(S = \{(x_1, y_1), \dots, (x_m, y_m)\} \subset X \times \{-1,1\}\) drawn independently from some distribution \(\mathcal{D}\).
The goal is to learn some <em>hypothesis</em> \(h: X \to \{-1,1\}\) with low population error, that is</p>
\[\text{err}_{\mathcal{D}}(h) = \text{Pr}_{(x, y) \sim \mathcal{D}}[h(x) \neq y].\]
<p>To do so, we follow the strategy of <em>empirical risk minimization</em>, that is choosing the \(h\) that minimizes <em>training error</em>:</p>
\[\text{err}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\{h(x_i) \neq y_i\}.\]
<p>Often, the goal is to obtain a <em>PAC learning</em> (Probably Approximately Correct learning) guarantee, which entails showing that there exists some learning algorithm that, with probability \(1 - \delta\), returns a hypothesis \(h\) such that \(\text{err}_{\mathcal{D}}(h) \leq \epsilon\) in time polynomial in \(\frac{1}{\epsilon}\) and \(\frac{1}{\delta}\) for any small \(\epsilon, \delta > 0\).</p>
<p>We can decompose the population error into two terms and analyze when algorithms succeed and fail based on the two:</p>
\[\text{err}_{\mathcal{D}}(h) = \underbrace{\text{err}_{\mathcal{D}}(h)-\text{err}_S(h)}_{\text{generalization error}} + \underbrace{\text{err}_S(h)}_{\text{training error}}.\]
<p>This framing implies two very different types of failure modes.</p>
<ol>
<li>If the training error is large when \(h\) is an empirical risk minimizing hypothesis, then there is a problem with expressivity. In other words, no hypothesis closely fits the training data, which means that very likely no hypothesis will succeed on random samples drawn from \(\mathcal{D}\).</li>
<li>If the generalization error is large, then the sample \(S\) is not representative of the distribution \(\mathcal{D}\). <em>Overfitting</em> refers to the issue where the training error is small and the generalization error is large; the hypothesis does a good job memorizing the training data, but it learns little of the actual underlying learning rule because there aren’t enough samples. This typically occurs when \(h\) comes from a family of hypotheses that are <em>too complex.</em></li>
</ol>
<p>We can visualize these trade-offs with respect to the model complexity below, as they’re understood by traditional capacity-based ML theory. (There’s a very similar image in the introductory post of this blog series.)</p>
<p><img src="/assets/images/2021-10-20-boosting/descent.jpeg" alt="" /></p>
<p>While these blog posts focus on problematizing this picture by exhibiting cases where there is <em>both</em> overfitting and low generalization error, we introduce boosting in the context of solving the opposite problem: What do you do when the model complexity is too low, and no hypotheses do a good job of even fitting the training data?</p>
<h3 id="limitations-of-linear-classifiers">Limitations of linear classifiers</h3>
<p>Consider the following picture:</p>
<p><img src="/assets/images/2021-10-20-boosting/redblue.jpeg" alt="" /></p>
<p>Suppose our goal is to find the best linear classifier that separates the red data (+1) from the blue data (-1) and (ideally) will also separate new red data from new blue data.
However, there’s an immediate problem: no linear classifier achieves training error much better than \(\frac13\) on this data. For instance, the following separator (which labels everything with \(\langle w, x\rangle > 0\) red and everything else blue) for some vector \(w \in \mathbb{R}^2\) performs poorly on the upper “slice” of red points and the lower slice of blue points.</p>
<p><img src="/assets/images/2021-10-20-boosting/line1.jpeg" alt="" /></p>
<p>Neither of these is any good either.</p>
<p><img src="/assets/images/2021-10-20-boosting/line23.jpeg" alt="" /></p>
<p>All three of the above linear separators have roughly a \(\frac23\) probability of classifying a sample correctly, but they each miss a different slice of the data.
A natural question to ask is: Can these three separators be combined in some way to improve the training error of the classifier?</p>
<p>The answer is yes. By taking a <em>majority vote</em> of the three, one can correctly classify all of the data. That is, if at least two of the three linear classifiers think the point is red, then the final classifier predicts that the point is red.
The following is a visualization of how this voting scheme works. (Maroon regions have 2 separators saying “red” and are classified as red. Purple regions have 2 separators saying “blue” and are classified as blue.)</p>
<p><img src="/assets/images/2021-10-20-boosting/vote.jpeg" alt="" /></p>
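<p>This voting scheme is easy to reproduce numerically. Below is a small sketch with made-up data: three fixed linear separators, each individually weak, whose two-of-three majority vote fits every point. (The separators and data here are my own illustrative choices, not taken from the figure.)</p>

```python
import random

def h1(x): return 1 if x[0] > 0 else -1           # vertical separator
def h2(x): return 1 if x[1] > 0 else -1           # horizontal separator
def h3(x): return 1 if -x[0] - x[1] > 0 else -1   # diagonal separator

def majority(x):
    # predict red (+1) whenever at least two of the three separators say red
    return 1 if h1(x) + h2(x) + h3(x) > 0 else -1

random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
labels = [majority(x) for x in points]   # ground truth defined as the 2-of-3 vote

def accuracy(h):
    return sum(h(x) == yi for x, yi in zip(points, labels)) / len(points)

print([round(accuracy(h), 2) for h in (h1, h2, h3)])  # each well below 1: weak
print(accuracy(majority))                             # 1.0: the vote fits everything
```

<p>Each separator alone errs exactly on the region where it is outvoted by the other two, yet the aggregate classifier is correct everywhere by construction.</p>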
<p>We increase the complexity of the model (by aggregating together three different classifiers), which gets us down to zero training error in this case.
This helps solve the issue of approximation, but it presents a new one for generalization. Can we expect this new “voting” classifier to generalize well, given that it’s more complex than a single linear classifier?</p>
<p><em>Boosting</em> is an algorithm that formalizes this voting logic in order to string together a bunch of weak classifiers into one that performs well on all of the training data. In the last two sections of the blog post, we give two takes on generalization of boosting approaches, to answer the aforementioned question about whether we expect this kind of overfitting to hurt or not.</p>
<h3 id="weak-learners">Weak Learners</h3>
<p>The linear classifiers above are examples of <em>weak learners</em>, which perform slightly better than chance on the training data and which we combine together to make a stronger learner.</p>
<p>To formalize that concept, we say that a learning algorithm is a <em>weak learning algorithm</em> or a <em>weak learner</em> if it can PAC-learn a family of functions \(\mathcal{C}\) with error \(\epsilon = \frac12 - \eta\) for some advantage \(\eta > 0\), with probability \(1- \delta\), where samples are drawn from some distribution \(\mathcal{D}\).</p>
<p>The idea with weak learning in the context of boosting is that you use the weak learning algorithm to obtain a classifier \(h\) that weak-learns the family over some weighted distribution of the samples.
Then, the distribution can be modified accordingly, in order to ensure that the next weak learner performs well on the samples that the original hypothesis performed poorly on.
In doing so, we gradually find a cohort of weak classifiers, such that each sample is correctly classified by a large number of weak learners in the cohort.</p>
<p><img src="/assets/images/2021-10-20-boosting/wl.jpeg" alt="" /></p>
<p>The graphic visualizes this flow.
The top-right image represents the first weak classifier found on the distribution that samples evenly from the training data. It performs well on at least \(\frac23\) of the samples.
Then, we want the weak learning algorithm to give another weak classifier, but we want it to be different and ensure that other samples are correctly classified, particularly the ones misclassified by the first one.
Therefore, we amplify those misclassified samples in the distribution (bottom-left) and learn a new learning rule on that reweighted distribution.
For that learning rule to qualify as a weak learner, it must classify \(\frac23\) of the <em>weighted</em> samples correctly. To do so, it’s essential that it correctly classifies the previously-misclassified samples.
Hence, it chooses a different rule.
Continuing to iterate this will give a wide variety of weak learners.</p>
<p>This intuition is formalized in the AdaBoost algorithm.</p>
<h3 id="adaboost">AdaBoost</h3>
<p>Here’s how the algorithm works, as stolen from FS97.</p>
<ul>
<li>Input: some input set of samples \((x_1, y_1), \dots, (x_m, y_m)\), a number of rounds \(T\), and a procedure <strong>WeakLearn</strong> that outputs a weak learner given a distribution over samples.</li>
<li>Initialize \(w^1 = \frac{1}{m} \vec{1} \in [0,1]^m\) to be a uniform starting distribution over training samples. (Note: the algorithm in the paper works for a general starting distribution, but we stick to the uniform distribution for simplicity.)</li>
<li>For round \(t \in [T]\), do the following:
<ol>
<li>Update the probability distribution by normalizing the current weight vector: \(p^t = \frac{1}{\|w^t\|_1} w^t.\)</li>
<li>Use <strong>WeakLearn</strong> to obtain a weak learner \(h_t: X \to [-1,1]\).</li>
<li>Calculate the error of \(h_t\) on the <em>weighted</em> training samples: \(\epsilon_t = \frac12 \sum_{i=1}^m p_i^t \lvert h_t(x_i) - y_i\rvert\). (Note: this differs by a factor of \(\frac12\) from the version presented in the paper because we assume the output of the functions to be \([-1,1]\) rather than \([0,1]\).)</li>
<li>Let \(\beta_t = \frac{\epsilon_t}{1 - \epsilon_t} \in (0,1)\) inversely represent roughly how much weight should be assigned to \(h_t\) in the final classifier. (If \(h_t\) has small error, then it’s a “helpful” classifier that should be given more priority.)</li>
<li>Adjust the weight vector by down-weighting samples that were accurately classified by \(h_t\). For all \(i \in [m]\), let</li>
</ol>
\[w_i^{t+1} = w_i^t \beta_t^{1 - |h_t(x_i) - y_i|/2},\]
<p>so that correctly classified samples are scaled by \(\beta_t\) and misclassified samples keep their weight. (As with \(\epsilon_t\), the exponent is halved relative to the paper because outputs lie in \([-1,1]\) rather than \([0,1]\).)</p>
</li>
<li>
<p>Output the final classifier, a weighted majority vote of the weak learners:</p>
\[h_f(x) = \text{sign}\left(\sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t} \right).\]
<p>(This also differs from the final hypothesis in the paper because of the difference in output.)</p>
</li>
</ul>
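<p>To make the procedure concrete, here is a minimal Python sketch of the loop above, using exhaustive decision stumps as a stand-in for <strong>WeakLearn</strong>. The toy data, stump family, and variable names are my own illustrative choices, not FS97’s.</p>

```python
import math

def stump_predict(feat, thresh, sign, x):
    # a threshold classifier: sign if x[feat] > thresh, else -sign
    return sign if x[feat] > thresh else -sign

def best_stump(X, y, p):
    # stand-in for WeakLearn: exhaustively pick the stump with the
    # smallest weighted error under the distribution p
    best = None
    for feat in range(len(X[0])):
        for thresh in sorted({x[feat] for x in X}):
            for sign in (1, -1):
                err = sum(pi for pi, x, yi in zip(p, X, y)
                          if stump_predict(feat, thresh, sign, x) != yi)
                if best is None or err < best[0]:
                    best = (err, (feat, thresh, sign))
    return best

def adaboost(X, y, T):
    m = len(X)
    w = [1.0 / m] * m                       # uniform starting weights
    ensemble = []                           # (log(1/beta_t), stump) pairs
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]        # step 1: normalize
        eps, stump = best_stump(X, y, p)    # steps 2-3
        eps = min(max(eps, 1e-12), 0.5 - 1e-12)
        beta = eps / (1 - eps)              # step 4
        ensemble.append((math.log(1 / beta), stump))
        for i in range(m):                  # step 5: down-weight correct samples
            if stump_predict(*stump, X[i]) == y[i]:
                w[i] *= beta
    def h_f(x):                             # final weighted-majority vote
        s = sum(a * stump_predict(*st, x) for a, st in ensemble)
        return 1 if s > 0 else -1
    return h_f

# toy 1-D data that no single stump fits, but a few boosted stumps do
X = [(1,), (2,), (3,), (4,)]
y = [1, 1, -1, 1]
h_f = adaboost(X, y, T=5)
print([h_f(x) for x in X])  # -> [1, 1, -1, 1]
```

<p>Note that the weak learner is re-fit from scratch each round against the reweighted distribution, mirroring the role of <strong>WeakLearn</strong> in the description above.</p>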
<p>This formalizes the process illustrated above, where we rely on <strong>WeakLearn</strong> to produce learning rules that perform well on samples that have been misclassified frequently in the past.</p>
<p>Why is it called <strong>Ada</strong>Boost?
Unlike previous (less famous) boosting algorithms, it doesn’t require that all of the weak learners have minimum accuracy that is known to the algorithm.
Rather, it can work with all errors \(\epsilon_t\) and hence <em>adapt</em> to the samples given.</p>
<p>It’s natural to ask about the theoretical properties of the algorithm.
Specifically, can AdaBoost successfully aggregate a bunch of weak learners into a “strong learner” that classifies all but an \(\epsilon\) fraction of the training samples for any \(\epsilon\)?
And if so, how many rounds \(T\) are needed?
And how small must we expect \(\epsilon_t\) (the error of each weak learner) to be?
This leads us to the main AdaBoost theorem.</p>
<p><em><strong>Theorem 1</strong> [Performance of AdaBoost on training data, Theorem 6 of FS97]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the error of the final hypothesis \(h_f\) is bounded by</em></p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)}.\]
<p>From this, one can naturally ask: How long will it take to classify all of the training data? For that to be the case, it suffices to show that \(\epsilon < \frac1m\), because there are only \(m\) samples and they cannot be “fractionally” correct.</p>
<p>For the sake of simplicity, we calculate the \(T\) necessary when \(\epsilon_t \leq 0.4\). (That is, each weak learner has an advantage of at least 0.1 over random guessing.)</p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)} \leq 2^T (0.24)^{T/2} = (2 \sqrt{0.24})^T < \frac{1}{m},\]
<p>which occurs when</p>
\[T > \frac{\log m}{\log (1 / (2 \sqrt{0.24}))} \approx 49 \ln m \approx 113 \log_{10} m.\]
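<p>This arithmetic is easy to sanity-check numerically (my own quick computation, not from the paper): the per-round factor \(2\sqrt{0.24}\) is just under 1, so the bound drops below \(\frac1m\) after \(O(\log m)\) rounds.</p>

```python
import math

# per-round shrinkage factor when every weak learner has error 0.4
factor = 2 * math.sqrt(0.4 * 0.6)   # = 2*sqrt(0.24), just below 1

for m in (100, 10_000, 1_000_000):
    # smallest T making the Theorem 1 bound smaller than 1/m
    T = math.ceil(math.log(m) / math.log(1 / factor))
    assert factor ** T < 1 / m      # zero training error is then guaranteed
    print(m, T)                     # T grows like 49 * ln(m)
```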
<p>This is a really nice bound to have! It tells us that the training error can be rapidly bounded, despite only having the ability to aggregate classifiers that perform slightly better than chance.</p>
<p>The proof is simple and elegant, and I won’t go into it in depth.
It’s well-explained by the paper, but much of it boils down to the intuition that if a training sample is neglected by many weak learners, then its emphasis continues to increase until it can no longer be ignored without meeting the weak learnability error guarantees.</p>
<p>Despite all of these nice things, this theorem is limited. It only covers the performance of the weighted majority classifier on the training data and says nothing about generalization.
Indeed, it’s reasonable to fret about the generalization performance of this aggregate classifier.
If we substantially increased the expressivity of the weak learning classifiers by combining them, then wouldn’t capacity-based generalization theory tell us that this gain must trade off against generalization?
And isn’t it further compromised by the fact that training for a relatively small number of rounds leads to an aggregate hypothesis that perfectly fits the training data?</p>
<p>We focus for the remainder of the post on generalization, first examining it through the lens of classical capacity-based generalization theory, as done by FS97.</p>
<h2 id="capacity-based-generalization">Capacity-based generalization</h2>
<p>Looking back on the first visual of this post, classical learning theory has a simple narrative for what boosting does:</p>
<ul>
<li>The individual weak classifiers provided by <strong>WeakLearn</strong> lie on the left side of the curve (low generalization error, high training error) because they have a poor training error. Thus, they cannot fit complex patterns and are likely intuitively “simple,” which could translate to a low VC-dimension and hence a low generalization error.</li>
<li>As each stage of the boosting algorithm runs, the aggregate classifier moves further to the right, improving training error at the cost of generalization error. After sufficiently many rounds \(T\) have occurred to drive the training error to zero, the generalization error will be so large as to make any bound on population error vacuous.</li>
</ul>
<p>This intuition is made explicit by the generalization bound presented by FS97, which bounds the VC-dimension of a majority vote of classifiers with individual VC-dimension at most \(d\) and applies the standard VC-dimension bound on generalization.</p>
<p>They get the following bound, which combines their Theorem 7 and Theorem 8.</p>
<p><em><strong>Theorem 2</strong> [Capacity-based generalization bound] Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Suppose <strong>WeakLearn</strong> outputs hypotheses from a class \(\mathcal{H}\) having \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), the following inequality holds for all final hypotheses \(h_f\) that can be returned by AdaBoost:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) \leq \underbrace{\text{err}_{S}(h_f)}_{\text{training error}} + \underbrace{O\left(\sqrt{\frac{dT\log(T)\log(m/dT) + \ln\frac1{\delta}}{m}}\right)}_{\text{generalization error}}.\]
<p>This bound fits cleanly into the intuition described above.
To keep the generalization error small, \(T\) and \(d\) must be kept small relative to the number of samples. Doing so forces the training error to be large, because Theorem 1 suggests that \(h_f\) will have small training error when (1) AdaBoost runs for many iterations (large \(T\)) or (2) <strong>WeakLearn</strong> produces accurate classifiers, which requires an expressive family of weak learners (large \(d\)).
Hence, we’re necessarily trading off the two types of error.</p>
<p>However, this isn’t the full story.
When running experiments, they confirmed that after many rounds, the training error approached zero (as expected by Theorem 1).
But they also found that the test error dropped along with the training error <em>and</em> that the test error continued to drop even after the training error went to zero.
To explain this phenomenon, we turn to BFLS98, where the authors explain this low generalization error using <em>margin-based</em> bounds rather than capacity-based bounds.</p>
<p><img src="/assets/images/2021-10-20-boosting/general.jpeg" alt="" /></p>
<h2 id="margin-based-generalization">Margin-based generalization</h2>
<p>A key idea in the story about margin-based generalization is that a classifier that correctly and <em>decisively</em> categorizes all the training data is more robust (and more likely to generalize) than one that nearly categorizes samples incorrectly.
Roughly, slightly perturbing the samples in the first case will lead to samples that have the same labels, while that may not be the case in the second case.</p>
<p>Analyzing this requires considering some notion of <em>margin</em>, which quantifies the decisiveness of the classification.
For now, consider a modified version of the weighted majority classifier derived from AdaBoost:</p>
\[h_f(x) = \frac{1}{\sum_{t=1}^T \log\frac{1}{\beta_t}} \sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t}.\]
<p>The differences here are that we dropped the \(\text{sign}\) function and normalized by the total vote weight, which means the output may be anywhere in \([-1,1]\).
\(h_f\) categorizes the sample \((x,y)\) correctly if \(yh_f(x) > 0\), because the sign of \(h_f\) will then match \(y\).
We say that \(h_f\) categorizes a sample correctly <em>with margin \(\theta > 0\)</em> if \(yh_f(x) \geq \theta\).
This means that–if \(h_f\) is an aggregation of a large number of weak classifiers–then a small number of those classifiers changing their outcomes will not change the overall outcome of \(h_f\).</p>
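<p>As a tiny worked example (the numbers are mine, not from the papers), here is the normalized margin of a three-classifier weighted vote at two points:</p>

```python
import math

alphas = [math.log(3), math.log(5), math.log(4)]  # vote weights log(1/beta_t)
Z = sum(alphas)                                   # normalizer so h_f lies in [-1, 1]

votes = {"x1": (+1, +1, -1), "x2": (+1, -1, +1)}  # weak-learner outputs h_t(x)
labels = {"x1": +1, "x2": +1}

for name, v in votes.items():
    h_f = sum(a * vi for a, vi in zip(alphas, v)) / Z
    margin = labels[name] * h_f   # positive iff the weighted vote is correct
    print(name, round(margin, 3))
```

<p>Both margins are positive but well below 1: each point is classified correctly, yet a coordinated flip by the heavier-weighted classifiers could still change the outcome.</p>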
<p>There are two key steps that lead to new generalization bounds by BFLS98 for AdaBoost.</p>
<ol>
<li>AdaBoost (after sufficiently many rounds \(T\) and with sufficiently small weak learner errors \(\epsilon_t\)) will classify the sample \(S\) correctly with some margin \(\theta\).</li>
<li>Any linear combination of \(N\) classifiers (each of which has bounded VC dimension) with margin \(\theta\) on the training data has a generalization bound that depends on \(\theta\) and <em>not</em> on \(N\).</li>
</ol>
<p>They accomplish (1) by proving a theorem that is very similar in flavor and proof to the Theorem 1 we gave earlier.</p>
<p><em><strong>Theorem 3</strong> [Margins of AdaBoost on training data, Theorem 5 of BFLS98]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the final hypothesis \(h_f: X \to [-1,1]\) satisfies the following margin bound on the training set \((x_1, y_1), \dots, (x_m, y_m)\) for any \(\theta \in [0,1)\):</em></p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i) \leq \theta \}\leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t^{1-\theta}(1 - \epsilon_t)^{1 + \theta}}.\]
<p>To make matters more concrete once again, consider the case where \(\epsilon_t \leq 0.4\) as before.
Then, the bound gives</p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} \leq 2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2}.\]
<p>If we want all training samples to obey the condition, we enforce that the margin term is less than \(\frac1{m}\).
Consider two cases:</p>
<ul>
<li>By some calculations (with the help of WolframAlpha), if \(\theta = 0.1\), then \(y_i h_f(x_i) \geq \theta\) for all \(i \in [m]\) if \(T > 7260 \log m\). This is very similar to our application of Theorem 1, albeit with bigger constants.</li>
<li>
<p>If \(\theta = 0.2\), then</p>
\[2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2} = 2^T (0.4)^{0.4T}(0.6)^{0.6T} \approx 1.02^T,\]
<p>which grows exponentially in \(T\), so the bound can never guarantee margins that large, no matter how long AdaBoost runs.</p>
</li>
</ul>
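<p>The two cases above are easy to check numerically (my own check, not from BFLS98): the per-round factor \(2\sqrt{\epsilon^{1-\theta}(1-\epsilon)^{1+\theta}}\) sits just below 1 for \(\theta = 0.1\) and just above 1 for \(\theta = 0.2\) when \(\epsilon_t = 0.4\).</p>

```python
import math

def per_round_factor(eps, theta):
    # one round's contribution to 2^T * prod_t sqrt(eps^(1-theta) (1-eps)^(1+theta))
    return 2 * math.sqrt(eps ** (1 - theta) * (1 - eps) ** (1 + theta))

small = per_round_factor(0.4, 0.1)   # just below 1: bound shrinks with T
large = per_round_factor(0.4, 0.2)   # just above 1: bound is vacuous
print(round(small, 5), round(large, 5))
```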
<p>These bounds provide a way of finding a margin \(\theta\) dependent on \(T\) and errors \(\epsilon_1, \dots, \epsilon_T\), which will be useful in the second part.</p>
<p>To get (2), they prove a bound on the combination of weak learners with margin bounds.</p>
<p><em><strong>Theorem 4</strong> [Margin-based generalization; Theorem 2 of BFLS98]: Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Let \(\mathcal{H}\) be a family of “base classifiers” (weak learners) with \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), any weighted average \(h_f(x) = \sum_{j=1}^T p_j h^j(x)\) for \(p_j \in [0,1]\), \(\sum_j p_j = 1\), and \(h^j \in \mathcal{H}\) satisfies the following inequality:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) = \text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} + O\left(\sqrt{\frac{d \log^2(m/d)}{m\theta^2} + \frac{\log(1/\delta)}{m}}\right).\]
<p>This is fantastic compared to Theorem 2 because the generalization bound does not worsen as \(T\) increases.
The opposite effect actually occurs: as AdaBoost continues to run, Theorem 3 shows that the margin increases (up to a point), which strengthens the bound without trade-off!</p>
<p>We can instantiate the bound in the setting described above to show what a nice generalization bound can look like for boosting. If, once again, \(\epsilon_t \leq 0.4\), then taking \(\theta = 0.1\) and \(T = 7260\log m\) gives</p>
\[\text{err}_{\mathcal{D}}(h_f) = O\left(\sqrt{\frac{d \log^2(m/d) + \log(1/\delta)}{m}} \right).\]
<p>In this case, we can have our cake and eat it too; we increase the model complexity and expressivity by increasing \(T\), but we don’t sustain the basic trade-offs between training and generalization error discussed at the beginning of the post.</p>
<p>To illustrate why, we give a high-level overview of the proof and show how the rough intuition that “decisive classification leads to robustness, leads to generalization” holds up.</p>
<ul>
<li>The proof uses an approximation of \(h_f = \sum_{j=1}^T p_j h^j\) by sampling \(N\) classifiers \(\hat{h}_1, \dots, \hat{h}_N\) independently from \(h^1, \dots, h^T\) weighted by \(p_1, \dots, p_T\). It averages them together to obtain \(g = \frac1{N} \sum_{k=1}^N \hat{h}_k.\)</li>
<li>
<p>The proof decomposes the population error term into other quantities by using properties of conditional probability:</p>
\[\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \text{Pr}_{\mathcal{D}}\left[y g(x) \leq \frac{\theta}{2}\right] + \text{Pr}_{\mathcal{D}}\left[y g(x) > \frac{\theta}{2}, y h_f(x) \leq 0\right].\]
</li>
<li>The second term can be shown to be small with high probability over \(g\) when \(N\) and \(\theta\) are large, by a Chernoff bound. Since \(h_f = \mathbb{E}[g] = \mathbb{E}[\hat{h}_k]\), it’s unlikely that \(yg(x)\) and \(yh_f(x)\) will differ substantially from one another.</li>
<li>By principles of VC dimension, the <a href="https://en.wikipedia.org/wiki/Sauer%E2%80%93Shelah_lemma" target="_blank">Sauer-Shelah lemma</a>, and concentration bounds (this time over the <em>sample</em>) for large \(m\), the first term will be roughly the same as \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \}.\)</li>
<li>
<p>Using the same conditional probability argument as before, that same term can be decomposed into</p>
\[\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \} \leq \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \} + \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 , y_i h_f(x_i) > \theta\}.\]
</li>
<li>Using Chernoff bounds shows the second term of the expression is small with high probability over \(g\). Thus, \(\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0]\) is approximately bounded by \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \}\), plus an error term that accumulates as a result of the concentration bounds.</li>
<li>Having a large \(\theta\) means that we have plenty of room for the Chernoff bounds over \(g\) to be strong, which corresponds to the <em>robustness</em> discussed before. If \(\theta\) were small, then it would be very easy to have \(y h_f(x) \leq 0\) and \(y g(x) \geq \theta/2\) simultaneously, which would make the argument impossible.</li>
</ul>
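<p>The sampling step at the heart of this argument is easy to illustrate numerically (this construction is entirely my own): draw \(N\) classifiers according to the vote weights and watch the sub-vote \(g\) concentrate around \(h_f\) at a fixed point, which is what powers the Chernoff steps.</p>

```python
import random

random.seed(0)
T = 50
p = [1 / T] * T                                       # vote weights (uniform here)
outputs = [random.choice((-1, 1)) for _ in range(T)]  # h^j(x) at one fixed x
h_f_x = sum(pj * oj for pj, oj in zip(p, outputs))    # the full weighted vote

def g_x(N):
    # sub-vote: average of N classifiers sampled according to p
    draws = random.choices(outputs, weights=p, k=N)
    return sum(draws) / N

for N in (10, 100, 10_000):
    gap = sum(abs(g_x(N) - h_f_x) for _ in range(200)) / 200
    print(N, round(gap, 3))   # average |g(x) - h_f(x)| shrinks roughly like 1/sqrt(N)
```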
<h2 id="last-thoughts">Last thoughts</h2>
<p>I read these boosting papers in 2017 while taking my first graduate seminar, which surveyed a variety of papers in ML theory.
I enjoyed the papers then, but the significance of this generalization result was lost on me at the time.
Now, I find this much more exciting because it gives a setting where a model can obtain provably great generalization error despite overfitting the data and being “over-parameterized.” (If we count the number of parameters used in all of the classifiers that vote, there can be many more parameters than samples \(m\).)
The proof is elegant and does not require strange and adversarial distributions over training data.
Granted, the assumption that there exists a weak learner that always returns a classifier with error at most (say) 0.4 is a strong one, but the result is remarkable nonetheless.</p>
<p>Thanks for reading! Leave a comment if you have any thoughts or questions. (As long as the comments system isn’t buggy on your end–I’m still sorting out some issues.) See you next time!</p>Clayton Sanford[OPML#7] BLN20 & BS21: Smoothness and robustness of neural net interpolators2021-09-22T00:00:00+00:002021-09-22T00:00:00+00:00http://blog.claytonsanford.com/2021/09/22/bubeck<p><em>This is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>This post discusses two papers by Sébastien Bubeck and his collaborators that are of interest to the study of over-parameterized neural networks. The first, <a href="https://arxiv.org/abs/2009.14444" target="_blank">“A law of robustness for two-layers neural networks” (BLN20)</a> with Li and Nagaraj, gives a conjecture about the “robustness” of a two-layer neural network that interpolates all of the training data. The second, <a href="https://arxiv.org/abs/2105.12806" target="_blank">“A universal law of robustness via isoperimetry” (BS21)</a> with Sellke, proves part of the conjecture and extends that part of the conjecture to deeper neural networks.
The other part of the conjecture remains open for future work to tackle.</p>
<p>Both papers consider a setting where there are \(n\) training samples \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}\) drawn from some distribution that are fit by a neural network with \(k\) neurons.
For the two-layer case (which we’ll focus on in this writeup), they consider neural networks of the form</p>
\[f(x) = \sum_{j=1}^k u_j \sigma(w_j^T x + b_j),\]
<p>where \(\sigma(t) = \max(0, t)\) is the ReLU activation function and \(w_j \in \mathbb{R}^d\) and \(b_j, u_j \in \mathbb{R}\) are the parameters.
Roughly, they ask whether there exists a “smooth” neural network \(f\) such that \(f(x_i) \approx y_i\) for all \(i \in [n]\); this makes \(f\) an approximate interpolator.</p>
<p><em>How does this relate to the rest of this blog series?</em>
All of the other posts so far have been about cases where over-parameterized linear regression leads to favorable generalization performance.
These generalization results occur due to the smoothness of the linear prediction rule.
That is, if we have some prediction rule \(x \mapsto \beta^T x\) for \(x, \beta \in \mathbb{R}^d\) with \(d \gg n\), we might have good generalization if \(\|\beta\|_2\) is small, which is enabled when \(d\) is very large.
The same observation holds up with neural networks (over-parameterized models lead to benign overfitting), but it’s harder to prove why this leads to a small generalization error.
By understanding the smoothness of interpolating neural networks, it might make it easier to prove generalization bounds on the neural networks that perfectly fit the training data.</p>
<p><em>How do they measure smoothness?</em>
For linear regression, it’s natural to think of the smoothness of the prediction rule \(f_{\text{lin}}(x) = \beta^T x\) as \(\|\beta\|_2\), since that is the magnitude of the gradient \(\|\nabla f_{\text{lin}}(x)\|_2\) at every sample \(x\).
For two-layer neural networks—which are non-linear functions—it’s natural instead to consider the maximum norm of the gradient of \(f\), which is represented by the Lipschitz constant of \(f\): the minimum \(L\) such that \(|f(x) - f(x')| \leq L \|x - x'\|_2\) for all \(x, x'\). (Lipschitzness also comes up frequently in my <a href="/2021/08/15/hssv21.html" target="_blank">COLT paper about the approximation capabilities of shallow neural networks</a>.)</p>
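<p>To make the smoothness measure concrete, here is a small sketch (my own construction, not from the papers) of a two-layer ReLU network together with the simple upper bound \(\sum_j |u_j| \, \|w_j\|_2\) on its Lipschitz constant, which follows because ReLU is 1-Lipschitz:</p>

```python
import math
import random

random.seed(1)
d, k = 5, 8   # input dimension and width
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
b = [random.gauss(0, 1) for _ in range(k)]
u = [random.gauss(0, 1) for _ in range(k)]

def f(x):
    # f(x) = sum_j u_j * relu(<w_j, x> + b_j)
    return sum(uj * max(0.0, sum(wi * xi for wi, xi in zip(wj, x)) + bj)
               for uj, wj, bj in zip(u, W, b))

# since relu is 1-Lipschitz: |f(x) - f(x')| <= sum_j |u_j| ||w_j|| * ||x - x'||
lip_bound = sum(abs(uj) * math.sqrt(sum(wi * wi for wi in wj))
                for uj, wj in zip(u, W))

x = [random.gauss(0, 1) for _ in range(d)]
x2 = [xi + random.gauss(0, 0.01) for xi in x]
dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x2)))
assert abs(f(x) - f(x2)) <= lip_bound * dist   # finite-difference sanity check
print(round(lip_bound, 2))
```

<p>The papers ask when this kind of constant can be made small for an <em>interpolating</em> network, which is a much stronger demand than the crude norm bound checked here.</p>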
<p><em>What does it have to do with robustness?</em>
Typically, robustness is discussed in the context of adversarial examples.
If you’ve hung around the ML community, you’ve probably seen this issue featured in images like this:</p>
<p><img src="/assets/images/2021-09-22-bubeck/panda.png" alt="" /></p>
<p>Here, an image of a panda is provided that a trained image classification neural network clearly identifies as such.
However, a small amount of noise can be added to the image that leads to the network being tricked into thinking that it’s a gibbon instead.
Put roughly, it means that the network outputs \(f(x) = \text{"panda"}\) and \(f(x + \epsilon \tilde{x}) = \text{"gibbon"}\) for some \(x\) and \(\tilde{x}\), which means that the output of \(f\) changes greatly near \(x\).
By mandating that \(f\) have a small Lipschitz constant, these kinds of fluctuations are impossible.
This makes the network \(f\) <em>robust</em>.
Thus, enforcing smoothness conditions is a way to ensure that a predictor is robust to these kinds of adversarial examples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/smooth.jpeg" alt="" /></p>
<p>As a result, Bubeck and his collaborators want to characterize the availability of interpolating networks \(f\) that are also robust, with the hopes of understanding how over-parameterization can be used to avoid having adversarial examples.</p>
<p>One important caveat: Unlike the previous papers discussed in this series, this one focuses only on approximation and not optimization.
It asks whether <em>there exists</em> an interpolating prediction rule that is smooth, but it does not ask whether this rule can be easily obtained from stochastic gradient descent.</p>
<p>For the rest of the post, I’ll discuss the conjecture made by BLN20, share the support for the conjecture that was provided by BLN20 and BS21, and discuss what remains to be studied in this space.</p>
<h2 id="the-conjecture">The conjecture</h2>
<p>For simplicity, BLN20 considers only samples drawn uniformly from the unit sphere: \(x \in \mathbb{S}^{d-1}= \{x \in \mathbb{R}^d: \|x\|_2=1\}\) with iid labels \(y_i \sim \text{Unif}(\{-1,1\})\).
The conjecture of BLN20, which combines their Conjectures 1 and 2 is as follows:</p>
<p><em>Consider some \(k \in [\frac{cn}{d}, Cn]\) for constants \(c\) and \(C\). With high probability over \(n\) random samples from some distribution, there exists a 2-layer neural network \(f\) of width \(k\) that perfectly fits the data such that \(f\) is \(O(\sqrt{n/k})\)-Lipschitz.
Furthermore, any neural network that fits the data must be \(\Omega(\sqrt{n/k})\)-Lipschitz with high probability.</em></p>
<p>If true, the conjecture suggests there can only be an \(O(1)\)-Lipschitz interpolating neural network \(f\) if the model is highly over-parameterized, or \(k = \Omega(n)\).
Note that \(k\) is the number of neurons, and not the number of parameters.
In the case of a 2-layer neural network, the number of parameters is \(p = kd\), so there must be at least \(p = \Omega(nd)\) parameters for the interpolating network to be smooth.</p>
<p>The conditions with constants \(c\) and \(C\) are necessary for the question to be well-posed.</p>
<ul>
<li>Without the \(k \leq Cn\) constraint, the theorem would imply the existence of neural networks that fit the data and are \(o(1)\)-Lipschitz. However, this is not possible unless all training samples have the same label \(y_i\); otherwise, there are at least two different samples \(x_i\) and \(x_j\) that are at most distance 2 apart (since both lie on \(\mathbb{S}^{d-1}\)) and have opposite labels. This implies that any function fitting both samples must be at least 1-Lipschitz.</li>
<li>Without the \(k \geq \frac{cn}{d}\) constraint, there is unlikely to exist any neural network with \(k\) neurons that can fit the \(n\) samples. Since the number of parameters \(p\) is roughly \(kd\), letting \(k \ll \frac{n}{d}\) would ensure that \(p \ll n\) and there are fewer parameters than samples. Intuitively, it’s difficult to fit a large number of points with random labels when there are fewer parameters than samples. This suggests that the model must be over-parameterized for interpolation to even occur in the first place, let alone be smooth.</li>
</ul>
<p>BLN20 shows that the conjecture holds up empirically on toy data.
For many values of \(n\) and \(k\), they train several 2-layer neural networks of width \(k\) to fit the \(n\) samples and randomly sample gradients to find the one with the largest magnitude.
When plotted, they note a nice linear relationship between the norms of the largest random gradient and \(\sqrt{n/k}\).
Of course, the maximum random gradient is not the same as the Lipschitz constant, since it’s impossible to check the gradient for all values of \(x\) simultaneously, but this suggests that it’s likely that the conjecture is correct.</p>
<p><img src="/assets/images/2021-09-22-bubeck/plot.png" alt="" /></p>
<h2 id="partial-upper-bounds-from-bln20">Partial upper bounds from BLN20</h2>
<p>The BLN20 paper focuses on presenting the conjecture and giving a series of partial results that suggest it may be true. In this section, I give a brief summary of each partial solution.</p>
<p>The following are all partial solutions to the upper bound. That is, they show weaker versions of the claim that there exists neural network \(f\) with Lipschitz constant \(O(\sqrt{n/ k})\) by showing either larger bounds on the Lipschitz constant or more restrictive parameter regimes.</p>
<ul>
<li><strong>The high-dimensional case (3.1).</strong> If \(d \gg n\), then a ReLU network with a single neuron \(k = 1\) can be used to perfectly fit the data.
This is because a single \(d\)-dimensional hyperplane will be able to fit the \(n\) samples, so one can just choose the hyperplane with the lowest magnitude that fits the data and use a ReLU that corresponds to that hyperplane. By similar analysis to that of linear regression, the Lipschitz constant of this network will be \(O(\sqrt{n})\) with high probability, which is the same as \(O(\sqrt{n/ k})\). This can’t be improved without using more neurons.
<img src="/assets/images/2021-09-22-bubeck/single.jpeg" alt="" /></li>
<li><strong>The wide (“optimal size”) regime: \(k = n\) (3.2).</strong> With high probability, a \(10\)-Lipschitz network \(f\) can be obtained by using a ReLU for every sample. Each ReLU is treated as a “cap” that gives a sample the correct label. With high probability, the points will be sufficiently spread apart in \(\mathbb{S}^{d-1}\) to ensure that none of the caps overlap. This makes the norm of the gradient never more than \(10\), if each cap is offset by \(\frac{1}{10}\).
<img src="/assets/images/2021-09-22-bubeck/cap.jpeg" alt="" /></li>
<li><strong>The compromise case (3.3).</strong> The two previous approaches can be combined for a broader choice of \(k\) and \(n\) by instead having each ReLU perfectly fit \(m := n/k \leq d\) samples in a cap. However, since these are bigger and more complex caps than before, we need to be more concerned about the caps overlapping. They show that \(O(m \log d)\) caps will overlap at any given point, which means that the Lipschitz constant will be \(O(n\log (d) / k)\). Even disregarding the logarithmic factor, this is still much weaker than the \(O(\sqrt{n/k})\) bound that the conjecture desires.
<img src="/assets/images/2021-09-22-bubeck/combo.jpeg" alt="" /></li>
<li><strong>The very low-dimensional case with a weird architecture (3.4).</strong>
They prove the existence of a neural network that fits \(n\) samples and has Lipschitz constant \(O(\sqrt{n / k})\) with high probability. To do so, however, they need several major caveats:
<ul>
<li>The dimension \(d\) is very small; for some constant even integer \(q\), \(k = C_q d^{q-1}\) and \(n \approx \frac{d^q}{100 q \log d}\), where \(C_q\) depends on \(q\). Note that the number of neurons \(k\) can be much bigger than the number of samples \(n\) when \(d\) is very small and \(q\) is large.</li>
<li>\(f\) approximately interpolates the samples. That is, \(\lvert f(x_i) - y_i\rvert \leq 0.1 C_q\) for all \(i \in [n]\). (Note that 0.1 can be replaced by \(\epsilon\) and the result can be generalized.)</li>
<li>The neural network uses the activations \(t \mapsto t^q\) and not the ReLU function.</li>
</ul>
<p>This can be thought of as a tensor interpolation problem. Specifically, for \(q = 2\), they perform regression on the space \(x^{\otimes 2} = (x_1^2, x_1x_2, \dots, x_1 x_d,\dots x_2x_1, x_2^2, \dots, x_d^2)\) using the quadratic activation function.
This approach gives the kind of bound they’re looking for, but is a strange enough case that it’s unclear how to extend this to networks with (1) high input dimensions, (2) perfect interpolation, and (3) ReLU activations.</p>
</li>
</ul>
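<p>The “one cap per sample” construction from the wide regime (3.2) is simple enough to sketch in code. This toy version (mine, not the paper’s exact construction) relies on the fact that random points on \(\mathbb{S}^{d-1}\) have small pairwise inner products with high probability, so each ReLU cap activates only on its own sample:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200

# n random points on the unit sphere with random +/-1 labels
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

theta = 0.5  # cap threshold; w.h.p. all cross inner products fall below it

def f(x):
    # one ReLU "cap" per sample, active only in a neighborhood of that sample
    caps = np.maximum(X @ x - theta, 0.0) / (1.0 - theta)
    return caps @ y

preds = np.array([f(x) for x in X])  # interpolates the training labels
```

<p>Each cap contributes gradient norm \(\|x_i\|_2 / (1 - \theta) = 2\), so wherever at most one cap is active the network is \(2\)-Lipschitz, mirroring the \(O(1)\)-Lipschitz bound of the \(k = n\) regime.</p>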
<p>The paper also gives a few constrained versions of the lower bound on the Lipschitz constant for any interpolating function. However, we omit them here because the second paper—BS21—has much better lower bounds.</p>
<h2 id="lower-bound-from-bs21">Lower bound from BS21</h2>
<p>The follow-up paper proves a mostly-tight lower bound, which effectively resolves half of the conjecture.
The results require <em>isoperimetry</em> to hold, which is true of a random variable \(x \in \mathbb{R}^d\) if \(f(x)\) has subgaussian tails for every Lipschitz function \(f\).
This holds for well-known distributions such as (1) multivariate Gaussian distributions, (2) the uniform distribution on \(\mathbb{S}^{d-1}\), and (3) the uniform distribution on the hypercube \(\{-1, 1\}^d\).</p>
<p>By combining their Lemma 3.1 and Theorem 3, the following statement is true about 2-layer neural networks:</p>
<p><em>Let \(\mathcal{F}\) be a family of 2-layer neural networks of width \(k\) with parameters in \([-W, W]\). Suppose each sample \((x_i, y_i)\), \(i \in [n]\), is drawn from an isoperimetric distribution with \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\) and such that \(\| x_i \|_2 \leq R\) almost surely. Then, with high probability, any neural network \(f \in \mathcal{F}\) that perfectly fits all \(n\) training samples will have a Lipschitz constant of</em></p>
\[\Omega\left(\sqrt{\frac{n}{k \log (W R nk)}}\right).\]
<p>This is close to the conjecture up to logarithmic factors! In addition, the result proved in the paper is more general:</p>
<ul>
<li>Instead of considering only depth-2 neural networks, they consider all parametric models that change by bounded amounts as their parameter vectors change.</li>
<li>Within their study of neural networks, their analysis also addresses networks that share parameters.</li>
<li>A parameter \(\epsilon\) allows them to conclude that all networks that <em>nearly interpolate</em> must have high Lipschitz constant, not just those that perfectly fit the data.</li>
</ul>
<p>They also show that the bound \(W\) on the parameter magnitudes is necessary. Through their Theorem 4, they show the existence of a neural network with a small Lipschitz constant that approximates nearly all of the samples using only a single parameter.
Thus, without these kinds of assumptions, the conjecture is rendered uninformative.</p>
<p>The proof works by considering some fixed \(L\)-Lipschitz function \(f\) and asking how likely it is that \(n\) random samples are almost perfectly fit by \(f\).
By isoperimetry, this can be shown to happen with very low probability.
Then, by making use of an \(\epsilon\)-net argument, one can show that no \(L\)-Lipschitz function \(f\) can perfectly fit the samples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/cover.jpeg" alt="" /></p>
<p>While I breezed over the argument here, it’s a relatively simple one that can be followed by most people with some background in concentration inequalities.</p>
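<p>A toy simulation (my own, covering only the first step of the argument) shows the concentration at work: fix the \(L\)-Lipschitz function \(f(x) = L x_1\) on the sphere. Since \(f(x)\) concentrates around 0 with deviations of order \(L/\sqrt{d}\), it rarely comes within \(1/2\) of a \(\pm 1\) label on even a single random sample, and the probability of nearly fitting all \(n\) samples decays like the \(n\)th power of that:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, trials = 100, 2.0, 20000

# random points on the unit sphere
X = rng.standard_normal((trials, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# a fixed L-Lipschitz function: f(x) = L * x_1, which concentrates
# around 0 with standard deviation about L / sqrt(d) = 0.2
fx = L * X[:, 0]

# to come within 1/2 of a label y in {-1, +1}, f must have |f(x)| >= 1/2;
# fitting n independent samples then costs roughly p_single ** n
p_single = np.mean(np.abs(fx) >= 0.5)
```
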
<h2 id="further-questions">Further questions</h2>
<p>While the second paper resolves half of the open question from the first paper, the other half (the existence of a smooth interpolating neural network) remains open.</p>
<p>There are also a few caveats from the second paper that remain to be resolved. For one, it may be possible to loosen the restriction that there be non-zero label noise (i.e. \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\)).
In addition, the fact that \(\|x_i\|\) must always be bounded is a weakness, since it rules out Gaussian inputs; perhaps this could be improved.</p>
<p>Thanks for tuning in to this week’s blog post! See you next time!</p>Clayton SanfordThis is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam. Check out this post to get an overview of the topic and a list of what I’m reading.[OPML#6] XH19: On the number of variables to use in principal component regression2021-09-11T00:00:00+00:002021-09-11T00:00:00+00:00http://blog.claytonsanford.com/2021/09/11/xh19<!-- [XH19](https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf){:target="_blank"} [[OPML#6]](/2021/09/11/xh19.html){:target="_blank"} -->
<p><em>This is the 6th of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>Here’s another <a href="https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf" target="_blank">paper</a> by my advisor Daniel Hsu and his former student Ji (Mark) Xu that discusses when overfitting works in linear regression.
This one differs subtly from some of the previously discussed papers (like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>) in that it considers <em>principal component regression</em> (PCR) rather than least-squares regression.</p>
<h2 id="principal-component-regression">Principal component regression</h2>
<p>Suppose we have a collection of \(n\) samples \((x_i, y_i) \in \mathbb{R}^{N} \times \mathbb{R}\), which we collect in design matrix \(X \in \mathbb{R}^{n \times N}\) and label vector \(y \in \mathbb{R}^n\).
The standard approach to least-squares regression (which has been given numerous times on this blog) is to choose the \(\hat{\beta}_\textrm{LS} \in \mathbb{R}^N\) that minimizes \(\|X \hat{\beta}_\textrm{LS} - y\|_2\), breaking ties by minimizing the \(\ell_2\) norm \(\|\hat{\beta}_{\textrm{LS}}\|_2\).
This approach considers all dimensions of the inputs \(x_i\).</p>
<p>However, there might be a situation where we know the covariance matrix \(\Sigma\) of the inputs a priori and only want to consider the directions in \(\mathbb{R}^N\) along which the inputs meaningfully vary.
This is where <a href="https://en.wikipedia.org/wiki/Principal_component_regression" target="_blank">principal component regression</a> comes in.
Instead of regressing on the training data itself, we regress on the \(p\) most significant dimensions of the data, as identified by <a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal component analysis</a> (PCA).
PCA is a linear dimensionality reduction method that obtains a lower-dimensional representation of \(X\) by approximating each sample as a linear combination of the \(p\) eigenvectors of \(X^T X\) with the largest corresponding eigenvalues.
These \(p\) eigenvectors correspond to the directions in \(\mathbb{R}^N\) where the samples in \(X\) have highest variance.
Moreover, projecting each of the \(n\) samples \(x_i\) onto the space spanned by these \(p\) eigenvectors provides the closest average \(\ell_2\)-approximation of each \(x_i\) as a linear combination of \(p\) fixed vectors in \(\mathbb{R}^N\).</p>
<p>Let \(\mathbb{E}[x_i] = 0\) and \(\Sigma = \mathbb{E}[x_i x_i^T]\) be the covariance matrix of \(x_i\).
If we know \(\Sigma\) ahead of time, then we can simplify things by using only the eigenvectors of \(\Sigma\), rather than the empirical principal components taken from eigenvectors of \(X^T X\).
If the \(p\) eigenvectors of \(\Sigma\) with the largest eigenvalues are collected in \(V \in \mathbb{R}^{N \times p}\), then we can express the low-dimensional representation of the training samples as \(X V \in \mathbb{R}^{n \times p}\).
By applying linear regression to these new low-dimensional samples and transforming the resulting parameter vector back to \(\mathbb{R}^N\), we get the parameter vector \(\hat{\beta} = V(X V)^{\dagger} y\), where \(\dagger\) denotes the pseudo-inverse.
(On the other hand, the least-squares parameter vector is \(\hat{\beta}_\textrm{LS} = X^{\dagger} y\).)</p>
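<p>A minimal numpy sketch of the two estimators (my own illustration, with made-up dimensions) makes the difference concrete; the only change in PCR is that the design matrix is first projected onto the top eigenvectors of \(\Sigma\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, p = 50, 30, 10

Sigma = np.diag(1.0 / np.arange(1, N + 1))        # a known diagonal covariance
X = rng.standard_normal((n, N)) @ np.sqrt(Sigma)  # rows x_i ~ N(0, Sigma)
y = X @ rng.standard_normal(N)

# collect the p eigenvectors of Sigma with largest eigenvalues as columns of V
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigh sorts ascending
V = eigvecs[:, np.argsort(eigvals)[::-1][:p]]     # N x p

beta_pcr = V @ np.linalg.pinv(X @ V) @ y          # PCR:           V (XV)^+ y
beta_ls = np.linalg.pinv(X) @ y                   # least squares: X^+ y

# sanity check: keeping all N directions recovers ordinary least squares
beta_pcr_full = eigvecs @ np.linalg.pinv(X @ eigvecs) @ y
```

<p>When \(p = N\), the projection is the identity and PCR coincides with minimum-norm least squares.</p>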
<p>The below image visualizes the differences between the least squares and PCR regression algorithms.
It shows a toy example where samples \((x, y)\) (in purple) vary greatly in one direction and not much at all in another direction.
PCR only considers the direction of maximum variance and rules the other out, while least squares considers all directions simultaneously.
Therefore, the hypotheses represented by the green hyperplanes look subtly different for each case.</p>
<p><img src="/assets/images/2021-09-11-xh19/vis.jpeg" alt="" /></p>
<p>Note that this formulation of PCR concerns an idealized setting.
Most regression tasks do not give the learner direct access to \(\Sigma\).
However, it’s possible that \(\Sigma\) could be separately estimated by some \(\hat{\Sigma}\) and then used for PCA.
The authors refer to this as “semi-supervised” because \(\Sigma\) can be estimated using only unlabeled samples, since none of the labels \(y\) are used in the approximation.
Due to the high cost of obtaining labeled data, a sufficient dataset for this kind of estimate may be significantly easier to obtain than a dataset for the general learning task.</p>
<h2 id="learning-model-and-assumptions">Learning model and assumptions</h2>
<p>They make several restrictive assumptions.
The main purpose of this paper is to construct instances where favorable over-parameterization occurs for PCR, rather than exhaustively catalogue when it must occur.</p>
<p>They assume the samples \(x_i\) have independent Gaussian components and that labels \(y_i = \langle x_i, \beta\rangle\) have no noise.
\(\Sigma\) is a diagonal matrix (which must be the case because of the independent components of each \(x_i\)) with entries \(\lambda_1 > \dots > \lambda_N > 0\).
Therefore, PCR will only use the first \(p\) diagonal entries of \(\Sigma\) and the reduced-dimension version of each sample will merely be its first \(p\) entries.</p>
<p>One weird thing about this paper relative to others is that the true parameter vector \(\beta\) is chosen randomly.
This means it’s an “average-case” bound.
They justify this on the grounds that the ability to choose an arbitrary \(\beta\) could lead to all of the weight being put on the \(N-p\) components that will not be included in the PCA’d version of \(X\).
This would make it impossible to have non-trivial error bounds.</p>
<h2 id="over-parameterization-and-pcr">Over-parameterization and PCR</h2>
<p>Now, we have three parameters to consider (\(N, p, n\)), rather than the two (\(p, n\)) typically considered in the previous works on over-parameterization.
As before, they think of over-parameterization as the ratio \(\gamma = \frac{p}{n}\), but they must also contend with the ratios \(\alpha = \frac{p}{N}\) (the fraction of dimensions preserved by PCA) and \(\rho = \frac{n}{N}\) (the ratio of samples to original dimension).</p>
<p>Like <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>, they consider what happens when \(N, p, n \to \infty\) and the ratios remain fixed.
Like BLLT19, their results study how over-parameterization is affected as the eigenvalues of \(\Sigma\) change.
In Section 2, they focus on eigenvalues \(\lambda_1, \dots, \lambda_N\) that decay predictably at a polynomial rate.
Theorems 1 and 2/3 characterize what happens to the expected error in the under-parameterized (\(\gamma \leq 1\)) and over-parameterized (\(\gamma > 1\)) regimes, respectively.</p>
<ul>
<li>Theorem 1 shows that the shape of the “classical” regime error curve is preserved in the under-parameterized regime: for fixed \(\rho\), the error decreases as \(\alpha\) increases up to a point, after which it increases until \(\alpha = \rho\) (equivalently, \(p = n\)).</li>
<li>Theorem 2 shows that the expected error in the interpolation regime \(p > n\) converges to some fixed risk quantity, which can be determined by evaluating an integral and solving for some quantity.</li>
<li>Theorem 3 shows that for any polynomial rate of decay of the eigenvalues, double-descent will occur and the best interpolating prediction rule will perform better than the best “classical” prediction rule.
In the noisy setting, the best interpolating prediction rule will only outperform the best classical rule in the event that the rate of decay is no faster than \(\frac{1}{i}\).</li>
</ul>
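<p>A toy simulation of this idealized setup (my own, with small fixed sizes rather than the paper’s proportional asymptotics) reproduces the qualitative picture: with the slowly decaying eigenvalues \(\lambda_i = 1/i\) and noiseless labels, the risk of PCR spikes at the interpolation threshold \(p = n\) and comes back down in the over-parameterized regime:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, trials = 100, 40, 50
lam = 1.0 / np.arange(1, N + 1)      # slowly decaying eigenvalues of Sigma
ps = [5, 10, 20, 30, 40, 60, 80, 100]
risks = dict.fromkeys(ps, 0.0)

for _ in range(trials):
    beta = rng.standard_normal(N)
    beta /= np.linalg.norm(beta)                 # random true parameter
    X = rng.standard_normal((n, N)) * np.sqrt(lam)
    y = X @ beta                                 # noiseless labels
    for p in ps:
        bhat = np.zeros(N)
        # PCR keeps the p largest-variance coordinates (min-norm fit)
        bhat[:p] = np.linalg.lstsq(X[:, :p], y, rcond=None)[0]
        # exact excess risk: sum_i lam_i * (bhat_i - beta_i)^2
        risks[p] += np.sum(lam * (bhat - beta) ** 2) / trials
```

<p>The discarded components act as effective noise for the regression on the kept components, which is what blows the risk up near \(p = n\).</p>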
<p>To recap, the optimal performance for PCR is obtained in the over-parameterized regime (with \(p > n\)) if and only if eigenvalues \(\lambda_1, \dots, \lambda_N\) decay slowly; rapid decay leads to optimality in the classical regime.
This echoes the results of BLLT19, which shows that too rapid a decay in eigenvalues causes poor performance in the over-parameterized regime (very-much-not-benign overfitting).
However, BLLT19 also requires that the rate of decay not be too slow, which is a non-issue in this regime.</p>
<p>One of the nice things about this paper–which will be expanded on in the weeks to come–is that it separates the number of parameters \(p\) from the dimension \(N\).
Talking about over-parameterization in linear regression is often awkward because the two quantities are coupled, and we are forced to ask whether favorable behavior in the over-parameterized regime is caused by the high dimension or the high parameter count.
We’ll further examine models with separate dimensions and parameter counts when we study random feature models.</p>Clayton SanfordHow many neurons are needed to approximate smooth functions? A summary of our COLT 2021 paper2021-08-15T00:00:00+00:002021-08-15T00:00:00+00:00http://blog.claytonsanford.com/2021/08/15/hssv21<p>In the past few weeks, I’ve written several summaries of others’ work on machine learning theory.
For the first time on this blog, I’ll discuss a paper I wrote, which was a collaboration with my advisors, <a href="http://www.cs.columbia.edu/~rocco/" target="_blank">Rocco Servedio</a> and <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>, and another Columbia PhD student, <a href="http://www.cs.columbia.edu/~emvlatakis/" target="_blank">Manolis Vlatakis-Gkaragkounis</a>.
It will be presented this week at <a href="http://learningtheory.org/colt2021/" target="_blank">COLT (Conference on Learning Theory) 2021</a>, which is happening in-person in Boulder, Colorado.
I’ll be there to discuss the paper and learn more about other work in ML theory.
(Hopefully, I’ll put up another blog post afterward about what I learned from my first conference.)</p>
<p>The paper centers on a question about neural network approximability; namely, how wide does a shallow neural network need to be to closely approximate certain kinds of “nice” functions?
This post discusses what we prove in the paper, how it compares to previous work, why anyone might care about this result, and why our claims are true.
The post is not mathematically rigorous, and it gives only a high-level idea about why our proofs work, focusing more on pretty pictures and intuition than the nuts and bolts of the argument.</p>
<p>If this interests you, you can check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> to learn more about the ins and outs of our work.
There are also two talks—a 90-second teaser and a 15-minute full talk—and a comment thread available on the <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">COLT website</a>.
This blog post somewhat mirrors the longer talk, but the post is a little more informal and a little more in-depth.</p>
<p>On a personal level, this is my first published computer science paper, and the first paper where I consider myself the primary contributor to all parts of the results.
I’d love to hear what you think about this—questions, feedback, possible next steps, rants, anything.</p>
<h2 id="i-whats-this-paper-about">I. What’s this paper about?</h2>
<h3 id="a-broad-background-on-neural-nets-and-deep-learning">A. Broad background on neural nets and deep learning</h3>
<p>As I discuss in the <a href="/2021/07/04/candidacy-overview.html" target="_blank">overview post for my series on over-parameterized ML models</a>, the practical success of deep learning is poorly understood from a mathematical perspective.
Trained neural networks exhibit incredible performance on tasks like image recognition, text generation, and protein folding analysis, but there is no comprehensive theory of why their performance is so good.
I often think about three different kinds of questions about neural network performance that need to be answered.
I’ll discuss them briefly below, even if only the first question (approximation) is relevant to the paper at hand.</p>
<ol>
<li>
<p><strong>Approximation:</strong> A neural network is a type of mathematical function that can be represented as a hierarchical arrangement of artificial neurons, each of which takes as input the outputs of previous neurons, combines them together, and returns a new signal. These neurons are typically arranged in <em>layers</em>, where the number of neurons per layer is referred to as the <em>width</em> and the number of layers is the <em>depth</em>.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn.jpeg" alt="" /></p>
<p>Mathematically, each neuron is a function of the outputs of neurons in a previous layer. If we let \(x_1,x_2, \dots, x_r \in \mathbb{R}\) be the outputs of the \(L\)th layer, then we can define a neuron in the \((L+1)\)th layer as \(\sigma(b + \sum_{i=1}^r w_i x_i)\) where \(b \in \mathbb{R}\) is a <em>bias</em>, \(w \in \mathbb{R}^r\) is a weight vector, and \(\sigma: \mathbb{R} \to \mathbb{R}\) is a nonlinear <em>activation function</em>.
If the parameters \(w\) and \(b\) are carefully selected for every neuron, then many layers of these neurons allow for the representation of complex prediction rules.</p>
<p>For instance, if I wanted a neural network to distinguish photos of cats from dogs, the neural network would represent a function mapping the pixels from the input image (which can be viewed as a vector) to a number that is 1 if the image contains a dog and -1 if the image has a cat. Typically, each neuron will correspond to some kind of visual signal, arranged hierarchically based on the complexity of the signal. For instance, a low-level neuron might detect whether a region of the image contains parallel lines. A mid-level neuron may correspond to a certain kind of fur texture, and a high-level neuron could identify whether the ears are a certain shape.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn-cat.jpeg" alt="" /></p>
<p>This opens up questions about the expressive properties of neural networks: What kinds of functions can they represent and what kinds can’t they? Does there have to be some kind of “niceness” property of the “pixels to cat” map in order for it to be expressed by a neural network? And how large does the neural network need to be in order to express some kind of function? How does increasing the width increase the expressive powers of the network? How about the depth?</p>
<p><em>This paper asks questions like these about a certain family of shallow neural networks. We focus on abstract mathematical functions—there will be no cats or dogs here—but we believe that this kind of work will better help us understand why neural networks work as well as they do.</em></p>
</li>
<li>
<p><strong>Optimization:</strong> Just because there exists a neural network that can represent the prediction rule you want doesn’t mean it’s possible to algorithmically find that function. The \(w\) and \(b\) parameters for each neuron cannot be feasibly hard-coded by a programmer due to the complexity of these kinds of functions. Therefore, we instead <em>learn</em> the parameters by making use of training data.</p>
<p>To do so, a neural network is initialized with random parameter choices. Then, given \(n\) <em>training samples</em> (in our case, labeled images of cats and dogs), the network tunes the parameters in order to come up with a function that predicts correctly on all of the samples. This procedure involves using an optimization algorithm like <em>gradient descent</em> (GD) or <em>stochastic gradient descent</em> (SGD) to tune a good collection of parameters.</p>
<p>However, there’s no guarantee that such an algorithm will be able to find the right parameter settings.
GD and SGD work great in practice, but they’re only guaranteed to work for a small subset of optimization problems, such as <em>convex</em> problems.
The training loss of neural networks is non-convex and isn’t one of the problems that can be provably solved with GD or SGD; thus, there’s no guarantee of convergence here.</p>
<p><em>There’s lots of interesting work on optimization, but I don’t really go into it in this blog.</em></p>
</li>
<li>
<p><strong>Generalization:</strong> I’ll be brief about this, since I discuss it a lot more in <a href="/2021/07/04/candidacy-overview.html" target="_blank">my series on over-parameterized ML models</a>. Essentially, it’s one thing to come up with a function that can correctly predict the labels of fixed training samples, but it’s another entirely to expect the prediction rule to <em>generalize</em> to new data that hasn’t been seen before.</p>
<p>The ML theory literature has studied the problem of generalization extensively, but most of the theory about this focuses on simple settings, where the number of parameters \(p\) is much smaller than the number of samples \(n\). Neural networks often live in the opposite regime; these complex and hierarchical functions often have \(p \gg n\), which means that classical statistical approaches to generalization don’t predict that neural networks will perform well.</p>
<p><em>Many papers have tried to explain why over-parameterized models exceed expectations in practice, and I discuss some of those in my other series. But again, this paper does not go into this.</em></p>
</li>
</ol>
<h3 id="b-more-specific-context-on-approximation">B. More specific context on approximation</h3>
<p>As mentioned above, this paper (and hence this post) focuses on the first question of approximation. In particular, it discusses the representational power of a certain family of shallow neural networks. (Typically, “shallow” means depth-2—or one-hidden layer—and “deep” means any networks of depth 3 or more.)</p>
<p>There’s a well-known result about depth-2 networks that we build on: The <em>Universal Approximation Theorem</em>, which states that for any continuous function \(f\), there exists some depth-2 network \(g\) that closely approximates \(f\). (We’ll define “closely approximates” later on.)
Three variants of this result were proved in 1989 by <a href="https://www.sciencedirect.com/science/article/abs/pii/0893608089900038" target="_blank">three</a> <a href="https://www.semanticscholar.org/paper/Multilayer-feedforward-networks-are-universal-Hornik-Stinchcombe/f22f6972e66bdd2e769fa64b0df0a13063c0c101" target="_blank">different</a> <a href="https://link.springer.com/article/10.1007/BF02551274" target="_blank">papers</a>.
Here’s a <a href="http://neuralnetworksanddeeplearning.com/chap4.html" target="_blank">blog post</a> that gives a nice explanation of why these universal approximation results are true.</p>
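<p>The constructive intuition is easy to see in one dimension: a depth-2 ReLU network computes a piecewise-linear function, so a wide enough network can trace a fine piecewise-linear interpolant of any continuous target. A sketch (mine; the 1989 papers use general sigmoidal activations, but the idea is the same):</p>

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(target, knots):
    """Width-(m-1) depth-2 ReLU net matching `target` at the m `knots`:
    g(x) = target(t_0) + sum_i c_i * relu(x - t_i)."""
    vals = target(knots)
    slopes = np.diff(vals) / np.diff(knots)
    c = np.diff(slopes, prepend=0.0)   # slope change contributed at each knot
    def g(x):
        x = np.asarray(x)[..., None]
        return vals[0] + relu(x - knots[:-1]) @ c
    return g

knots = np.linspace(0.0, 2 * np.pi, 50)
g = relu_interpolant(np.sin, knots)

xs = np.linspace(0.0, 2 * np.pi, 2001)
max_err = np.max(np.abs(g(xs) - np.sin(xs)))   # shrinks as the width grows
```

<p>The catch is exactly the one described above: driving the error down requires more knots, and hence more width.</p>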
<p>At first glance, it seems like this would close the question of approximation entirely; if a depth-2 neural network can express any kind of function, then there would be no need to question whether some networks have more approximation powers than others. However, the catch is that the Universal Approximation Theorem does not guarantee that \(g\) will be of a reasonable size; \(g\) could be an arbitrarily wide neural network, which obviously is a no-go in the real world where neural networks actually need to be computed and stored.</p>
<p>As a result, many follow-up papers have focused on the question about which kinds of functions can be <em>efficiently</em> approximated by certain neural networks and which ones cannot. By “efficient,” we mean that we want to show that a function can be approximated by a neural network with a size polynomial in the relevant parameters (the complexity of the function, the desired accuracy, the dimension of the inputs). We specifically <em>do not</em> want a function that requires size exponential in any of these quantities.</p>
<p><em>Depth-separation</em> is an area of study that has focused on studying the limitations of shallow networks compared to deep networks.</p>
<ul>
<li>A <a href="http://proceedings.mlr.press/v49/telgarsky16.html" target="_blank">2016 paper by Telgarsky</a> shows that there exist some very “bumpy” triangular functions that can be represented by neural networks of depth \(O(k^3)\) with polynomial width, but which require exponential width in order to be approximated by networks of depth \(O(k)\).</li>
<li>Papers by <a href="http://proceedings.mlr.press/v49/eldan16.html" target="_blank">Eldan and Shamir (2016)</a>, <a href="http://proceedings.mlr.press/v70/safran17a.html" target="_blank">Safran and Shamir (2017)</a>, and <a href="http://proceedings.mlr.press/v65/daniely17a.html" target="_blank">Daniely (2017)</a> exhibit functions that separate depth-2 from depth-3. That is, the functions can be approximated by polynomial-size depth-3 networks, but they require exponential width in order to be approximated by depth-2 networks.</li>
</ul>
<p>One thing that these papers have in common is that they all require one of two things.
Either (1) the function is a very “bumpy” one that is highly oscillatory, or (2) the depth-2 networks can partially approximate the function, but cannot approximate it to an extremely high degree of accuracy. A <a href="https://arxiv.org/abs/1904.06984" target="_blank">2019 paper by Safran, Eldan, and Shamir</a> noticed this and asked whether there exist “smooth” functions that have separation between depth-2 and depth-3. This question was inspirational for our work, which poses questions about the limitations of certain kinds of 2-layer neural networks.</p>
<h3 id="c-random-bottom-layer-relu-networks">C. Random bottom-layer ReLU networks</h3>
<p>We actually consider a slightly more restrictive model than depth-2 neural networks. We focus on <em>two-layer random bottom-layer (RBL) ReLU neural networks</em>. Let’s break that down into pieces:</p>
<ul>
<li>
<p>“two layer” means that the neural network has a single hidden layer and can be represented by the following function, for parameters \(u \in \mathbb{R}^r, b \in \mathbb{R}^{r}, w \in \mathbb{R}^{r \times d}\):</p>
\[g(x) = \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>\(r\) is the width of the network and \(d\) is the input dimension.</p>
</li>
<li>“random bottom-layer” means that \(w\) and \(b\) are randomly chosen and then fixed. That means that when trying to approximate a function, we can only tune \(u\). This is also called the <em>random feature model</em> in other papers.</li>
<li>“ReLU” refers to the <em>rectified linear unit</em> activation function, \(\sigma(z) = \max(0, z)\). This is a popular activation function in deep learning.</li>
</ul>
<p>The following graphic visually summarizes the neural network:</p>
<p><img src="/assets/images/2021-08-15-hssv21/rbl.jpeg" alt="" /></p>
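<p>To make the model concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of a two-layer RBL ReLU network: the bottom-layer parameters \(w, b\) are sampled once and frozen, and only the top-layer weights \(u\) are fit, here by least squares on samples of a smooth target. The Gaussian choice for \(w, b\) is purely for illustration; the distribution \(\mathcal{D}\) used in the proofs is different.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def rbl_features(x, w, b):
    # ReLU feature map: phi(x)_i = relu(<w_i, x> + b_i); x has shape (n, d).
    return np.maximum(0.0, x @ w.T + b)

# Sample and freeze the random bottom layer (w, b). Gaussian is an
# illustrative choice, not the parameter distribution from the paper.
d, r, n = 2, 200, 1000
w = rng.standard_normal((r, d))
b = rng.standard_normal(r)

# Fit only the top-layer weights u by least squares on a smooth target f.
x = rng.uniform(-1.0, 1.0, size=(n, d))
f = lambda z: np.cos(np.pi * z[:, 0])    # a smooth ridge target function
u, *_ = np.linalg.lstsq(rbl_features(x, w, b), f(x), rcond=None)

def g(z):
    # The trained RBL network: g(z) = sum_i u_i * relu(<w_i, z> + b_i).
    return rbl_features(z, w, b) @ u
```

<p>Note that only \(u\) is ever optimized, which is exactly the random feature model mentioned above.</p>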
<p>Why do we focus on this family of neural networks?</p>
<ol>
<li>Any positive approximation results about this model also apply to arbitrary networks of depth 2. That is, if we want to show that a function can be efficiently approximated by a depth-2 ReLU network, it suffices to show that it can be efficiently approximated by a depth-2 <em>RBL</em> ReLU network. (This does not hold in the other direction; there exist functions that can be efficiently approximated by depth-2 ReLU networks that <em>cannot</em> be approximated by depth-2 RBL ReLU nets.)</li>
<li>According to papers like <a href="https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html" target="_blank">Rahimi and Recht (2008)</a>, kernel functions can be approximated with random feature models. This means that our result can also be used to comment on the approximation powers of kernels, which Daniel discusses <a href="https://www.cs.columbia.edu/~djhsu/papers/dimension-argument.pdf" target="_blank">here</a>.</li>
<li>Recent research on the <em>neural tangent kernel (NTK)</em> studies the optimization and generalization powers of randomly-initialized neural networks that do not stray far from their initialization during training. The question of optimizing two-layer neural networks in this regime is then similar to the question of optimizing linear combinations of random features. Thus, the approximation properties proven here carry over to that kind of analysis. Check out papers by <a href="https://arxiv.org/abs/1806.07572" target="_blank">Jacot, Gabriel, and Hongler (2018)</a> and <a href="https://arxiv.org/abs/2002.04486" target="_blank">Chizat and Bach (2020)</a> to learn more about this model.</li>
</ol>
<p>Now, we jump into the specifics of our paper’s claims. Later, we’ll give an overview of how those claims are proven and discuss some broader implications of these results.</p>
<h2 id="ii-what-are-the-specific-claims">II. What are the specific claims?</h2>
<p>The key results in our paper are corresponding upper and lower bounds:</p>
<ul>
<li>If the function \(f: \mathbb{R}^d \to \mathbb{R}\) is either “smooth” or low-dimensional, then it’s “easy” to approximate \(f\) with some RBL ReLU network \(g\). (The upper bound.)</li>
<li>If \(f\) is both “bumpy” and high-dimensional, then it’s “hard” to approximate \(f\) with some RBL ReLU net \(g\). (The lower bound.)</li>
</ul>
<p>All of this is formalized in the next few paragraphs.</p>
<h3 id="a-notation">A. Notation</h3>
<p><strong>What do we mean by a “smooth” or “bumpy” function?</strong> As discussed earlier, works on depth separation frequently exhibit functions that require exponential width to be approximated by depth-2 neural networks. However, these functions are highly oscillatory and hence very steep. We quantify this smoothness by using the Lipschitz constant of a function \(f\). \(f\) has Lipschitz constant \(L\) if for all \(x, y \in \mathbb{R}^d\), we have \(\lvert f(x) - f(y)\rvert \leq L \|x - y\|_2\). This bounds the slope of the function and prevents \(f\) from rapidly changing value. Therefore, a function can only be high-frequency (and bounce back and forth rapidly between large and small values) if it has a large Lipschitz constant.</p>
<p>We also quantify smoothness using the Sobolev class of a function in the appendix of our paper. We provide very similar bounds for this case, but we don’t focus on them in this post.</p>
<p><strong>What does it mean to be easy to approximate?</strong> We consider an \(L_2\) notion of approximation over the solid cube \([-1, 1]^d\). That is, we say that \(g\) <em>\(\epsilon\)-approximates</em> \(f\) if</p>
\[\|g - f\|_2 = \sqrt{\mathbb{E}_{x \sim \text{Unif}([-1, 1]^d)}[(g(x) - f(x))^2]} \leq \epsilon.\]
<p>Notably, this is a <em>weaker</em> notion of approximation than the \(L_\infty\) bounds that are used in other papers. If \(f\) can be \(L_\infty\)-approximated, then it can also be \(L_2\)-approximated.</p>
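<p>As a concrete illustration (my own sketch), this \(L_2\) distance can be estimated by Monte Carlo sampling from \(\text{Unif}([-1, 1]^d)\):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_distance(f, g, d, n_samples=100_000, rng=rng):
    # Monte Carlo estimate of ||f - g||_2 = sqrt(E_x[(f(x) - g(x))^2])
    # with x drawn uniformly from the solid cube [-1, 1]^d.
    x = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    return np.sqrt(np.mean((f(x) - g(x)) ** 2))

# Sanity check: shifting a function by a constant c gives distance |c|.
f = lambda x: np.sin(x[:, 0])
g = lambda x: np.sin(x[:, 0]) + 0.5
dist = l2_distance(f, g, d=3)   # 0.5
```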
<p><strong>What does it mean to be easy to approximate <em>with an RBL ReLU function</em>?</strong>
Since we let \(g\) be an RBL ReLU network that has random weights, we need to incorporate that randomness into our definition of approximation. To do so, we say that we can approximate \(f\) with an RBL network of width \(r\) if with probability \(0.5\), there exists some \(u \in \mathbb{R}^r\) such that the RBL neural network \(g\) with parameters \(w, b, u\) can \(\epsilon\)-approximate \(f\).
The probability is over random parameters \(w\) and \(b\) drawn from some distribution \(\mathcal{D}\).
We define the <em>minimum width</em> needed to approximate \(f\) with respect to \(\epsilon\) and \(\mathcal{D}\) as the smallest such \(r\).</p>
<p>(The paper also includes \(\delta\), which corresponds to the probability of success. For simplicity, we leave it out and take \(\delta = 0.5\).)</p>
<p>We’re now ready to give our two main theorems.</p>
<h3 id="b-the-theorems">B. The theorems</h3>
<p><em><strong>Theorem 1 [Upper Bound]:</strong> For any \(L\), \(d\), \(\epsilon\), there exists a symmetric parameter distribution \(\mathcal{D}\) such that the minimum width needed to \(\epsilon\)-approximate any \(L\)-Lipschitz function \(f: \mathbb{R}^d \to \mathbb{R}\) is at most</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{O(1)}.\]
<p>The term in this bound can also be written as</p>
\[\exp\left(O\left(\min\left(d \log\left(\frac{L^2}{\epsilon^2 d}+ 2\right), \frac{L^2}{\epsilon^2} \log\left(\frac{d\epsilon^2}{L^2} + 2\right)\right)\right)\right).\]
<p><em><strong>Theorem 2 [Lower Bound]:</strong> For any \(L\), \(d\), \(\epsilon\) and any symmetric parameter distribution \(\mathcal{D}\), there exists an \(L\)-Lipschitz function \(f\) whose minimum width is at least</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{\Omega(1)}.\]
<p>Thus, the key take-away is that our upper and lower bounds are matching up to a polynomial factor:</p>
<ul>
<li>When the dimension \(d\) is constant, then both terms are polynomial in \(\frac{L}{\epsilon}\), which means that any \(L\)-Lipschitz \(f\) can be efficiently \(\epsilon\)-approximated.</li>
<li>When the smoothness-to-accuracy ratio \(\frac{L}{\epsilon}\) is constant, then both terms are polynomial in \(d\), so \(f\) is again efficiently approximable.</li>
<li>When \(d = \Theta(L / \epsilon)\), then both terms are exponential in \(d\), so efficient approximation is impossible.</li>
</ul>
<p>These back up our high-level claim from before: efficient approximation of \(f\) with RBL ReLU networks is possible if and only if \(f\) is either smooth or low-dimensional.</p>
<p>Before explaining the proofs, we’ll give an overview about why these results are significant compared to previous works.</p>
<h3 id="c-comparison-to-previous-results">C. Comparison to previous results</h3>
<p>The approximation powers of shallow neural networks have been widely studied in terms of \(d\), \(\epsilon\), and smoothness measures (including Lipschitzness).
Our results are novel because they’re the first (as far as we know) to look closely at the interplay between these values and obtain nearly tight upper and lower bounds.</p>
<p>Papers that prove upper bounds tend to focus on either the low-dimensional case or the smooth case.</p>
<ul>
<li><a href="http://proceedings.mlr.press/v32/andoni14.html" target="_blank">Andoni, Panigrahy, Valiant, and Zhang (2014)</a> show that degree-\(k\) polynomials can be approximated with RBL networks of width \(d^{O(k)}\). Because \(L\)-Lipschitz functions can be approximated by polynomials of degree \(O(L^2 / \epsilon^2)\), one can equivalently say that networks of width \(d^{O(L^2 / \epsilon^2)}\) are sufficient. This works great when \(L /\epsilon\) is constant, but the bounds are bad in the “bumpy” case where the ratio is large.</li>
<li>On the other hand, <a href="https://jmlr.org/papers/v18/14-546.html" target="_blank">Bach (2017)</a> shows \((L / \epsilon)^{O(d)}\)-width approximability results for \(L_\infty\). This is fantastic when \(d\) is small, but not in the high-dimensional case. (This \(L_\infty\) part is more impressive than our \(L_2\) bounds, which means that we don’t strictly improve upon this result in our domain.)</li>
</ul>
<p>Our results are the best of both worlds, since they trade off \(d\) versus \(L /\epsilon\). They also cannot be substantially improved upon because they are nearly tight with our lower bounds.</p>
<p>Our lower bounds are novel because they handle a broad range of choices for \(L/ \epsilon\) and \(d\).</p>
<ul>
<li>The limitations of 2-layer neural networks were studied in the 1990s by <a href="https://www.sciencedirect.com/science/article/pii/S0021904598933044" target="_blank">Maiorov (1999)</a>, and he proves bounds that look more impressive than ours at first glance, since he argues that width \(\exp(\Omega(d))\) is necessary for smooth functions. (He actually looks at Sobolev smooth functions, but the analysis could also be done for Lipschitz functions.) However, these bounds don’t necessarily hold for all choices of \(\epsilon\). Therefore, they don’t say anything about the regime where \(\frac{L}{\epsilon}\) is constant, where it’s impossible to prove a lower bound that’s exponential in \(d\).</li>
<li><a href="https://arxiv.org/abs/1904.00687" target="_blank">Yehudai and Shamir (2019)</a> show that \(\exp(d)\) width is necessary to approximate simple ReLU functions with RBL neural networks. However, their results require that the ReLU be a very steep one, with Lipschitz constant scaling polynomially with \(d\). Hence, this result also only covers the regime where \(\frac{L}{\epsilon}\) is large. Our bounds say something about functions of all levels of smoothness.</li>
</ul>
<p>Now, we’ll break down our argument on a high level, with the help of some pretty pictures.</p>
<h2 id="iii-why-are-they-true">III. Why are they true?</h2>
<p>Before giving the proofs, I’m going to restate the theorems in terms of a combinatorial quantity, \(Q_{k,d}\), which corresponds to the number of \(d\)-dimensional integer lattice points with \(L_2\) norm at most \(k\). That is,</p>
\[Q_{k,d} = \lvert\{K \in \mathbb{Z}^d: \|K\|_2 \leq k \} \rvert.\]
<p>As an example, \(Q_{4,2}\) can be visualized as the number of purple points in the below image:</p>
<p><img src="/assets/images/2021-08-15-hssv21/qkd.jpeg" alt="" width="50%" /></p>
<p>We can equivalently write the upper and lower bounds on the minimum width as \(Q_{2L/\epsilon, d}^{O(1)}\) and \(\Omega(Q_{L/18\epsilon, d})\) respectively. This combinatorial quantity turns out to be important for the proofs of both bounds.</p>
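<p>\(Q_{k,d}\) is easy to compute exactly by brute force for small \(k\) and \(d\); here is a short sketch (my own code, not from the paper):</p>

```python
import itertools

def Q(k, d):
    # Q_{k,d} = |{K in Z^d : ||K||_2 <= k}|: count integer lattice points
    # in the d-dimensional Euclidean ball of radius k, by enumeration.
    coords = range(-int(k), int(k) + 1)   # each coordinate lies in [-k, k]
    return sum(
        1
        for K in itertools.product(coords, repeat=d)
        if sum(c * c for c in K) <= k * k
    )

# In d = 1 the count is exactly 2k + 1; in d = 2 it grows roughly like pi*k^2.
q_1 = Q(4, 1)   # 9
q_2 = Q(4, 2)   # 49: the purple points in the figure above
```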
<p>A key building block for both proofs is an orthonormal basis. I define orthonormal bases in <a href="/2021/07/16/orthogonality.html" target="_blank">a different blog post</a> and explain why they’re useful there. If you aren’t familiar, check that one out. We use the following family of sinusoidal functions as a basis for the \(L_2\) Hilbert space on \([-1, 1]^d\) throughout:</p>
\[\mathcal{T} \approx \{T_K: x \mapsto \sqrt{2}\cos(\pi\langle K, x\rangle): K \in \mathbb{Z}^d\}.\]
<p><em>Note: This is an over-simplification of the family of functions to be easier to write down. Actually, half of the functions need to be sines instead of cosines. However, it’s a bit of a pain to formalize and you can see how it’s written up in the paper. I’m using the \(\approx\) symbol above because this is “morally” the same as the true family of functions, but a lot easier to write down.</em></p>
<p>This family of functions has several properties that are very useful for us:</p>
<ul>
<li>
<p>The functions are orthonormal with respect to the Hilbert space for the \(L_2\) space over the uniform distribution on \([-1, 1]^d\). That is, for all \(K, K' \in \mathbb{Z}^d\),</p>
\[\langle T_K, T_{K'}\rangle = \mathbb{E}_{x}[T_K(x)T_{K'}(x)] = \begin{cases}1 & K = K' \\ 0 & \text{otherwise.} \\ \end{cases}\]
</li>
<li>The functions span the Hilbert space \(L_2([-1,1]^d)\). Put together with the orthonormality, \(\mathcal{T}\) is an orthonormal basis for \(L_2([-1,1]^d)\).</li>
<li>The Lipschitz constant of each of these functions is bounded. Specifically, the Lipschitz constant of \(T_K\) is at most \(\sqrt{2} \pi \|K\|_2\).</li>
<li>The derivative of each function in \(\mathcal{T}\) is also a function that’s contained in \(\mathcal{T}\) (if you include the sines too).</li>
<li>All elements of \(\mathcal{T}\) are ridge functions. That is, they can each be written as \(T_K(x) = \phi(\langle v, x \rangle)\) for some \(\phi:\mathbb{R}\to \mathbb{R}\) and \(v \in \mathbb{R}^d\). The function depends only on one direction in \(\mathbb{R}^d\) and is intrinsically one-dimensional. This will be important for the upper bound proof.</li>
<li>If we let \(\mathcal{T}_k = \{T_K \in \mathcal{T}: \|K\|_2 \leq k\}\), then \(\lvert\mathcal{T}_k\rvert = Q_{k,d}\).</li>
</ul>
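<p>The first property is easy to check numerically. Here is a quick sanity check (my own sketch, not from the paper) in one dimension, where \(T_k(x) = \sqrt{2}\cos(\pi k x)\):</p>

```python
import numpy as np

# Check E_{x ~ Unif([-1,1])}[T_k(x) T_{k'}(x)] = 1 if k = k', else 0,
# for T_k(x) = sqrt(2) cos(pi k x), using a dense uniform grid.
x = np.linspace(-1.0, 1.0, 200_001)

def T(k):
    return np.sqrt(2.0) * np.cos(np.pi * k * x)

def inner(k1, k2):
    # Grid approximation of the expectation under Unif([-1, 1]).
    return np.mean(T(k1) * T(k2))

g11 = inner(1, 1)   # ~1
g12 = inner(1, 2)   # ~0
g23 = inner(2, 3)   # ~0
g33 = inner(3, 3)   # ~1
```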
<p>Now, we’ll use this basis to discuss our proof of the upper bound.</p>
<h3 id="a-upper-bound-argument">A. Upper bound argument</h3>
<p>The proof of the upper bound boils down to two steps. First, we show that the function \(f\) can be \(\frac{\epsilon}{2}\)-approximated by a low-frequency trigonometric polynomial (that is, a linear combination of sines and cosines in \(\mathcal{T}_k\) for some \(k = O(L / \epsilon)\)). Then, we show that this trigonometric polynomial can be \(\frac{\epsilon}{2}\)-approximated in turn by an RBL ReLU network.</p>
<p>For the first step—which corresponds to Lemma 7 of the paper—we apply the fact that \(f\) can be written as a linear combination of sinusoidal basis elements. That is,</p>
\[f(x) = \sum_{K \in \mathbb{Z}^d} \alpha_K T_K(x),\]
<p>where \(\alpha_K = \langle f, T_K\rangle\).
This means that \(f\) is a combination of sinusoidal functions pointing in various directions of various frequencies.
We show that for some \(k = O(L / \epsilon)\),</p>
\[P(x) := \sum_{K \in \mathbb{Z}^d, \|K\|_2 \leq k} \alpha_K T_K(x)\]
<p>satisfies \(\|P - f\|_2 \leq \frac{\epsilon}{2}\).
To do so, we show that all \(\alpha_K\) terms for \(\|K\|_2 > k\) are very close to zero in the proof of Lemma 8.
The argument centers on the idea that if \(\alpha_K\) is large for large \(\|K\|_2\), then \(f\) is heavily influenced by a high-frequency sinusoidal function, which means that \(\|\nabla f(x)\|\) must be large at some \(x\).
However, \(\|\nabla f(x)\| \leq L\) by our smoothness assumption on \(f\), so overly large values of \(\alpha_K\) would contradict this.</p>
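<p>Here is a one-dimensional numerical sketch (my own illustration, not from the paper) of this truncation idea, using the 1-Lipschitz function \(f(x) = \lvert x \rvert\): the coefficients of high-frequency basis elements are tiny, so a low-frequency trigonometric polynomial already approximates \(f\) well in \(L_2\).</p>

```python
import numpy as np

# Dense grid approximation of the uniform measure on [-1, 1].
x = np.linspace(-1.0, 1.0, 400_001)
f = np.abs(x)   # a 1-Lipschitz target function

def T(k):
    # 1-d cosine basis: T_0 = 1, T_k(x) = sqrt(2) cos(pi k x) for k >= 1.
    return np.sqrt(2.0) * np.cos(np.pi * k * x) if k > 0 else np.ones_like(x)

# Coefficients alpha_k = <f, T_k> = E[f(x) T_k(x)].
alpha = np.array([np.mean(f * T(k)) for k in range(0, 30)])

def trunc_error(K):
    # L2 error of the truncated series P(x) = sum_{k <= K} alpha_k T_k(x).
    P = sum(alpha[k] * T(k) for k in range(0, K + 1))
    return np.sqrt(np.mean((P - f) ** 2))

err_small_k = trunc_error(2)
err_large_k = trunc_error(20)   # much smaller: high-frequency mass is tiny
```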
<p>For the second part, we show that \(P\) can be approximated by a linear combination of random ReLUs.
To do so, we express \(P\) as a <em>superposition</em> of, or expectation over, random ReLUs.
We show that there exists some parameter distribution \(\mathcal{D}\) (which depends on \(d, L, \epsilon\), but not on \(f\)) and some bounded function \(h(b, w)\) (which <em>can</em> depend on \(f\)) such that</p>
\[P(x) = \mathbb{E}_{(b, w) \sim \mathcal{D}}[h(b, w)\sigma(\langle w, x\rangle + b)].\]
<p>However, it’s not immediately clear how one could find \(h\) and why one would know that \(h\) is bounded.
To find \(h\), we take advantage of the fact that \(P\) is a linear combination of trigonometric sinusoidal ridge functions by showing that every \(T_K\) can be expressed as a superposition of ReLUs and combining those to get \(h\).
The “ridge” part is key here; because each \(T_K\) is effectively one-dimensional, it’s possible to think of it being approximated by ReLUs, as visualized below:</p>
<p><img src="/assets/images/2021-08-15-hssv21/cos.jpeg" alt="" /></p>
<p>Each function \(T_K\) can be closely approximated by a piecewise-linear ridge function, since it has bounded gradients and because it only depends on \(x\) through \(\langle K, x\rangle\).
Therefore, \(T_K\) can also be closely approximated by a linear combination of ReLUs, because those can easily approximate piecewise linear ridge functions.
This makes it possible to represent each \(T_K\) as a superposition of ReLUs, and hence \(P\) as well.</p>
<p>Now, \(f\) is closely approximated by \(P\), and \(P\) can be written as a bounded superposition of ReLUs.
We want to show that \(P\) can be approximated by a linear combination of a <em>finite and bounded</em> number of random ReLUs, not an infinite superposition of them.
This last step requires sampling \(r\) sets of parameters \((b^{(i)}, w^{(i)}) \sim \mathcal{D}\) for \(i \in \{1, \dots, r\}\) and letting</p>
\[g(x) := \frac{1}{r} \sum_{i=1}^r h(b^{(i)}, w^{(i)}) \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>When \(r\) is large enough, \(g\) is a 2-layer RBL ReLU network that becomes a very close approximation to \(P\), which means it’s also a great approximation to \(f\).
Such a sufficiently large \(r\) can be quantified with the help of standard concentration bounds for Hilbert spaces.
This wraps up the upper bound.</p>
<h3 id="b-lower-bound-argument">B. Lower bound argument</h3>
<p>For the lower bounds, we want to show that for any bottom-layer parameters \((b^{(i)}, w^{(i)})\) for \(1 \leq i \leq r\), there exists some \(L\)-Lipschitz function \(f\) such that for any choice of top-layer \(u^{(1)}, \dots, u^{(r)}\):</p>
\[\sqrt{\mathbb{E}_x\left[\left(f(x) - \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)})\right)^2\right]} \geq \epsilon.\]
<p>This resembles a simpler linear algebra problem:
Fix any vectors \(v_1, \dots, v_r \in \mathbb{R}^N\).
\(\mathbb{R}^N\) has a standard orthonormal basis \(e_1, \dots, e_N\).
Under which circumstances is there some \(e_j\) that cannot be closely approximated by any linear combination of \(v_1, \dots, v_r\)?</p>
<p>It turns out that when \(N \gg r\) there can be no such approximation.
This follows by a simple dimensionality argument.
The span of \(v_1, \dots, v_r\) is a subspace of dimension at most \(r\).
Since \(r \ll N\), it makes sense that an \(r\)-dimensional subspace cannot be close to all \(N\) orthonormal vectors, since those vectors live in a much higher-dimensional space and each is perpendicular to all the others.</p>
<p><img src="/assets/images/2021-08-15-hssv21/span.jpeg" alt="" /></p>
<p>For instance, the above image illustrates the claim for \(N = 3\) and \(r = 2\). While the span of \(v_1\) and \(v_2\) is close to \(e_1\) and \(e_2\), the vector \(e_3\) is far from that plane, and hence is inapproximable by linear combinations of the two.</p>
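<p>This dimensionality claim is also easy to check numerically. Below is a small sketch (my own illustration): the average squared projection of the \(N\) standard basis vectors onto a random \(r\)-dimensional subspace is exactly \(r/N\), so when \(r \ll N\) most basis vectors are nearly orthogonal to the subspace.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
N, r = 100, 5
V = rng.standard_normal((N, r))
Q_mat, _ = np.linalg.qr(V)           # orthonormal basis for span(v_1, ..., v_r)

# Row j of Q_mat holds the coordinates of e_j in the subspace basis, so
# entry j below is ||projection of e_j onto span(v_1, ..., v_r)||^2.
proj_sq = np.sum(Q_mat**2, axis=1)

avg = proj_sq.mean()                 # exactly r / N = 0.05
worst = proj_sq.min()                # some e_j is nearly orthogonal to the span
```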
<p>In our setting, we replace \(\mathbb{R}^N\) with the \(L_2\) Hilbert space over functions on \([-1, 1]^d\); \(v_1, \dots, v_r\) with \(x \mapsto \sigma(\langle w^{(1)}, x\rangle + b^{(1)}), \dots, x \mapsto \sigma(\langle w^{(r)}, x\rangle + b^{(r)})\); and \(\{e_1, \dots, e_N\}\) with \(\mathcal{T}_k\) for \(k = \Omega(L)\).
As long as \(Q_{k,d} \gg r\), then there is some \(O(\|K \|_2)\)-Lipschitz function \(T_K\) that can’t be approximated by linear combinations of ReLU features.
By the assumption on \(k\), \(T_K\) must be \(L\)-Lipschitz as well.</p>
<p>The dependence on \(\epsilon\) can be introduced by scaling \(T_K\) appropriately.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>To reiterate, our results show the capabilities and limitations of 2-layer random bottom-layer ReLU networks.
We show a careful interplay between the Lipschitz constant \(L\) of the function being approximated, the dimension \(d\), and the accuracy parameter \(\epsilon\).
Our bounds rely heavily on orthonormal functions.</p>
<p>Our results have some key limitations.</p>
<ul>
<li>Our upper bounds would be more impressive if they used the \(L_\infty\) notion of approximation, rather than \(L_2\). (Conversely, our lower bounds would be <em>less</em> impressive if they used \(L_\infty\) instead.)</li>
<li>The distribution over training parameters \(\mathcal{D}\) that we end up using for the upper bounds is contrived and depends on \(L, \epsilon, d\) (even if not on \(f\)).</li>
<li>Our bounds only apply when samples are drawn uniformly from \([-1, 1]^d\). (We believe our general approach will also work for the Gaussian probability measure, which we discuss at a high level in the appendix of our paper.)</li>
</ul>
<p>We hope that these limitations are addressed by future work.</p>
<p>Broadly, we think our paper fits into the literature on neural network approximation because it shows that the smoothness of a function is very relevant to its ability to be approximated by shallow neural networks.</p>
<ul>
<li>Our paper contributes to the question posed by <a href="https://arxiv.org/abs/1904.06984" target="_blank">SES19</a> (Are there any 1-Lipschitz functions that cannot be approximated efficiently by depth-2 but can by depth-3?) by showing that <em>all</em> 1-Lipschitz functions are efficiently approximable by depth-2 networks with respect to the \(L_2\) measure.</li>
<li>In addition, our results build on those of a recent paper by <a href="https://arxiv.org/abs/2102.00434" target="_blank">Malach, Yehudai, Shalev-Shwartz, and Shamir (2021)</a>, which suggests that the only functions that can be efficiently <em>learned</em> via gradient descent by deep networks are those that can be efficiently <em>approximated</em> by a shallow network. They show that the inefficient approximation of a function by depth-3 neural networks implies inefficient learning by neural networks of any depth; our results strengthen this to “inefficient approximation of a function by depth-<strong>2</strong> neural networks.”</li>
</ul>
<p>Thank you so much for reading this blog post! I’d love to hear about any thoughts or questions you may have. And if you’d like to learn more, check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> or <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">the talks</a>!</p>Clayton Sanford[OPML#5] BL20: Failures of model-dependent generalization bounds for least-norm interpolation2021-07-30T00:00:00+00:002021-07-30T00:00:00+00:00http://blog.claytonsanford.com/2021/07/30/bl20<p><em>This is the fifth of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<!-- [BL20](https://arxiv.org/abs/2010.08479){:target="_blank"} [[OPML#5]](/2021/07/30/bl20.html){:target="_blank"} -->
<p>I really enjoyed reading this paper, <a href="https://arxiv.org/abs/2010.08479" target="_blank">“Failures of model-dependent generalization bounds for least-norm interpolation,”</a> by Bartlett and Long. (The names are familiar from <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>.)
It follows in the vein of papers like <a href="https://arxiv.org/abs/1611.03530" target="_blank">ZBHRV17</a> and <a href="https://arxiv.org/abs/1902.04742" target="_blank">NK19</a>, which demonstrate the limitations of classical generalization bounds.</p>
<p>This work differs from the double-descent papers that have been previously reviewed on this blog, like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a> <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>.
These papers argue that there exist better bounds on generalization error for over-parameterized linear regression than the ones typically suggested by classical approaches like VC-dimension and Rademacher complexity.
However, they don’t <em>prove</em> that there cannot be better “classical” generalization bounds; they just show that the well-known bounds are inferior to their proposed bounds.
On the other hand, this paper proves that a broad family of traditional generalization bounds are unable to explain the phenomenon of the success of interpolating methods.</p>
<p>The gist of the argument is that it’s not sufficient to look at the number of samples and the complexity of the hypothesis to explain the success of interpolating models.
Successful bounds must take into account more information about the data distribution.
Notably, the bounds in BHX19, BLLT19, MVSS19, and HMRT19 all rely on properties of the data distribution, like the eigenvalues of the covariance matrix and the amount of additive noise in each label.
The current paper (BL20) posits that such tight bounds are impossible without access to this kind of information.</p>
<p>In this post, I present the main theorem and give a very hazy idea about why it works.
Let’s first make the learning problem precise.</p>
<h2 id="learning-problem">Learning problem</h2>
<ul>
<li>We have labeled data \((x, y) \in \mathbb{R}^d \times \mathbb{R}\) drawn from some distribution \(P\).
<ul>
<li>They restrict \(P\) to give it nice mathematical properties. Specifically, the inputs \(x \in \mathbb{R}^d\) must be drawn from a Gaussian distribution and \((x, y)\) must have subgaussian tails. We’ll call these “nice” distributions.</li>
</ul>
</li>
<li>Let the <em>risk</em> of some prediction rule \(h: \mathbb{R}^d \to \mathbb{R}\) be \(R_P(h) = \mathbb{E}_{x, y}[(y - h(x))^2]\).</li>
<li>Let \(R_P^*\) be the best risk over all \(h\).</li>
<li>The goal is to consider bounds on \(R_P(h) - R_P^*\), where \(h\) is a <em>least-norm interpolating</em> learning rule on \(n\) training samples.
<ul>
<li>i.e. \(h(x) = \langle x, \theta\rangle\) where \(\theta \in \mathbb{R}^d\) minimizes the least-squares error: \(\sum_{i=1}^n(\langle x_i, \theta\rangle - y_i)^2\). Ties are broken by choosing the \(\theta\) that minimizes \(\|\theta\|_2\). The interpolation regime occurs when the least-squares error is zero.</li>
</ul>
</li>
<li>We consider bounds \(\epsilon(h, n, \delta)\), such that \(R_P(h) - R_P^{*} \leq \epsilon(h, n, \delta)\) with probability \(1 - \delta\) over the \(n\) training samples from \(P\), for which \(h\) is least-norm interpolating.
<ul>
<li>Notably, these bounds cannot include any more information about the learning problem; these must hold for any distribution \(P\).</li>
<li>For the theorem to work, they restrict themselves to bounds that are <em>bounded antimonotonic</em>, which means that they cannot suddenly become much worse as the number of samples increases. (e.g. \(\epsilon(h, 2n, \delta)\) cannot be much larger than \(\epsilon(h, n, \delta)\).)</li>
</ul>
</li>
</ul>
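<p>As a concrete illustration (my own sketch, not code from the paper), the least-norm interpolant in the over-parameterized regime \(d > n\) is \(\theta = X^+ y\), computable with the Moore-Penrose pseudoinverse:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 100                       # over-parameterized: more features than samples
X = rng.standard_normal((n, d))      # rows are the training inputs x_i
y = rng.standard_normal(n)           # training labels

# Among all theta with X @ theta = y, the pseudoinverse picks the one
# of minimum L2 norm: the least-norm interpolating solution.
theta = np.linalg.pinv(X) @ y

residual = np.linalg.norm(X @ theta - y)   # ~0: theta interpolates the data
```

<p>In this regime the least-squares error is driven all the way to zero, and the \(\|\theta\|_2\)-minimizing tie-break is what the pseudoinverse computes.</p>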
<h2 id="the-result">The result</h2>
<p>Now, I give a rather hand-wavy paraphrase of the theorem:</p>
<p><em><strong>Theorem 1:</strong> Suppose \(\epsilon\) is a bound that depends on \(h\), \(n\), and \(\delta\) that applies to all nice distributions \(P\).
Then, for a “very large fraction” of values of \(n\) as \(n\) grows, there exists a distribution \(P_n\) such that</em></p>
\[\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \leq O(1 / \sqrt{n})] \geq 1 - \delta\]
<p><em>but</em></p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq \Omega(1)] \geq \frac{1}{2},\]
<p><em>where \(h\) is the least-norm interpolant of a set of \(n\) points drawn from \(P_n\). The probabilities above refer to randomness from the training sample drawn from \(P_n\).</em></p>
<p>Let’s break this down and talk about what it means.</p>
<p>The generalization bound \(\epsilon\) can depend on the minimum-norm interpolating prediction rule \(h\), the number of samples \(n\), and the confidence parameter \(\delta\).
It <em>cannot</em> depend on the distribution over samples \(P\), and it must apply to all such “nice” distributions.
This opens up the possibility that a satisfactory bound \(\epsilon\) could perform much better on some distributions than others.</p>
<ul>
<li>
<p>This result particularly applies to generalization bounds that make use of some property of the prediction rule \(h\). For instance, it demonstrates the limitations of <a href="https://ieeexplore.ieee.org/document/661502" target="_blank">this 1998 Bartlett paper</a>, which gives generalization bounds that are small when the parameters of \(h\) have small norms.</p>
</li>
<li>
<p>Note that this isn’t really talking about “traditional” capacity-based generalization bounds, like those that rely on <a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension" target="_blank">VC-dimension</a>. These capacity-based bounds are applied to the <em>hypothesis class</em> \(\mathcal{H}\) that contains \(h\), rather than the prediction rule \(h\) itself.</p>
<p>These kinds of bounds are already overly pessimistic in the over-parameterized regime, however. Measurements of the capacity of \(\mathcal{H}\)—like the VC-dimension and the <a href="https://en.wikipedia.org/wiki/Rademacher_complexity" target="_blank">Rademacher complexity</a>—will always lead to vacuous generalization bounds for interpolating classifiers because those bounds rely on limiting the expressive power of hypotheses in \(\mathcal{H}\). From the lens of capacity-based generalization approaches, overfitting is <em>always</em> bad, which makes a nontrivial analysis of interpolation methods impossible with these tools.</p>
</li>
</ul>
<p>\(\epsilon\) does indeed perform much better on some distributions than others. The meat and potatoes of the proof shows the existence of some nice distribution where a bound \(\epsilon\) necessarily underperforms, even though the minimum-norm interpolating solution actually has a small generalization error.</p>
<p><img src="/assets/images/2021-07-30-bl20/bound.jpeg" alt="" /></p>
<ul>
<li>
<p>The first inequality in the theorem demonstrates how the minimum-norm interpolating classifier does well.
This is represented by the true generalization errors lying below the green dashed line, which corresponds to the bound in the first inequality.
As \(n\) grows, the true generalization error approaches zero with high probability.</p>
</li>
<li>
<p>On the other hand, the underperformance is illustrated by the second inequality, which shows that the bound \(\epsilon\) often cannot guarantee that the generalization error is smaller than some constant as \(n\) becomes large.
As visualized above, the bound \(\epsilon\) (represented by red dots with a red line corresponding to the expected value of \(\epsilon\)) will most of the time (but not always) lie above the constant curve denoted by the dashed red line.
This isn’t great, because we should expect an abundance of training samples \(n\) to translate to an error bound that approaches zero as \(n\) approaches infinity.</p>
</li>
</ul>
<p>So far, nothing has been said about the dimension of the inputs, \(d\).
The authors define \(d\) within the context of the distributions \(P_n\) as roughly \(n^2\). Thus, \(d \gg n\) and this problem deals squarely with the over-parameterized regime.</p>
<p>To reiterate, the key takeaway here is that the data distribution is very important for evaluating whether successful generalization occurs.
Without knowledge of the data distribution, it’s impossible to give accurate generalization bounds for the over-parameterized case (\(d \gg n\)).</p>
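To make the object the theorem quantifies over concrete, here is a minimal numpy sketch of the least-norm interpolant \(h\); the sizes `n` and `d` are illustrative, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
d = n ** 2  # d >> n: the over-parameterized regime the theorem lives in

X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# The minimum-norm interpolant: among all theta with X @ theta = y,
# the pseudoinverse solution has the smallest Euclidean norm.
theta_hat = np.linalg.pinv(X) @ y

# Any other interpolant adds a null-space component, which only grows the norm.
v = rng.normal(size=d)
v -= np.linalg.pinv(X) @ (X @ v)  # project v onto the null space of X
other = theta_hat + v
```

Since `theta_hat` lies in the row space of `X` and `v` in its null space, `other` also interpolates the data but has a strictly larger norm.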
<h2 id="proof-ideas">Proof ideas</h2>
<p>The main strategy in this proof is to show the existence of a “good distribution” \(P_n\) and a “bad distribution” \(Q_n\) that are very similar, but where minimum-norm interpolation yields a much smaller generalization error on \(P_n\) than \(Q_n\).
This gap forces any valid generalization error bound \(\epsilon\) to be large, despite the fact that the minimum-norm interpolator has small generalization error for \(P_n\).</p>
<p>To satisfy the similarity requirement, \(P_n\) and \(Q_n\) must be indistinguishable with respect to \(h\).
Consider full training samples of \(n\) \(d\)-dimensional inputs and labels \((X_P, Y_P), (X_Q, Y_Q) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n\) drawn from the two respective distributions.
Then, the probability that \(h\) is the minimum-norm interpolator of \((X_P, Y_P)\) must be identical to the probability that it is the minimum-norm interpolator of \((X_Q, Y_Q)\).
If this is the case, then \(\epsilon\) must be defined to ensure that each of</p>
\[\epsilon(h, n, \delta) \geq R_{P_n}(h) - R_{P_n}^* \quad \text{and} \quad \epsilon(h, n, \delta)\geq R_{Q_n}(h) - R_{Q_n}^*\]
<p>hold with probability \(1 - \delta\).
This then means that it must be the case that for any \(t \in \mathbb{R}\):</p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq t] \geq \max(\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \geq t], \mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq t]).\]
<p>To prove the theorem, it suffices to show \(R_{P_n}(h) - R_{P_n}^*\) is very small and \(R_{Q_n}(h) - R_{Q_n}^*\) is large with high probability.
This forces \(\epsilon(h, n, \delta)\) to be large and \(R_{P_n}(h) - R_{P_n}^*\) to be small with high probability, which concludes the proof.</p>
<p>A key idea towards showing this gap between the generalization of \(P_n\) and \(Q_n\) is to define distributions that behave very differently in testing, despite being indistinguishable from the standpoint of training.
To implement this idea, \(Q_n\) will reuse samples in testing phase, while \(P_n\) will not.</p>
<p>Now, we define the two distributions, with the help of a third “helper” distribution \(D_n\).</p>
<h3 id="d_n-the-skewed-gaussian-distribution">\(D_n\): The skewed Gaussian distribution</h3>
<p>We draw an input \(x_i\) from the \(d\)-dimensional Gaussian distribution \(\mathcal{N}(0, \Sigma)\) with mean zero and diagonal covariance matrix \(\Sigma\) with</p>
\[\Sigma_{j,j} = \lambda_j = \begin{cases}
\frac{1}{81} & j = 1 \\
\frac{1}{d^2} & j > 1.
\end{cases}\]
<p>When \(d\) is large, this corresponds to a distribution where \(x_1\) will be very large relative to \(x_2, \dots, x_d\), which trend towards zero.
The label \(y_i\) is drawn by taking \(y_i = \langle x_i, \theta\rangle + \epsilon_i\), where \(\epsilon_i \sim \mathcal{N}(0, \frac{1}{81})\).
Thus, the noise is drawn at the scale of the dominant first coordinate.</p>
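As a concrete sketch (the dimensions, seed, and choice of \(\theta\) are all illustrative), sampling from \(D_n\) looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2500  # d ~ n^2, matching the paper's regime

# Diagonal covariance: one dominant coordinate, the rest vanishingly small.
lam = np.full(d, 1.0 / d ** 2)
lam[0] = 1.0 / 81

theta_star = np.zeros(d)
theta_star[0] = 1.0  # an arbitrary choice of the true parameter vector

X = rng.normal(size=(n, d)) * np.sqrt(lam)         # rows x_i ~ N(0, Sigma)
eps = rng.normal(scale=np.sqrt(1.0 / 81), size=n)  # noise at the scale of x_1
y = X @ theta_star + eps
```

The first coordinate has standard deviation \(1/9\) while every other coordinate sits at \(1/d\), so \(x_1\) dominates.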
<p><img src="/assets/images/2021-07-30-bl20/Dn.jpeg" alt="" /></p>
<p>We use this skewed distribution because it works beautifully with the bounds in the minimum-norm interpolant that are laid out in BLLT19.
Using notation from <a href="/2021/07/11/bllt19.html" target="_blank">my blog post on BLLT19</a>, we can characterize the effective dimensions \(r_k(\Sigma)\) and \(R_k(\Sigma)\), which yield clean risk bounds.</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} =
\begin{cases}
\frac{\frac{1}{81} + \frac{d-1}{d^2}}{\frac{1}{81}} = \Theta(1) & k = 0 \\
\frac{\frac{d-k}{d^2}}{\frac{1}{d^2}} = d-k & k > 0.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2} =
\begin{cases}
\frac{\left(\frac{1}{81} + \frac{d-1}{d^2}\right)^2}{\frac{1}{81^2} + \frac{d-1}{d^4}} = \Theta(1) & k = 0 \\
\frac{\left(\frac{d-k}{d^2}\right)^2}{\frac{d-k}{d^4}} = d-k & k > 0.
\end{cases}\]
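These effective dimensions are easy to evaluate numerically for the skewed spectrum above; a quick sketch (the value of `d` is arbitrary):

```python
import numpy as np

d = 10_000
lam = np.full(d, 1.0 / d ** 2)
lam[0] = 1.0 / 81  # the spectrum of D_n, in decreasing order

def r(k):
    # r_k(Sigma) = (sum_{j > k} lambda_j) / lambda_{k+1}; lam is 0-indexed
    return lam[k:].sum() / lam[k]

def R(k):
    # R_k(Sigma) = (sum_{j > k} lambda_j)^2 / sum_{j > k} lambda_j^2
    tail = lam[k:]
    return tail.sum() ** 2 / (tail ** 2).sum()
```

As the formulas above predict, `r(0)` and `R(0)` come out just above 1 (the \(\Theta(1)\) cases), while `r(1)` and `R(1)` both equal \(d - 1\).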
<p>By taking \(k^* = 1\) and applying the bound, then with high probability:</p>
\[R(\hat{\theta}) = O\left(\|\theta^*\|^2 \lambda_1\left( \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{r_0(\Sigma)}{n}\right) + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \right)\]
\[= O\left(\|\theta^*\|^2\left(\frac{1}{\sqrt{n}} + \frac{1}{n}\right) + \frac{1}{81}\left(\frac{1}{n} + \frac{n}{d-1}\right)\right).\]
<p>If we take \(d = n^2\), then this term trends towards zero at a rate of \(\frac{1}{\sqrt{n}}\) as \(n\) approaches infinity, which validates the kind of bound we’re looking at for \(P_n\).
(Note: \(d\) does not exactly equal \(n^2\) in the paper; there are a few more technicalities here that we’re glossing over.)</p>
<p>This gives us an example where minimum-norm interpolation does fantastically. However, it does not show why the generalization bound \(\epsilon(h, n, \delta)\) cannot be tight.
To do so, we define the actual two distributions we care about—\(Q_n\) and \(P_n\)—in terms of \(D_n\).</p>
<h3 id="q_n-poor-interpolation-from-sample-reuse">\(Q_n\): Poor interpolation from sample reuse</h3>
<p>The first confusing thing about \(Q_n\) is that it’s a random distribution.
That is, we can think of \(Q_n\) being drawn from a distribution over distributions \(\mathcal{Q}_n\), since it depends on a random sample from \(D_n\).</p>
<p>To define \(Q_n\), draw \(m = \Theta(n)\) independent samples \((x_i, y_i)_{i \in [m]}\) from \(D_n\).
\(Q_n\) will be supported on these \(m\) samples.</p>
<p><img src="/assets/images/2021-07-30-bl20/Qn1.jpeg" alt="" /></p>
<p>After fixing these samples, we can draw \((x, y)\) from \(Q_n\) by first uniformly selecting \(x\) from \(\{x_1, \dots, x_m\}\), the set of pre-selected points.
Then, we choose \(y\) using the same approach that we did for \(D_n\): \(y = \langle x, \theta\rangle + \epsilon\) for \(\epsilon \sim \mathcal{N}(0, \frac{1}{81})\).</p>
<p><img src="/assets/images/2021-07-30-bl20/Qn2.jpeg" alt="" /></p>
<p>What this means is that the training inputs \(x_i\) for \(i \in [n]\) will exactly reoccur in the expected risk, albeit with different labels \(y_i\).
This differs greatly from \(D_n\), where the continuity of the distribution over \(x_i\)’s ensures that the same exact sample would never realistically be chosen in “testing.”</p>
<p>The crux of the argument that \(Q_n\) is “bad” comes from Lemma 5, which suggests that least-norm interpolation will perform poorly on inputs \(x_i\) that show up exactly once in the training set.
When these are drawn again when computing the expected risk (with new labels), they’ll have substantially higher error than would a random input from \(D_n\).
This allows the authors to show that—for a proper choice of \(m\)—</p>
\[\mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq \Omega(1)] \geq \frac{1}{2}.\]
<p>Now, it only remains to show that \(Q_n\) is indistinguishable in the training phase from a “good” distribution that has low risk for least-norm interpolation.</p>
<p>\(D_n\) is good, but \(Q_n\) unfortunately cannot be contrasted to \(D_n\) in this manner.
Because \(D_n\) never repeats training samples, the two have somewhat different distributions over interpolators \(h\).
Instead, we define \(P_n\) in a slightly different way to have the nice interpolation properties of \(D_n\), while being identical to \(Q_n\) in the training phase.</p>
<h3 id="p_n-d_n-but-with-extra-samples">\(P_n\): \(D_n\) but with extra samples</h3>
<p>The idea with \(P_n\) is that it draws inputs \(x_i\) from \(D_n\), but that it will occasionally draw more than one copy of an input and average their labels \(y_i\) together to produce a new label.</p>
<p>This provides indistinguishability from \(Q_n\) in the training phase.
Both draw a collection of samples—with some of them appearing multiple times in the training set—and both minimum-norm interpolators will take these properties into account.
This indistinguishability is proved in Lemma 7 and relies on careful choices of the number of original samples \(m\) for \(Q_n\) and the number of repeated samples in \(P_n\). This idea is combined with Lemma 5 (which shows that \(Q_n\) has poor minimum-norm interpolation behavior) to show that \(\epsilon(h, n, \delta)\) cannot be small.</p>
<p>However, \(P_n\) is <em>not</em> a random distribution and it will <em>not</em> carry that repetition over to the “evaluation phase.”
The distribution used to evaluate risk—like \(D_n\) and unlike \(Q_n\)—will not contain any of the same \(x_i\)’s that were used in the training phase.
This causes the interpolation guarantees to be roughly the same as \(D_n\).
This gives the gap we’re looking for, which is formalized in Lemma 10.</p>
<p>Put together with Lemma 5, this gives the bound we’re looking for and concludes the story that the success (or lack thereof) of minimum-norm interpolation can only be understood by considering the data distribution, and <em>not</em> just the number of samples \(n\) and properties of the interpolants \(h\).</p>
<p><em>Thanks for reading the post! As always, I’d love to hear any thoughts and feedback. Writing these is very instructive for me to make sure I actually understand the ideas in these papers, and I hope they provide some value to you too.</em></p>Clayton SanfordThis is the fifth of a sequence of blog posts that summarize papers about over-parameterized ML models.[OPML#4] HMRT19: Surprises in high-dimensional ridgeless least squares interpolation2021-07-23T00:00:00+00:002021-07-23T00:00:00+00:00http://blog.claytonsanford.com/2021/07/23/hmrt19<!-- [HMRT19](https://arxiv.org/abs/1903.08560){:target="_blank"} [[OPML#4]](/2021/07/23/hmrt19.html){:target="_blank"} -->
<p><em>This is the fourth of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>This week’s <a href="https://arxiv.org/abs/1903.08560" target="_blank">paper</a>
is one by Hastie, Montanari, Rosset, and Tibshirani, which studies the cases in over-parameterized least-squares regression where the generalization error is small.
It follows in the vein of the papers reviewed so far (<a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, and <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a> <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>), which all present circumstances where such “benign overfitting” takes place in \(\ell_2\)-norm minimizing linear regression.</p>
<p>This summary will be a bit shorter than the previous ones, since a lot of the ideas here have already been discussed.
The paper is highly mathematically involved; it covers a lot of ground and gives theorems that are very general.
However, the core message about when it’s possible for favorable interpolation to occur is similar to that of BLLT19, so I’ll mainly focus on presenting the results of this paper on a high level and explaining the similarities between the two papers.</p>
<p>The paper is also nearly seventy pages long, and there’s a lot of interesting content about non-linear models and mis-specified models (which generalizes the case of double-descent considered in BHX19) that I won’t discuss for the sake of brevity.</p>
<p>The paper differs from BLLT19 because it considers a broader range of data distributions (e.g. samples \(x_i\) need not be drawn from probability distribution with subgaussian tails) and because it lies in an asymptotic regime.
Concretely, the three other papers previously considered give bounds in terms of the number of samples \(n\) and the number of parameters \(p\), where they are taken to be large, but not infinite.
Here, we instead fix some ratio \(\gamma = \frac{p}{n} > 1\) to represent how over-parameterized the model is and ask what happens when \(n, p \to \infty\).
This means that we’ll need to consider subtly different settings than I discussed in my post about <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, because some of those have \(p = \infty\) and another has \(p = \Theta(n \log n)\).
It’s necessary here to only consider numbers of parameters \(p\) that grow linearly with the number of samples \(n\).</p>
<h2 id="data-model">Data model</h2>
<p>The data model is mostly the same as the previous papers, minus the aforementioned differences in distributional assumptions and growth of \(n\) and \(p\).</p>
<p>We draw \(n\) random samples \((x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}\) where \(x_i\) is drawn from distribution with mean \(\mathbb{E}[x_i] = 0\), covariance \(\mathbb{E}[x_i x_i^T] = \Sigma\), and bounded low-order moments.
(This moment assumption is a weaker assumption to make than subgaussianity, which makes these results more impressive.)
For some parameter vector, \(\beta \in \mathbb{R}^p\) and random noise \(\epsilon_i\) with variance \(\sigma\), the label \(y_i\) is set by taking \(y_i = \langle x_i, \beta \rangle + \epsilon_i\).</p>
<p>For simplicity, we’ll assume (as we have before) that \(\Sigma\) is a diagonal matrix with entries \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0\).
That way, we can assume each coordinate \(x_{i,j}\) of \(x_i\) is drawn independently and we can consider how the output of least-squares regression is affected by the variances \(\lambda_1, \dots, \lambda_p\).
(The paper allows \(\Sigma\) to be any symmetric positive definite matrix, and it instead considers the output of least-squares regression in terms of the eigenvalues of \(\Sigma\), rather than the variances of each independent component.)</p>
<p>As mentioned above, for some fixed over-parameterization ratio \(\gamma > 1\), we’ll let \(p = \gamma n\) and let \(n \to \infty\).</p>
<p>Given a training sample collected in input matrix \(X\) and label vector \(y\), the solution to minimum-norm least-squares is the \(\hat{\beta} \in \mathbb{R}^p\) that minimizes \(\|\hat{\beta}\|_2\) and interpolates the training samples: \(X \hat{\beta} = y\).
The goal—like in other papers about over-parameterized least-squares regression—is to bound the expected squared risk of the prediction rule \(\hat{\beta}\) on a new sample \(x\):</p>
\[R_X(\hat{\beta}; \beta) = \mathbb{E}_{x}[(\langle x, \hat{\beta}\rangle - \langle x, \beta\rangle)^2].\]
<p>Like in BLLT19, the analysis works by decomposing this risk into a bias term \(B_X(\hat{\beta}; \beta)\) and a variance term \(V_X(\hat{\beta}; \beta)\).</p>
<h2 id="main-result">Main result</h2>
<p>Their main result is Theorem 2, which shows that as \(n\) and \(p\) become arbitrarily large, the bias \(B_X\) and variance \(V_X\) converge to <em>predicted bias</em> \(\mathscr{B}\) and <em>predicted variance</em> \(\mathscr{V}\).
For this bound to hold, they require that for some constant \(M\) that does not depend on \(n\), the largest component variance has \(\lambda_1 \leq M\) and the smallest has \(\lambda_p \geq \frac{1}{M}\).</p>
<p>This means that the variances cannot decay to zero like BLLT19 relies on!
It will still matter that some variances be significantly smaller than others, but not in the same way.</p>
<p>Now, I’ll define the predicted bias and variance and try to explain the intuition behind them:</p>
\[\mathscr{B}(\gamma) = \left(1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\right) \sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]
<p>and</p>
\[\mathscr{V}(\gamma) = \sigma^2 \gamma \mathbf{c_0} \frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}},\]
<p>where \(c_0\) depends on \(\gamma\) and satisfies</p>
\[1 - \frac{1}{\gamma} = \frac{1}{p} \sum_{j=1}^p \frac{1}{1 + \gamma c_0 \lambda_j}.\]
<p><em>Note: The predicted variance differs from the version presented in the paper.
I additionally include the bolded \(c_0\) term, which I suspect was left out as a typo.
In its current version, the bound in Theorem 2 is inconsistent with the specialized bound in Theorem 1, so I suspect that it was just an omission of a variable in the variance statement.</em></p>
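The fixed-point condition pins down \(c_0\) numerically: the right-hand side decreases monotonically in \(c_0\), so bisection works. Here is a sketch (function names are mine), which recovers the closed-form isotropic value \(c_0 = \frac{1}{\gamma(\gamma - 1)}\) discussed in the next subsection:

```python
import numpy as np

def solve_c0(gamma, lam, tol=1e-12):
    """Bisect for c0 in: 1 - 1/gamma = (1/p) * mean_j 1/(1 + gamma*c0*lam_j)."""
    target = 1.0 - 1.0 / gamma
    f = lambda c0: np.mean(1.0 / (1.0 + gamma * c0 * lam)) - target
    lo, hi = 0.0, 1.0
    while f(hi) > 0:  # grow the bracket until the RHS falls below the target
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def predicted_bias_variance(gamma, lam, beta, sigma):
    """The predicted bias and variance, including the extra c0 noted above."""
    c0 = solve_c0(gamma, lam)
    w = (1.0 + gamma * c0 * lam) ** 2
    ratio = np.sum(lam ** 2 / w) / np.sum(lam / w)
    B = (1.0 + gamma * c0 * ratio) * np.sum(beta ** 2 * lam / w)
    V = sigma ** 2 * gamma * c0 * ratio
    return B, V
```

On the isotropic spectrum \(\lambda_j = 1\) with \(\gamma = 2\), this gives \(c_0 = \frac{1}{2}\), \(\mathscr{B} = \frac{\|\beta\|^2}{2}\), and \(\mathscr{V} = \sigma^2\), matching the closed forms worked out below.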
<p>If you’re anything like me, you find these expressions a little terrifying and hard to understand.
Let’s break them down into pieces to try to grasp how the value of \(\gamma\) affects the risk as \(n\) and \(p\) become large.</p>
<p>The rough intuition for the impact of over-parameterization on these two terms is that growth of \(\gamma\) hurts bias and helps variance.
However, this doesn’t seem immediately obvious; indeed, the variance appears to <em>grow</em> as \(\gamma\) increases.
It’s necessary to understand the product \(\gamma c_0\) in order to get why this is the case.
We’ll first consider a simple isotropic case to understand what happens to that term, and then hand-wavily revisit the general case.</p>
<h3 id="isotropic-data">Isotropic data</h3>
<p>For simplicity, consider the <em>isotropic</em> or <em>spherical</em> case, where \(\Sigma = I_p\) and \(\lambda_1 = \dots = \lambda_p = 1\).
(“Isotropic” literally means “the same in every direction.”)
Then, taking \(c_0 := \frac{1}{\gamma(\gamma - 1)}\) satisfies the condition on \(c_0\).
Now, we can plug in \(c_0 \gamma = \frac{1}{\gamma - 1}\) into the expressions for predicted bias and predicted variance:</p>
\[\mathscr{B}(\gamma) = \left(1 + \frac{1}{\gamma - 1} \right) \frac{1}{(1 + \frac{1}{\gamma-1})^2} \sum_{j=1}^p \beta_j^2 = \frac{\|\beta\|_2^2}{1 + \frac{1}{\gamma -1}} = \frac{\|\beta\|_2^2( \gamma-1)}{\gamma}.\]
\[\mathscr{V}(\gamma) = \frac{\sigma^2}{\gamma - 1}.\]
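These closed forms are easy to sanity-check by simulation; here's a sketch with arbitrary choices of \(n\), noise level, and trial count (with \(\Sigma = I_p\), the excess risk is just \(\|\hat{\beta} - \beta\|_2^2\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n, gamma, sigma = 200, 2.0, 0.5
p = int(gamma * n)
beta = np.ones(p) / np.sqrt(p)  # ||beta||_2 = 1

risks = []
for _ in range(30):
    X = rng.normal(size=(n, p))                    # isotropic inputs, Sigma = I_p
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.pinv(X) @ y               # minimum-norm least squares
    risks.append(np.sum((beta_hat - beta) ** 2))   # risk when Sigma = I_p

empirical = np.mean(risks)
predicted = (gamma - 1) / gamma + sigma ** 2 / (gamma - 1)  # B + V = 0.75 here
```

For \(\gamma = 2\) and \(\sigma = 0.5\) the prediction is \(0.5 + 0.25 = 0.75\), and the simulated average lands close to it already at \(n = 200\).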
<p>Thus, as \(\gamma\) becomes larger (and the learning model becomes more over-parameterized), the bias will approach \(\|\beta\|^2\) and the variance will approach zero.
This isn’t really good news for the isotropic case…
The bias rapidly approaches \(\|\beta\|^2\) as \(\gamma\) grows, which will make it impossible for the risk to be small.</p>
<p>It’s possible for the excess risk to decrease as \(\gamma\) grows in the case where the signal to noise ratio \(\frac{\|\beta\|^2}{\sigma^2}\) is large, but the excess risk will still be worse than it would be in parts of the classical regime where \(\gamma < 1\).</p>
<p>As with BLLT19, to see the benefits of overfitting, we need to look at how the variances decay in the anisotropic setting.</p>
<h3 id="intuition-for-the-general-case">Intuition for the general case</h3>
<p>We’ll continue to think of \(c_0 \gamma\) as something that decays to zero as \(\gamma\) becomes large.
If that weren’t the case and \(c_0 \gamma\) were large, then each term \(\frac{1}{1 + \gamma c_0 \lambda_j}\) would be small for \(1 \leq j \leq p\), and it’s impossible for their average to be \(1 - \frac{1}{\gamma}\), since that’s close to one.</p>
<p>Now, we’ll talk through each of the components of the predicted bias and variance to speculate in a hand-wavy way about how this result applies.</p>
<p>Let’s start with the variance term.</p>
<ul>
<li>First, if we think of \(\gamma c_0\) as something like \(\frac{1}{\gamma - 1}\) (or at least something that decays as \(\gamma\) increases), then the variance goes to zero as the model becomes more over-parameterized.
This checks out with our intuition from BLLT19 and BHX19.</li>
<li>Also intuitively, the variance drops if the noise \(\sigma\) drops.
If there’s no noise, then all of the model’s error will come from the bias.</li>
<li>
<p>Now, the hard part.</p>
\[\frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]
<p>will be thought of roughly as corresponding to the rate of component variance decay.
Since \(\gamma c_0\) is small and the variances \(\lambda_j\) are bounded above, most of the \((1 + \gamma c_0 \lambda_j)^2\) terms should be close to 1.
Making that sketchy simplification, we instead have</p>
\[\frac{\sum_{j=1}^p \lambda_j^2}{\sum_{j=1}^p \lambda_j}.\]
<p>This looks sorta similar to the \(R_0(\Sigma)\) term from BLLT, except that it would square the denominator.
The term (and hence, the variance) is small when there’s a gap between the high-variance components and the low-variance components, or when some \(\lambda_j\)’s are much larger than other \(\lambda_j\)’s.
This corresponds to the requirement from BLLT19 that the decay must be sufficiently fast.</p>
</li>
</ul>
<p>Thus, you get small variance if there’s some combination of heavy over-parameterization, low noise, and rapid decay of variances.
Now, we look at bias.</p>
<ul>
<li>
<p>The first term</p>
\[1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]
<p>will be roughly 1 when the model is over-parameterized (because of \(\gamma c_0\)) or when the variances \(\lambda_i\) drop sufficiently fast.</p>
</li>
<li>
<p>The final term</p>
\[\sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]
<p>looks at the correlations between “important” directions in the true parameters \(\beta\) and variances \(\lambda_j\).
If we again treat \((1 + \gamma c_0 \lambda_j)^2 \approx 1\), then this term is \(\sum_{j=1}^p \beta_j^2 \lambda_j\).
This is approximately \(\|\beta\|^2\) (and thus large) if most of the weight of \(\beta\) lies in high-variance directions.
It will instead be small if the weight of \(\beta\) is sufficiently spread across many medium-importance components.
This seems analogous to the BLLT19 requirement that the decay of weights not be too rapid.</p>
</li>
</ul>
<!-- ### One other case
To try to make the intuition for the bias term make sense, I'll go over one more specific case, where different distributions of weight over the parameter vector $$\beta$$ will lead to different levels of acceptable over-parameterization.
-->
<p>Thanks for reading this blog post! As always, let me know if you have thoughts or feedback.
(As of now, there’s no way to comment on the blog. My original attempt with Disqus led to the introduction of a bunch of terrible ads to this blog. I’ll be back with something soon, which will hopefully be less toxic.)</p>Clayton Sanford[OPML#3] MVSS19: Harmless interpolation of noisy data in regression2021-07-16T00:00:00+00:002021-07-16T00:00:00+00:00http://blog.claytonsanford.com/2021/07/16/mvss19<p><em>This is the third of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>This is a <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">2019 paper</a> by Muthukumar, Vodrahalli, Subramanian, and Sahai, which will be known as <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a>.
Like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, it considers the question of when least-squares linear regression performs well in the over-parameterized regime.
One of the great things about this paper is that it goes beyond giving mathematical conditions needed for a low expected risk of interpolation.
It additionally suggests intuitive mechanisms for how it works, which helps motivate the conditions that BLLT19 impose.</p>
<h2 id="overview">Overview</h2>
<p>To recap, we’ve so far studied two settings where double-descent occurs in linear regression:</p>
<ul>
<li>The <em>misspecified setting</em>, where the under-parameterized model lacks access to features of the data that are essential for predicting the label \(y\). BHX19 studies this setting.
Success in the over-parameterized setting depends on the increased access to data components causing decreased variance for the predictors.</li>
<li>The setting where the variances of the components of input \(x\) decay at a rate that is neither too slow nor too fast. This is explored in BLLT19.</li>
</ul>
<p>MVSS19 studies a similar setting to BLLT19 with decreasing variances.
They do so by treating the over-parameterized learning model as the process of choosing between <em>aliases</em>—hypotheses that perfectly interpolate (or fit) the training samples and minimize empirical risk.
As the complexity of a model increases beyond the point of overfitting, the number of aliases increases rapidly, which means that an empirical-risk-minimizing algorithm (like least-squares regression) has many learning rules to choose from, some of which might have good generalization properties.</p>
<p><img src="/assets/images/2021-07-16-mvss19/alias.jpeg" alt="" /></p>
<p>This paper answers two questions for a broad category of data models:</p>
<ol>
<li><strong>What is the population error of the best \(d\)-parameter linear learning rule \(f: \mathbb{R}^d \to \mathbb{R}\) that interpolates all \(n\) training samples (that is, \(f(x_i) = y_i\) for all \(i \in [n]\))?</strong> They answer this question in Section 3 by characterizing the “fundamental price of interpolation.” In doing so, they show that it is essential that \(d \gg n\) for an interpolating solution to perform well. That is, dramatic over-parameterization is necessary for any learning algorithm to obtain a rule that fits the training samples and has a low expected risk.</li>
<li>
<p><strong>When does the over-parameterized least-squares algorithm choose a good interpolating classifier?</strong> While (1) tells us that there exists some alias with low risk when \(d \gg n\), it doesn’t tell us whether this particular learning algorithm will find it. They introduce a framework in Section 4 for analyzing <em>signal bleed</em> (when the true signal present in the training samples is distributed among many aliases, leaving none of them with an adequate amount of signal) and <em>signal contamination</em> (when the noise from the training samples corrupts the chosen alias). This framework justifies the “not too fast/not too slow” conditions from BLLT19 and argues that a gradual decay of variances is necessary to ensure that least-squares obtains a learning rule that neither ignores the signal nor is corrupted by noise.</p>
<p><em>Note: The paper actually considers a general covariance matrix \(\Sigma\) for the inputs \(x_i\) and does not require that each of the \(d\) components be uncorrelated with all others.
Thus, instead of considering the rate of decay of the variances of each independent component, this paper (and BLLT19) instead consider the rate of decay of the eigenvalues of \(\Sigma\).
It’s then possible for favorable interpolation to occur when in cases where every component of \(x_i\) has the same variance, but the eigenvalues of \(\Sigma\) decay at a gradual rate because of correlations between components.</em></p>
</li>
</ol>
<p>They have plenty of other interesting stuff too.
The end of Section 4 discusses Tikhonov (ridge) regression, which adds a regularization terms and does not overfit, but does outperform least-squares interpolation for a proper choice of regularization parameters.
Section 5 focuses on a broader range of interpolating regression algorithms (such as <em>basis pursuit</em>, which minimizes \(\ell_1\) error rather than the \(\ell_2\) error of least-squares) and proposes a hybrid method between the \(\ell_1\) and \(\ell_2\) approaches that obtains the best of both worlds.
However, for the sake of simplicity, we’ll keep this summary to the two questions above.</p>
<h2 id="what-can-go-wrong-with-interpolation">What can go wrong with interpolation?</h2>
<p>Towards answering these questions, the authors identify three broad cases when interpolation approaches fail.</p>
<h3 id="failure-1-too-few-aliases">Failure #1: Too few aliases</h3>
<p>If \(d\) is not much larger than \(n\), then the model is over-parameterized, but only just.
As a result, there are relatively few aliases that interpolate all of the samples \((x_i, y_i)\). (This roughly corresponds to the second and third panels of the above graphic.)
Frequently, none of these will be any good, since they might all fall into the typical pitfalls of overfitting: in order to perfectly fit the samples, the underlying trend in the data is missed.</p>
<p>Noisy labels (\(y_i = \langle x_i, \beta\rangle + \epsilon_i\) for random \(\epsilon_i\) with variance \(\sigma^2\)) exacerbate these issues.
If few aliases are available, most of them will be heavily affected by the noisy samples.
Indeed, the authors of this paper argue that the only way to ensure the existence of an interpolating learning rule that is not knocked askew by the noise is to have many aliases.
Thus, interpolation will not work without over-parameterization; we must require that \(d \gg n\).
More on this later.</p>
<h3 id="failure-2-signal-bleed">Failure #2: Signal bleed</h3>
<p>In this case, we have plenty of aliases, but they’re all different:</p>
<p><img src="/assets/images/2021-07-16-mvss19/fail2.jpeg" alt="" /></p>
<p>The above image shows that there are three different interpolating solutions that fit the orange points, but they are uncorrelated with one another.</p>
<p>(<em>Sidebar: These aliases don’t look like linear functions, but that’s because they’re being applied to the Fourier features of the input. This will be discussed later.</em>)</p>
<p>Suppose the true learning rule is represented by the cyan constant-one alias.
We’re doomed if the learning algorithm chooses the purple or red aliases because those are uncorrelated with the cyan alias and will label the data with no better accuracy than chance.
The least-squares algorithm will produce a learning rule that averages all three together, which also poorly approximates the true curve.
This phenomenon is known as <em>signal bleed</em>, because the helpful signal provided by the data is diluted by being distributed among several uncorrelated aliases.</p>
<p>To avoid signal bleed, the learning algorithm needs to somehow be biased in favor of lower-frequency or simpler features.
This is why the BLLT19 paper requires that the variances of each component decay at a sufficiently fast rate.
If they don’t, then there is no way to break ties among uncorrelated aliases, which dooms them to a bad solution.</p>
<h3 id="failure-3-signal-contamination">Failure #3: Signal contamination</h3>
<p>Suppose once again, we’re in a setting with many different aliases, some of which are uncorrelated with one another.
If we account for the noise \(\epsilon_i\) added to each label, then every one of the aliases will be corrupted to some degree.
Ideally, we want to show that as the number of samples and number of parameters become large, the impact of the noise on the chosen interpolating alias will be minor.</p>
<p>For this to be possible, we have to ensure that the noise is diluted among the different aliases.
This is the opposite of what we want for the signal!
We know that the noise will corrupt the aliases, but if there are many uncorrelated aliases, the corruption can either be relatively evenly distributed among the different aliases (<em>noise dissipation</em>) or concentrated in a few (<em>signal contamination</em>).
The first case can then be used to argue that any alias chosen by the learning algorithm will be minimally affected by noise, which is great!</p>
<p>One way to ensure that noise is diluted among aliases is to impose some degree of similar weight on aliases under consideration.
In the land of BLLT19, this means guaranteeing that the rate of decay of variances is not <em>too</em> fast.
This poses the trade-off explored in BLLT19 and here: There’s a sweet spot in the relative importance of different features from the perspective of the learning algorithm that must be found in order to avoid either signal bleed or signal contamination.</p>
<p><img src="/assets/images/2021-07-16-mvss19/fail3.jpeg" alt="" /></p>
<p>Before jumping in to these results more formally, we introduce two data models that we’ll refer back to.</p>
<h2 id="data-models">Data models</h2>
<p>In both cases, inputs \(x\) are chosen from some procedure and label \(y = \langle x, \beta\rangle + \epsilon\), where \(\beta\) is the unknown true signal and \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) is independent Gaussian noise.
We let \(X \in \mathbb{R}^{n \times d}\), \(Y \in \mathbb{R}^n\), and \(W \in \mathbb{R}^n\) contain all of the training inputs, labels, and noise respectively.</p>
<p>The least-squares algorithm returns the minimum-norm \(\hat{\beta} \in \mathbb{R}^d\) that interpolates the training data: \(X \hat{\beta} = Y\).</p>
<p><em>Note: This notation is slightly different than the notation used in their paper. I modified to make it line up more closely with BHX19 and BLLT19.</em></p>
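To make the least-squares rule concrete, here's a minimal numpy sketch (sizes are arbitrary choices of mine) showing that the pseudoinverse solution both interpolates the training data exactly and has the smallest norm among interpolating solutions:

```python
import numpy as np

# Illustrative sizes: n samples, d features, over-parameterized since d > n.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# np.linalg.pinv gives the Moore-Penrose pseudoinverse, so X⁺ Y is the
# minimum-ℓ2-norm solution of the underdetermined system X β = Y.
beta_hat = np.linalg.pinv(X) @ Y

# The solution interpolates the training data exactly ...
assert np.allclose(X @ beta_hat, Y)

# ... and has no larger norm than any other interpolating solution, e.g. one
# shifted by a vector in the null space of X.
null_dir = np.eye(d)[0] - np.linalg.pinv(X) @ (X @ np.eye(d)[0])
other = beta_hat + null_dir
assert np.allclose(X @ other, Y)
assert np.linalg.norm(beta_hat) <= np.linalg.norm(other)
```

Any other interpolator differs from \(\hat{\beta}\) by a null-space vector, which can only increase the norm.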
<h3 id="model-1-gaussian-features">Model #1: Gaussian features</h3>
<p>Every input \(x \in \mathbb{R}^d\) is drawn independently from a multivariate Gaussian \(\mathcal{N}(0, \Sigma)\), where \(\Sigma \in \mathbb{R}^{d\times d}\) is a covariance matrix.</p>
<h3 id="model-2-fourier-features">Model #2: Fourier features</h3>
<p>For any \(j \in [d]\), we define the \(j\)th <em>Fourier feature</em> to be a function \(\phi_j: [0, 1] \to \mathbb{C}\) with \(\phi_j(t) = e^{2\pi (j-1) i t}\).
Because \(e^{iz} = \cos(z) + i \sin(z)\), \(\phi_j(t)\) can be thought of as a sinusoidal function with frequency increasing with \(j\).
For any \(t \in [0, 1]\), its Fourier features are \(\phi(t) = (\phi_1(t), \dots, \phi_d(t)) \in \mathbb{C}^d\).</p>
<p>Notably, these features are orthonormal and uncorrelated.
That is,</p>
\[\langle \phi_j, \phi_k \rangle = \mathbb{E}_{t \sim \text{Unif}[0, 1]}[\phi_j(t) \overline{\phi_k(t)}] = \begin{cases}
1 & \text{if } j=k, \\
0 & \text{otherwise.}
\end{cases}\]
<p>To learn more about orthonormality and why it’s a desirable trait in vectors and functions, check out <a href="/2021/07/16/orthogonality.html" target="_blank">my post</a> on the subject.</p>
<p>We generate the training samples by choosing \(n\) evenly spaced points on the interval \([0, 1]\): \(t_j = \frac{j-1}{n}\) for all \(j \in [n]\).
The features of the \(j\)th sample are \(x_j = \phi(t_j) = (1, e^{2 \pi i t_j}, e^{2\pi(2i) t_j}, \dots, e^{2\pi((d-1)i) t_j}) \in \mathbb{C}^d\).
The feature vectors for each sample are also orthonormal: \(\langle x_j, x_k \rangle = 1\) if \(j = k\) and \(0\) otherwise.</p>
<p>The below image gives a visual of the sinusoidal interpretation of Fourier features and the training samples:</p>
<p><img src="/assets/images/2021-07-16-mvss19/fourier.jpeg" alt="" /></p>
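The orthonormality claims above are easy to verify numerically. In the sketch below (sizes are my own illustrative choices), I normalize the inner product between sample feature vectors by \(1/d\), mirroring the average over \(t\) in the expectation above, so that the Gram matrix comes out to exactly the identity:

```python
import numpy as np

n, d = 8, 40   # illustrative sizes; d is a multiple of n
t = np.arange(n) / n                          # evenly spaced points t_j = (j-1)/n
freqs = np.arange(d)                          # frequencies 0, ..., d-1
X = np.exp(2j * np.pi * np.outer(t, freqs))   # X[j, m] = phi_{m+1}(t_j)

# Normalized inner products between sample feature vectors: averaging over the
# d features makes the Gram matrix exactly the identity, i.e. the x_j are
# orthonormal under this normalization.
gram = (X @ X.conj().T) / d
assert np.allclose(gram, np.eye(n))
```

Each off-diagonal entry is a geometric sum of roots of unity, which vanishes whenever \(d\) is a multiple of \(n\).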
<h2 id="the-necessity-of-over-parameterization">The necessity of over-parameterization</h2>
<p>Section 3 of the paper studies the “fundamental price of interpolation” by asking how good the best interpolating classifier can be.
Specifically, they consider \(\mathcal{E}^*\), the ideal test risk over all interpolating classifiers:</p>
\[\mathcal{E}^* = \min_{\beta \in \mathbb{R}^d: X \beta = Y} \mathbb{E}_{(x, y)}[(y - \langle x, \beta\rangle)^2] - \sigma^2.\]
<p>The condition \(X \beta = Y\) ensures that \(\beta\) does indeed fit all of the training samples.
The variance of the noise \(\sigma^2\) is subtracted because no classifier can ever hope to have risk better than the noise, since every label will be corrupted.</p>
<p>They prove upper- and lower-bounds on \(\mathcal{E}^*\) that hold with high probability. In particular, by Corollaries 1 and 2, with probability 0.9:</p>
<ul>
<li>Under the Gaussian features model, \(\mathcal{E}^* = \Theta(\frac{\sigma^2 n}{d})\).</li>
<li>Under the Fourier features model, \(\mathcal{E}^* = \Omega(\frac{\sigma^2 n}{d \log n})\).</li>
</ul>
<p>Therefore, in order to guarantee that the risk approaches the best possible as \(n\) and \(d\) grow, it must be the case that \(d \gg \sigma^2 n\).
That is, it’s essential for the model to be over-parameterized for the interpolation to be favorable.
This formalizes Failure #1 by highlighting that without enough aliases (which are provided by having a highly over-parameterized model), even the best alias will have poor performance.</p>
<p>These proofs first use linear algebra to exactly represent \(\mathcal{E}^*\) in terms of inputs \(X\), covariance \(\Sigma\), and noise \(\epsilon\).
Then, they apply concentration bounds to show that the risk is close to its expectation with high probability over the input data and the noise.</p>
<h2 id="not-too-fast-not-too-slow">Not too fast; not too slow</h2>
<p>Here, we recap Section 4 of the paper while studying the Fourier features setting.
In doing so, we explain how Failures #2 and #3 can occur.
We focus on Fourier features because their orthogonality properties make the concepts of signal bleed and signal contamination much cleaner.</p>
<h3 id="signal-bleed">Signal bleed</h3>
<p>Consider a simple learning problem where each \(x\) is a Fourier feature and \(y = 1\) no matter what. (There is no noise here.)
In this case, our samples will be of the form \((\phi(t_1), 1), \dots, (\phi(t_n), 1)\) for \(t_1, \dots, t_n\) evenly spaced in \([0, 1]\).</p>
<p>First, we ask ourselves which solutions will interpolate between the samples.
Since the \(j\)th Fourier feature is the function \(\phi_j(t) = e^{2\pi (j-1) i t}\), the first Fourier feature \(\phi_1(t) = 1\) is an interpolating alias.
(It’s also the correct alias.)
However, so is \(\phi_j\) whenever \(j-1\) is a nonzero multiple of \(n\); each such alias is orthogonal (uncorrelated) to the first feature (and to all other Fourier features).
If there are \(d\) Fourier features and \(n\) samples for \(d \gg n\), there are \(\frac{d}{n}\) interpolating aliases, all of which are orthogonal.</p>
<p>This is a problem.
This forces our algorithm to choose between \(\frac{d}{n}\) different candidate learning rules, all of which are completely uncorrelated with one another, without having any additional information about which one is best.
Indeed, the interpolating learning rule can be any function of the form \(\sum_{j = 0}^{d/n} a_j \phi_{nj+1}(t)\) for \(\sum_{j = 0}^{d/n} a_j = 1\).</p>
<p>How does the least-squares algorithm choose a parameter vector \(\beta\) from all of these interpolating solutions?
It chooses the one with the smallest \(\ell_2\) norm. By properties of orthogonality, this is equivalent to choosing the function minimizing \(\sum_{j = 0}^{d/n} a_j^2\), which is minimized by spreading the weight evenly: \(a_j = \frac{n}{d}\).
This means that \(\beta_1 = \frac{n}{d}\).
Equivalently, the true feature \(\phi_1\) contributes only an \(\frac{n}{d}\) fraction of the influence on the learning rule, which diminishes as \(d\) grows and the model becomes further over-parameterized.</p>
<p>This is why we refer to this failure mode (Failure #2) as <em>signal bleed</em>: the signal conveyed in \(\phi_1\) bleeds into all other \(\phi_{jn + 1}\) until the true signal has almost no bearing on the outcome.</p>
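This calculation is easy to check numerically. The sketch below (sizes illustrative) runs min-norm least squares on Fourier features with constant labels and confirms that the weight is split evenly, \(\frac{n}{d}\) apiece, across the \(d/n\) aliased frequencies:

```python
import numpy as np

n, d = 8, 64   # d/n = 8 interpolating aliases
t = np.arange(n) / n
X = np.exp(2j * np.pi * np.outer(t, np.arange(d)))   # Fourier feature matrix
Y = np.ones(n)                                       # true signal is phi_1 = 1

# Min-norm interpolation spreads the constant signal evenly across the aliased
# frequencies 0, n, 2n, ..., giving each a coefficient of exactly n/d.
beta_hat = np.linalg.pinv(X) @ Y
assert np.allclose(X @ beta_hat, Y)                       # interpolates exactly
assert np.isclose(beta_hat[0], n / d)                     # survival of phi_1
assert np.allclose(beta_hat[np.arange(0, d, n)], n / d)   # each alias gets n/d
assert np.allclose(np.delete(beta_hat, np.arange(0, d, n)), 0)
```

So the true feature's coefficient really does shrink like \(n/d\) as the model becomes more over-parameterized.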
<p><strong>How can this be fixed?</strong> By giving a higher weight to “simpler” features in order to indicate some kind of preference for these features.
The higher weight permits the \(\ell_2\) norm of a classifier to contain a large amount of influence \(\phi_1\) without incurring a high cost.</p>
<p>To make this concrete, let’s rescale each \(\phi_j\) such that \(\phi_j(t) = \sqrt{\lambda_j} e^{2\pi (j-1) i t}\).
Now, the interpolating aliases are \(\frac{1}{\sqrt{\lambda_j}} \phi_j\) whenever \(j\) is one more than a multiple of \(n\), which means that the higher-frequency features will be more costly to employ.
This time, we can express any interpolating learning rule in the form \(\sum_{j = 0}^{d/n} \frac{a_j}{\sqrt{\lambda_{nj+1}}} \phi_{nj+1}(t)\) for \(\sum_{j = 0}^{d/n} a_j = 1\).
Least-squares will then choose the learning rule whose \(a_j\) values minimize \(\sum_{j = 0}^{d/n} \frac{a_j^2}{\lambda_{nj+1}}\).
This is done by taking:</p>
\[a_j = \frac{\lambda_{jn+1}}{\sum_{k=0}^{d/n} \lambda_{kn +1}}.\]
<p>Going back to our Fourier setting where \(\phi_1\) is the only true signal, our classifier will perform best if \(a_0 \approx 1\), which occurs if \(\frac{\lambda_1}{\sum_{k=0}^{d/n} \lambda_{kn +1}} \to 1\) as \(n\) and \(d\) become large.
(The quantity that must approach 1 is known as the <em>survival factor</em> in this paper.)
For this to be possible, there must be a rapid drop-off in \(\lambda_j\) as \(j\) grows.</p>
<p>Interestingly, this coincides with BLLT19’s requirements for “benign overfitting.”
The survival factor coincides with the inverse of their \(r_0(\Sigma)\) term, which captures the gap between the largest variance and the sum of the other variances.
As was discussed in <a href="/2021/07/11/bllt19.html" target="_blank">that blog post</a>, the quantity must be much smaller than \(n\) for their bound to be non-trivial.</p>
<p>Figure 5 of their paper provides a nice visualization of how dropping the weights on high-frequency can lead to better interpolating solutions that avoid signal bleed.
The top plot has a large gap between the weights on the low-frequency features and the high-frequency features, which prevents least-squares from giving too much preference to the high-frequency features that just happen to interpolate the training data.
The bottom plot, which lacks such a gap, produces a spiky and inconsistent fit.</p>
<p><img src="/assets/images/2021-07-16-mvss19/bleed.jpeg" alt="" /></p>
<p>This logic seems circular somehow: in order to have good interpolation, we must be able to select for the good features and weight them strongly enough so that their aliases override orthogonal aliases.
However, if we know the good features, why include the bad features in the first place?
The next part discusses why, in the interpolation regime, it’s important not to let the importance of features (represented by \(\lambda_j\)) drop too rapidly.</p>
<h3 id="signal-contamination">Signal contamination</h3>
<p>In the previous section, we were concerned about the “true signal” of \(\phi_1\) being diluted by the preference of least-squares for higher-frequency Fourier features.
To combat that, it was necessary to drop the variances of the high-frequency features by some sequence \(\lambda_j\) that decreases sufficiently quickly.</p>
<p>Here, we’re concerned with the opposite issue: the incorrect influence of orthonormal high-frequency aliases and noise on the learning rule inferred by least-squares.
In this Fourier features setting, all contributions from other aliases will necessarily increase the risk because the other aliases are all orthogonal to the signal \(\phi_1\).
As before, we can quantify the minimum error caused by the inclusion of other aliases in the prediction, which we’ll call the <em>contamination</em>:</p>
\[C = \sqrt{\sum_{k = 1}^{d/n} \hat{\beta}_{kn + 1}^2}.\]
<p>In the case of least-squares regression, we have:</p>
\[C = \frac{\sqrt{\sum_{k=1}^{d/n} \lambda_{kn+1}}}{\sum_{k = 0}^{d/n} \lambda_{kn +1}}.\]
<p>We’re interested in finding weights \(\lambda_j\) that ensure the contamination \(C\) becomes very small in the regime where \(d\) and \(n\) are very large.
One way to do so is to choose \(\lambda_j\) such that \(\sqrt{\sum_{k=1}^{d/n} \lambda_{kn+1}} \ll\sum_{k = 1}^{d/n} \lambda_{kn +1}\), which occurs when the sum of weights is large and the decay of \(\lambda\) is heavy-tailed.
That is, to avoid having spurious features have a lot of bearing on the final learning rule, one can require that \(\lambda\) decays very slowly, so that the lower-frequency spurious features are not given much more weight than the higher-frequency features.</p>
<p>Taken together, this section and the previous one impose a trade-off in how features should be weighted.</p>
<ul>
<li>To avoid signal bleeding, it’s necessary for a relatively small number of features to have much more weight than the rest of them.</li>
<li>To avoid signal contamination, the remaining features need to jointly have a large amount of weight and the weights cannot decay too quickly.</li>
</ul>
<p>This is the same trade-off presented by BLLT19 with their \(r_k(\Sigma)\) and \(R_k(\Sigma)\) terms.
For their bounds to be effective, it’s necessary to have that \(r_0(\Sigma) \ll n\) (prevent signal bleed by mandating decay of feature variances) and \(R_{k^*}(\Sigma) \gg n\) where \(k^*\) is a parameter that divides high-variance and low-variance features (prevent signal contamination by requiring that the variances decay sufficiently slowly).</p>
<h2 id="conclusion-and-next-steps">Conclusion and next steps</h2>
<p>Like the other papers discussed so far, the results of this paper apply to a very clean setting.
The Fourier features examples illustrate the contamination-vs-bleed trade-off especially transparently because the orthogonality of the features means that all features other than the signal are strictly detrimental.
Still, this paper is nice because it motivates the mathematical conditions specified in BLLT19 and gives more intuition into when one should expect least-squares interpolation to succeed.</p>
<p>The paper suggests that further works focus on the powers of approximation of more complex models and how they relate to success in the interpolation regime.
This is where there’s a key difference between BHX19 and BLLT19/MVSS19.
The over-parameterized models in the former explicitly have more information in comparison to their under-parameterized counterparts, so they have a clear advantage in the kinds of functions they can approximate.
On the other hand, the success of over-parameterized models in BLLT19 and MVSS19 is solely dependent on the relative variances of the many features; these results say nothing about the fact that most over-parameterized models can express more kinds of functions.
The authors hope that future work continues to study interpolation through the lens of signal bleed and signal contamination, but that they also find a way to work in the real approximation theoretic advantages that over-parameterized models maintain over other models.</p>
<p>I personally enjoyed reading this paper a lot, because I found it very intuitive and well-written. I’d recommend checking it out directly if you find this interesting!</p>