<h1 id="what-does-the-r-norm-reveal-about-efficient-neural-network-approximation">What does the ‘R-norm’ reveal about efficient neural network approximation? (COLT 2023 paper with Navid and Daniel)</h1>

<p><em>Clayton Sanford, Clayton’s Blog, 2023-07-12</em></p>

<p><em>Well, in the <a href="/2022/07/30/hssv22.html" target="_blank">paper summary</a> I posted last year, I promised more posts, none of which materialized.
So I won’t promise anything this time around; I enjoy writing these posts, but it’s hard to find the time with all of the other grad school and life things going on.</em></p>
<p><em>Anyways, here goes another paper summary on <a href="https://arxiv.org/abs/2206.05317" target="_blank">a work</a> that is published at this year’s COLT in Bangalore (!!).
A big reason for writing this post is to explain the “story” of the paper in as clear a way as possible, without getting too bogged down in the technical details. (Note the time gap between when we posted the paper on arXiv and its publication in July at COLT. Needless to say, it took us a while to figure out the right way to tell the story.)</em></p>
<h2 id="quantifying-efficient-approximation-width-vs-weight-norm">Quantifying efficient approximation: width vs weight norm</h2>
<p>As I’ve discussed in <a href="/2021/08/15/hssv21.html" target="_blank">a past summary</a>, we can consider three rough mathematical problems about deep learning theory: <strong>approximation</strong> (what types of mathematical functions can be represented by neural networks), <strong>optimization</strong> (how gradient-based learning algorithms converge to neural networks that fit the training data), and <strong>generalization</strong> (how a trained network performs on never-before-seen data).</p>
<p>If we focus on approximation, the first question one asks is whether there exists a neural network that approximates a function, and the answer to that is almost always yes, due to famous <a href="https://www.sciencedirect.com/science/article/abs/pii/0893608089900038" target="_blank">universal</a> <a href="https://www.semanticscholar.org/paper/Multilayer-feedforward-networks-are-universal-Hornik-Stinchcombe/f22f6972e66bdd2e769fa64b0df0a13063c0c101" target="_blank">approximation</a> <a href="https://link.springer.com/article/10.1007/BF02551274" target="_blank">results</a>.
The second question to ask is whether there exists a <strong>reasonably sized</strong> neural network that approximates the function.
This question yields more nuanced results, with a wealth of positive and negative results, but its answers hinge on how we define “reasonably sized.”</p>
<p>The typical approach is to quantify the size of a neural network by the size of the graph needed to compute it: that is, by the number of neurons, its width, or its depth.
This shows itself in <a href="/2021/08/15/hssv21.html" target="_blank">my past work on universal approximation</a>, as well as well-known <strong>depth-separation</strong> papers (e.g. <a href="http://proceedings.mlr.press/v49/telgarsky16.html" target="_blank">Telgarsky16</a>, <a href="http://proceedings.mlr.press/v49/eldan16.html" target="_blank">ES16</a>, <a href="http://proceedings.mlr.press/v65/daniely17a.html" target="_blank">Daniely17</a>).
In these works, we regard a two-layer neural network</p>
\[g(x) = \sum_{i=1}^m u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)})\]
<p>with ReLU activations (\(\sigma(z) = \max(0, z)\)) as an efficient approximation of some function \(f\) if \(\|f - g\|\) is small (with respect to some norm, possibly \(L_2\) over some distribution or \(L_\infty\)) and if the width \(m\) is bounded, specifically if \(m = \mathrm{poly}(d)\) for input dimension \(d\) (i.e. \(x \in \mathbb{R}^d\)).</p>
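<p>In code, such a network is just a weighted sum of ReLUs. Here’s a minimal sketch (generic names of my own choosing, not from any paper):</p>

```python
import numpy as np

def two_layer_relu(x, W, b, u):
    """g(x) = sum_i u_i * relu(<w_i, x> + b_i) for a width-m network.

    W is (m, d), one weight vector w_i per row; b is (m,); u is (m,).
    """
    return float(u @ np.maximum(0.0, W @ x + b))

# Width-2 sanity check: relu(z) - relu(-z) = z, so this network outputs x[0].
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
u = np.array([1.0, -1.0])
```

<p>For instance, <code>two_layer_relu(np.array([0.7, -2.0]), W, b, u)</code> returns <code>0.7</code>.</p>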
<p>However, this focus on width is not without its drawbacks.
The primary drawback is that even if a low-width solution exists, there’s no guarantee gradient descent will find it.
(Note that when we talk about the <em>bias</em> of a learning algorithm, we’re talking about how it breaks ties. Since we often live in the <a href="/2021/07/04/candidacy-overview.html" target="_blank">over-parameterized regime</a>, there are generally many different networks that attain zero training loss. The bias of the algorithm determines how the learning algorithm selects one.)
More generally, it’s not even apparent that gradient descent should be biased in favor of low-width solutions.
Since we train neural networks by making continuous weight updates rather than “pruning” neurons (by setting specific weights \(u^{(i)} = 0\) algorithmically), it’s unlikely that it will converge exactly to a low-width network if such an approximation exists.</p>
<p>So if not width, how can we quantify efficient approximation?
One alternative approach, given in <a href="https://arxiv.org/pdf/1902.05040.pdf" target="_blank">this paper</a> by Savarese, Evron, Soudry, and Srebro, is to instead quantify neural network size by the norm of its weights.
The \(\mathcal{R}\)-norm quantity proposed by SESS is roughly</p>
\[\|g\|_{\mathcal{R}} = \frac{1}{2}\sum_{i=1}^m \left( u^{(i)2} + \| w^{(i)} \|_2^2 \right).\]
<p>By the homogeneity of the ReLU, replacing \((u^{(i)}, w^{(i)}, b^{(i)})\) with \((c u^{(i)}, w^{(i)}/c, b^{(i)}/c)\) for any \(c > 0\) preserves the network output, and minimizing over such rescalings shows that the quantity above equals \(\sum_{i=1}^m \lvert u^{(i)}\rvert \, \|w^{(i)}\|_2\). If we further assume that each weight vector satisfies \(\|w^{(i)}\|_2 = 1\), then the expression simplifies to</p>
\[\|g\|_{\mathcal{R}} = \| u \|_1.\]
<p>The definition of \(\mathcal{R}\)-norm is a little more technical; check out our paper to see how we define two slightly different variants of the norm.
Notably, our formulation of the R-norm doesn’t count bias terms \(b^{(i)}\), to the chagrin of <a href="http://mjt.cs.illinois.edu/" target="_blank">some researchers</a>.</p>
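<p>Concretely, for a fixed representation this simplified quantity is \(\sum_i \lvert u^{(i)}\rvert \cdot \|w^{(i)}\|_2\), which reduces to \(\|u\|_1\) once the weight vectors are normalized. A small sketch (my own code) that also checks the rescaling invariance:</p>

```python
import numpy as np

def r_norm_of_representation(W, u):
    """Simplified R-norm of one representation: sum_i |u_i| * ||w_i||_2.

    Rescaling a neuron (u_i -> c*u_i, w_i -> w_i/c) leaves this invariant,
    and with unit-norm rows it is exactly ||u||_1. Biases are not counted.
    The true R-norm takes an infimum over all representations of g, so for
    a fixed network this is only an upper bound.
    """
    return float(np.abs(u) @ np.linalg.norm(W, axis=1))

W = np.array([[3.0, 4.0], [0.0, 2.0]])
u = np.array([2.0, -1.0])
scales = np.array([10.0, 0.5])
W_rescaled = W / scales[:, None]   # w_i -> w_i / c_i
u_rescaled = u * scales            # u_i -> c_i * u_i
```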
<p>Given this description, we only care about the width \(m\) insofar as it increases the \(\mathcal{R}\)-norm of the network; this framing rules out the above concern of a low-width, high-weight solution that gradient descent struggles to converge to, while opening the possibility of a very high-width (or even infinite-width) neural network with small-weight neurons.</p>
<p>Why use this \(\mathcal{R}\)-norm minimizing framing?
For one, it fits nicely with studies of explicit regularization of both neural networks and other models.
When learning linear models of the form \(x \mapsto w^T \phi(x)\) for feature mapping \(\phi: \mathbb{R}^d \to \mathbb{R}^m\), it’s easy to regularize the target function by adding a norm penalty to \(w\).
If the goal is to rely on a small number of the \(m\) features produced by \(\phi\), then adding an \(\ell_1\) penalty term will likely be more effective than some kind of more direct feature-pruning approach.
Neural networks often are trained with explicit regularization of the norms of their weights, providing a direct incentive to find a low-norm function that fits the data.
Moreover, there’s evidence that gradient descent has an implicit bias in favor of low-norm solutions; see papers like <a href="https://arxiv.org/abs/2110.08084" target="_blank">this one</a>.</p>
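<p>As a toy illustration of the \(\ell_1\)-penalty point for linear models (with entirely made-up data and parameters), here’s a minimal lasso-style solver via proximal gradient descent (ISTA), which zeroes out the irrelevant features automatically rather than pruning them by hand:</p>

```python
import numpy as np

def ista(Phi, y, lam=0.05, iters=2000):
    """Minimize (1/2n)||Phi w - y||^2 + lam*||w||_1 by proximal gradient."""
    n, m = Phi.shape
    step = n / np.linalg.norm(Phi, 2) ** 2          # 1/L for the smooth part
    w = np.zeros(m)
    for _ in range(iters):
        w = w - step * Phi.T @ (Phi @ w - y) / n    # gradient step on squared loss
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
n, m = 50, 20
w_true = np.zeros(m)
w_true[[2, 7, 11]] = 1.0            # only 3 of the m features matter
Phi = rng.normal(size=(n, m))
w_hat = ista(Phi, Phi @ w_true)     # l1 penalty drives the other 17 weights to 0
```

<p>The soft-thresholding step is the proximal operator of the \(\ell_1\) penalty, and it sets small coordinates exactly to zero; a hard pruning rule would have to guess the support in advance.</p>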
<h2 id="our-key-question-understanding-mathcalr-norm-minimizing-interpolation">Our key question: understanding \(\mathcal{R}\)-norm minimizing interpolation</h2>
<p>This brings us to our key question: What are the properties of the two-layer neural network \(g\) that perfectly fits a dataset while minimizing the R-norm? For a given dataset \(\mathcal{D} = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}\}\), we characterize solutions to the <em>variational problem (VP)</em></p>
\[\inf_{g} \|g\|_{\mathcal{R}} \text{ s.t. } g(x_i) = y_i \ \forall i \in [n],\]
<p>or to the <em>\(\epsilon\)-variational problem</em></p>
\[\inf_{g} \|g\|_{\mathcal{R}} \text{ s.t. } |g(x_i) - y_i| \leq \epsilon \ \forall i \in [n].\]
<p>We’re not the first group to ask this question:</p>
<ol>
<li><a href="https://arxiv.org/pdf/1902.05040.pdf" target="_blank">SESS</a> showed that when \(d = 1\), the \(\mathcal{R}\)-norm minimizing network \(g\) fitting the dataset is the one whose derivative has minimum total variation, which is satisfied by a piecewise-linear spline interpolation of the dataset.
<img src="/assets/images/2023-07-07-ahs23/spline.jpeg" alt="" />
<a href="https://arxiv.org/pdf/2109.12960.pdf" target="_blank">Subsequent work</a> by Hanin exactly characterizes this one-dimensional interpolation regime.</li>
<li>A <a href="https://arxiv.org/pdf/1910.01635.pdf" target="_blank">paper</a> by Ongie, Willett, Soudry, and Srebro characterizes the general \(d\)-dimensional case by relating the \(\mathcal{R}\)-norm to a norm based on the Radon transform, which bounds the total ReLU weight needed to represent a function \(f\) by expressing \(f\) as a bunch of integrals over subspaces with different angles. (Think: x-rays, where a scan is computed by firing a bunch of individual beams through a body at different angles.) (Image below is from OWSS.)
<img src="/assets/images/2023-07-07-ahs23/radon.png" alt="" />
While this equivalence holds, its primary implications are impossibility results (e.g. that two layer networks cannot efficiently represent certain radial functions) rather than quantitative insights into how certain datasets are fit.</li>
</ol>
<p>We decided to focus on a higher-dimensional analogue of (1) that provides more concrete quantitative results than (2), while sacrificing some generality.
We did so by focusing on a particular dataset and showing that the solution to its VP is somewhat unexpected.</p>
<h2 id="our-specific-questions-on-interpolating-parities">Our specific questions on interpolating parities</h2>
<p>The dataset is the <em>parity dataset</em>, \(\mathcal{D} = \{(x, \chi(x)): x \in \{\pm 1\}^d\}\), where \(\chi: x \in \{\pm 1\}^d \mapsto \prod_{j=1}^d x_j \in \{\pm1\}\), illustrated below for \(d = 4\).
<img src="/assets/images/2023-07-07-ahs23/parity.jpeg" alt="" />
Why this one?</p>
<ul>
<li>It’s a \(d\)-dimensional function, and every input has significant bearing on the output; flipping a single bit of \(x\) will flip the entire output of \(\chi\).</li>
<li>Yet, it has a “low intrinsic dimensionality.” \(\chi\) is a <em>ridge function</em>, which means that there exists some \(v \in \mathbb{R}^d\) and \(\psi: \mathbb{R} \to \mathbb{R}\) such that \(\chi(x) = \psi(v^T x)\). That is, there exists a one-dimensional input projection that exactly determines the output.</li>
<li>For parity, there are \(2^d\) possible choices of \(v\): any vector in \(\{\pm 1\}^d\). This follows from the fact that \(\chi\) is symmetric up to bit flips: \(\chi(v \odot x) = \chi(v) \chi(x)\).</li>
<li>The scalar function \(\psi: \mathbb{R} \to \mathbb{R}\) is a <em>sawtooth function</em>, which oscillates \(O(d)\) times between \(-1\) and \(1\).
<img src="/assets/images/2023-07-07-ahs23/sawtooth.jpeg" alt="" /></li>
</ul>
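<p>Here’s a quick numerical check (my own illustrative code) of the ridge structure described above: parity on the cube is exactly a sawtooth applied to \(\langle \vec1/\sqrt{d}, x\rangle\), and any sign vector \(v\) works as the direction up to the sign \(\chi(v)\):</p>

```python
import numpy as np
from itertools import product

def parity(x):
    """chi(x) = prod_j x_j on {±1}^d."""
    return int(np.prod(x))

def psi(t, d):
    """Sawtooth: on the cube, sqrt(d)*t = d - 2k where k = #(-1 bits),
    and chi(x) = (-1)^k, so psi flips sign as t steps by 2/sqrt(d)."""
    k = int(round((d - np.sqrt(d) * t) / 2))
    return (-1) ** k

d = 4
v = np.array([1, -1, -1, 1])    # an arbitrary direction in {±1}^d
for x in product([-1, 1], repeat=d):
    x = np.array(x)
    # chi is a ridge function in the direction 1/sqrt(d) ...
    assert parity(x) == psi(np.sum(x) / np.sqrt(d), d)
    # ... and in direction v/sqrt(d) too, since chi(v ⊙ x) = chi(v) chi(x):
    assert parity(x) == parity(v) * psi((v / np.sqrt(d)) @ x, d)
```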
<p>The “ridgeness” of the dataset is central to the motivation.
Learning such functions with gradient descent is well-studied (including by <a href="https://arxiv.org/abs/2210.15651" target="_blank">a paper</a> by yours truly with some collaborators at NYU).
Neural networks trained in the <a href="https://arxiv.org/pdf/2206.15144.pdf" target="_blank">feature learning regime</a> have substantial generalization advantages on inputs labeled by intrinsically-low dimensional targets.</p>
<p>Learning parity functions (specifically, given a dataset \(\{(x_i, y_i)\}_{i \in [n]}\), trying to find a subset of variables \(S \subseteq [d]\) such that \(\chi_S(x) = \prod_{j \in S} x_j\) best describes the dataset) is a well-known and difficult learning problem, for both neural networks and general ML theory approaches.
While parity can be learned in the noiseless case (where always \(y_i = \chi_S(x_i)\)) using Gaussian elimination with \(n = d\) samples, it’s considered a hard problem in the noisy setting (where a small fraction have flipped labels), and it’s the classic example of a learning problem with a high statistical query complexity. (See <a href="/2022/07/30/hssv22.html" target="_blank">last year’s post</a> for a brief intro to SQ.)
Gradient-based algorithms for learning even sparse parities (where \(|S| \ll d\)) with neural networks are computationally intensive, as is illustrated by <a href="https://arxiv.org/abs/2207.08799" target="_blank">this paper</a> among others.</p>
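<p>For contrast with the hard noisy setting, the noiseless case really is just linear algebra mod 2: writing \(x_j = (-1)^{z_j}\) for bits \(z_j \in \{0, 1\}\) turns \(\chi_S(x)\) into the XOR \(\sum_{j \in S} z_j \bmod 2\), so the support \(S\) solves a linear system over GF(2). A minimal sketch (my own code; it enumerates all inputs for simplicity, though \(n = d\) generic samples suffice):</p>

```python
import numpy as np
from itertools import product

def solve_gf2(A, c):
    """Gaussian elimination mod 2: return a with A @ a = c (mod 2), or None."""
    A, c = A.copy() % 2, c.copy() % 2
    n, m = A.shape
    pivots, row = [], 0
    for col in range(m):
        piv = next((r for r in range(row, n) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]], c[[row, piv]] = A[[piv, row]], c[[piv, row]]
        for r in range(n):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                c[r] ^= c[row]
        pivots.append(col)
        row += 1
    if c[row:].any():          # a zero row demands a nonzero label: inconsistent
        return None
    a = np.zeros(m, dtype=int)
    for r, col in enumerate(pivots):
        a[col] = c[r]          # free variables (if any) default to 0
    return a

d = 6
S = np.array([1, 0, 1, 0, 0, 1])                 # hidden support (made up for the demo)
Z = np.array(list(product([0, 1], repeat=d)))    # inputs in 0/1 form: x_j = (-1)^{z_j}
labels = Z[:, S == 1].sum(axis=1) % 2            # bit form of chi_S(x)
S_hat = solve_gf2(Z, labels)                     # recovers S exactly
```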
<p>There are two major questions we study about this dataset:</p>
<ol>
<li>[<strong>Approximation</strong>] What is the intrinsic dimensionality of the \(\mathcal{R}\)-norm minimizing interpolation of the dataset labeled by \(\chi\)? Specifically, since the data can be labeled by some ridge function, is the network \(g\) a ridge function as well?</li>
<li>[<strong>Generalization</strong>] What is the sample complexity of learning noiseless parities labeled by \(\chi_S\) for unknown \(S \subseteq [d]\) using \(\mathcal{R}\)-norm minimization as a learning algorithm?</li>
</ol>
<h2 id="question-1-whats-the-most-efficient-approximation">Question 1: What’s the most efficient approximation?</h2>
<p>When we first started thinking about this problem, we expected a ridge dataset of this form to have the same message as the SESS and Hanin papers: \(g\) would take the form of a ridge function \(g(x) = \phi(v^T x)\) that performs linear spline interpolation on the samples. (For simplicity, let’s assume that \(d\) is even.)</p>
<p>Given the properties of parity discussed above, a reasonable guess is that the minimum is attained by the sawtooth interpolation \(g(x) = \psi(v^T x)\). To write this as a neural network in the desired form, we take \(v = \vec{1} = (1, \dots, 1)\) and let</p>
\[g(x) = \sqrt{d}\sigma\left(\left\langle \frac{\vec1}{\sqrt{d}}, x \right\rangle + \frac{d+1}{\sqrt{d}}\right) - 2\sqrt{d}\sum_{j=0}^d (-1)^j \sigma\left(\left\langle \frac{\vec1}{\sqrt{d}}, x \right\rangle + \frac{d - 2j}{\sqrt{d}}\right).\]
<p>We visualize the ReLU construction of the sawtooth below, with the blue and purple curves adding up to the full red sawtooth.
<img src="/assets/images/2023-07-07-ahs23/sawrelu.jpeg" alt="" /></p>
<p>We can verify that (1) this is a valid 2-layer neural network with unit-norm weight vectors \(\vec1 / \sqrt{d}\), (2) the single ReLU outside of the sum ensures that \(g(-\vec1) = (-1)^d = 1\), and (3) the sum of alternating-sign ReLUs, more of which activate as the inner product \(\langle \vec1, x\rangle\) increases, ensures that \(g(x) = \chi(x)\) everywhere else.
Then, we can compute the \(\mathcal{R}\)-norm by computing the \(\ell_1\) norm of the top layer weights:</p>
\[\|g\|_{\mathcal{R}} = \sqrt{d} + (d+1) \cdot 2 \sqrt{d} = O(d^{3/2}).\]
<p>With the benchmark of an \(\mathcal{R}\)-norm of \(d^{3/2}\), we can then ask about its optimality.</p>
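<p>The construction is easy to sanity-check numerically. The sketch below (my own code) evaluates the displayed network and confirms that it computes parity exactly on the cube for a small even \(d\):</p>

```python
import numpy as np
from itertools import product

def sawtooth_ridge(x):
    """The width-(d+2) ridge network displayed above; exact parity for even d."""
    d = len(x)
    t = np.sum(x) / np.sqrt(d)                       # <1/sqrt(d), x>
    g = np.sqrt(d) * max(0.0, t + (d + 1) / np.sqrt(d))
    for j in range(d + 1):
        g -= 2 * np.sqrt(d) * (-1) ** j * max(0.0, t + (d - 2 * j) / np.sqrt(d))
    return g

d = 4
for x in product([-1, 1], repeat=d):
    x = np.array(x)
    assert abs(sawtooth_ridge(x) - np.prod(x)) < 1e-9

# l1 norm of the outer weights: sqrt(d) + (d+1) * 2*sqrt(d) = O(d^{3/2}).
r_norm = np.sqrt(d) * (1 + 2 * (d + 1))
```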
<p><strong>First follow-up: Can we do better with other ridge functions?</strong></p>
<p>That is, is there some other \(g(x) = \phi(v^T x)\) having \(\|g\|_{\mathcal{R}} = o(d^{3/2})\)?</p>
<p>Theorem 3 of our paper shows that we <em>cannot</em> do better with another ridge function.
It relies on the key fact proven by SESS: that when \(g\) is a ridge function, \(\|g\|_{\mathcal{R}} = \int_{-\infty}^\infty |\phi''(z)| dz\).
While the second derivative does not exist for functions like \(\phi\) composed of ReLU activations, we can circumvent the problem by noting that the total variation of \(\phi'\) can also be lower-bounded by showing that \(\phi\) must oscillate a certain number of times on a sufficiently short interval.
It then suffices to show that for any choice of \(v \in \mathbb{R}^d\), \(g\) would have to oscillate between \(-1\) and \(1\) at least \(d\) times on an interval of length at most \(\sqrt{d}\).
We accomplish this by picking a direction \(v\), selecting a subset of \(d+1\) samples oriented in that direction with alternating sign outputs, and using the mean value theorem to show that the derivative of \(\phi\) must take on alternately large and small values as well.</p>
<p><img src="/assets/images/2023-07-07-ahs23/lb.jpeg" alt="" /></p>
<p><strong>Second follow-up: Can we do better with any other functions, if they aren’t ridge?</strong></p>
<p>This part came as a surprise to us; we initially thought that the ridge was the best possible solution, but it turns out there’s a way to do better by sacrificing width and taking advantage of the symmetry of the parity function.</p>
<p>In Theorem 4, we show that there exists a neural network \(g\) having \(\|g\|_{\mathcal{R}} = O(d)\) that perfectly fits the dataset.
We construct \(g\) to be a function of width \(O(2^d)\) that averages together “single-bladed sawtooths” in each of the \(2^d\) directions:</p>
\[g(x) = \frac{Q}{2^d} \sum_{v \in \{\pm 1\}^d} \chi(v) s(v^T x),\]
<p>where \(s(z)\) is a piecewise linear function with \(s(0) = 1\) and \(s(z) = 0\) for \(\lvert z\rvert\geq 1\) that can be represented with three ReLUs and \(Q\) is a normalization quantity to be fixed later.
<img src="/assets/images/2023-07-07-ahs23/single.jpeg" alt="" />
The key insight is that each \(s(v^T x)\) has an \(\mathcal{R}\)-norm of \(O(\sqrt{d})\) and will correctly label a roughly \(\frac1{\sqrt{d}}\) fraction of all inputs.
(For intuition, note that the probability that a \(\mathrm{Bin}(d, \frac12)\) random variable equals \(\frac{d}2\) is also roughly \(\frac1{\sqrt{d}}\).)
Then, we can let \(Q = \Theta(\sqrt{d})\), which ensures that \(\|g\|_{\mathcal{R}} = O(d)\).</p>
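<p>This construction is also easy to verify numerically for small \(d\). In the sketch below (my own code), each direction \(v\) contributes a three-ReLU bump weighted by \(\chi(v)\), so each term is a genuine ReLU network of \(x\); a \(Q = \Theta(\sqrt{d})\) scaling suffices, and for an exact check we can solve for the precise constant \(Q = (-1)^{d/2}\, 2^d / \binom{d}{d/2}\), which is \(\Theta(\sqrt{d})\) by Stirling:</p>

```python
import numpy as np
from itertools import product
from math import comb

def s(z):
    """Three-ReLU bump: s(0) = 1 and s(z) = 0 for |z| >= 1."""
    return max(0.0, z + 1) - 2 * max(0.0, z) + max(0.0, z - 1)

d = 4
Q = (-1) ** (d // 2) * 2 ** d / comb(d, d // 2)  # exact normalizer, Theta(sqrt(d))

def g(x):
    # Average a chi(v)-weighted bump over all 2^d directions v; on the cube,
    # s(v.x) is nonzero exactly when v.x = 0 (d even).
    return Q / 2 ** d * sum(
        np.prod(v) * s(np.dot(v, x)) for v in product([-1, 1], repeat=d)
    )

for x in product([-1, 1], repeat=d):
    assert abs(g(np.array(x)) - np.prod(x)) < 1e-9
```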
<p>If the gap between this construction and the full sawtooth ridge construction is surprising, here’s a little bit of intuition. We can think of each ReLU in the construction as having a “cost” given by its coefficient \(u^{(i)}\), and our aim is for each ReLU to have a low ratio of cost to the fraction of the dataset that it helps fit perfectly.</p>
<ul>
<li>Since a \(\frac{1}{\sqrt{d}}\) fraction of the samples have \(v^T x = 0\) for any fixed \(v \in \{\pm 1\}^d\), all of the ReLUs used in the “averaged single sawtooths” construction are cost-efficient.</li>
<li>Due to basic binomial concentration, an exponentially small fraction of samples have \(\lvert\vec{1}^T x\rvert \geq C\sqrt{d}\). This means that \(d - C\sqrt{d}\) of the ReLUs in the sawtooth ridge construction each incur the same high cost as the others while helping to fit extremely few samples.</li>
</ul>
<p>Thus, the magic of the averaging construction comes from the fact that we’re getting maximum usage out of each ReLU.</p>
<p><em>Note: In case the high width of the construction is off-putting, we have a construction for the \(\epsilon\)-approximate variational problem in Theorem 5 with width \(m = \tilde{O}(d^{3/2} /\epsilon^2)\).</em></p>
<p><strong>Third follow-up: Is \(O(d)\) the best possible \(\mathcal{R}\)-norm for parity interpolation?</strong></p>
<p>Theorem 6 of our paper concludes that this is the case: any \(g\) that even approximates the dataset to accuracy \(\frac12\) over \(L_2\) distance must have \(\mathcal{R}\)-norm at least \(\frac{d}{16}\).
The main step of the proof places an upper bound on the maximum correlation any single ReLU neuron can have with \(\chi\).</p>
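<p>While this isn’t the paper’s actual argument, we can at least see the phenomenon numerically. The sketch below (my own code) exactly computes the correlation of a single ReLU in the direction \(\vec1/\sqrt{d}\) with \(\chi\) by grouping inputs according to their number of \(-1\) coordinates, and the result is small for every bias:</p>

```python
import numpy as np
from math import comb

d = 12

def relu_parity_corr(b):
    """E_x[relu(<1/sqrt(d), x> + b) * chi(x)] over the uniform cube.

    An x with k coordinates equal to -1 has sum(x) = d - 2k and chi(x) = (-1)^k,
    so the expectation collapses to an alternating binomial sum."""
    return sum(
        comb(d, k) * (-1) ** k * max(0.0, (d - 2 * k) / np.sqrt(d) + b)
        for k in range(d + 1)
    ) / 2 ** d

max_corr = max(abs(relu_parity_corr(b)) for b in np.linspace(-5.0, 5.0, 201))
```

<p>For \(d = 12\), <code>max_corr</code> is about \(0.036\), even though each such ReLU has unit-norm inner weights; this only checks a single direction, but the paper’s bound holds uniformly over neurons.</p>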
<p>Taken together, these results give a clear quantitative separation on the parity dataset, demonstrating the suboptimality of the representational cost of ridge-function approximations:</p>
<ul>
<li>\(\inf\{\|g\|_{\mathcal{R}}: g(x_i) = y_i \ \forall i \in [n]\} = \Theta(d)\).</li>
<li>\(\inf\{\|g\|_{\mathcal{R}}: g(x_i) = y_i \ \forall i \in [n], \ g(x) = \phi(v^T x)\} = \Theta(d^{3/2})\).</li>
</ul>
<p><strong>Are we too fixated on parity?</strong></p>
<p>Maybe you think this is cool, but maybe you’re also concerned about the dependence on the parity dataset. After all, the parity function has all kinds of crazy symmetries, which give \(2^d\) different symmetric sawtooth functions achieving the optimal \(\mathcal{R}\)-norm among ridge functions.
Why shouldn’t there be something strange going on?</p>
<p>We had those concerns too, so we developed Section 5 of the paper as well, which translates several of the results to more generic sinusoidal functions on \(\{\pm 1\}^d\).
For these datasets, there’s no such symmetry, and there’s a much more natural ridge interpretation in the single direction of lowest frequency.
And yet, an average of truncated sawtooths of varying frequency is still the optimal thing to do.
To us, this presents a fundamental tradeoff in efficient representation between low intrinsic dimensionality and averaging together partial solutions.</p>
<h2 id="question-2-can-mathcalr-norm-minimization-learn-parities">Question 2: Can \(\mathcal{R}\)-norm minimization learn parities?</h2>
<p>So far, we’ve thought about solving the variational problem on a dataset that labels all \(2^d\) points of the form \((x, \chi(x))\).
Now, we shift our interest towards learning \(\chi_S\) given \(n\) independent samples \(\mathcal{D} = \{(x_i, \chi_S(x_i)): i \in [n]\}\) with \(x_i\) drawn uniformly from \(\{\pm 1\}^d\).
This is a more traditional learning setting, where the learning algorithm chooses the neural network \(g\) that solves the variational problem on \(\mathcal{D}\).</p>
<p>In a sense, we’re trying to analyze neural networks while avoiding analyzing gradient descent (which can be oh so ugly).
If we assume that our gradient-based optimization method (either due to explicit regularization or inductive bias) converges to the \(\mathcal{R}\)-norm minimizing interpolant, then we can assess its success at learning parities.</p>
<p>Note that we’ve now shifted our orientation from approximation to generalization.
The best possible sample complexity \(n\) we can hope for is \(n = O(d)\), since Gaussian elimination cannot be beaten for noiseless parities.</p>
<p><strong>The positive result</strong></p>
<p>Theorem 9 of our paper shows that with \(n = O(d^3 / \epsilon^2)\) samples, the solution \(g\) to the variational problem (composed with a “clipping function” that truncates its outputs to the interval \([-1, 1]\)) learns the parity to error at most \(\epsilon\).
This is a pretty straightforward bound that relies on Rademacher complexity techniques.
We’re able to characterize the expressive capacity of the family of functions produced by solving the variational problem by taking advantage of the fact that their norms are bounded.
From there, the derivation of generalization bounds follows standard ML techniques.</p>
<p><strong>The negative result</strong></p>
<p>On the other hand, Theorem 7 suggests that \(\mathcal{R}\)-norm minimizing interpolation will fail with substantial probability any time \(n \ll d^2\).
This means that the learning algorithm—while it still works for a polynomial number of samples—is suboptimal in terms of sample complexity.
What we prove is actually stronger than that: the \(L_2\) distance between the network \(g\) and any parity function will be nearly 1, which means that there’s no meaningful correlation between the two.</p>
<p>This result is a bit more methodologically fun, mainly because it draws on our approximation-theoretic results.</p>
<ul>
<li>We construct a neural network \(h\) that perfectly fits \(n \leq d^2\) random samples and has \(\|h\|_{\mathcal{R}} = \tilde{O}(n / d)\). This uses the same construction as a low-Lipschitz neural network presented by <a href="https://arxiv.org/abs/2009.14444" target="_blank">Bubeck, Li, and Nagaraj</a>, which uses a single ReLU to fit each individual sample.
With high probability, each of these ReLUs is active for at most one sample and has a low-weight coefficient.</li>
<li>This means that solving the variational problem must return a network \(g\) with \(\|g\|_{\mathcal{R}} \leq \tilde{O}(n / d)\).</li>
<li>However, our Theorem 6 implies that if \(\|g\|_{\mathcal{R}} \ll d\), then it can’t even correlate with parities \(\chi_S\) for \(\lvert S\rvert = \Theta(d)\). Hence, if \(n \ll d^2\), we have no hope of approximating parity from the \(n\) samples.</li>
</ul>
<p><strong>What about the gap?</strong></p>
<p>We have both upper and lower bounds, but the story is not complete, since there’s a \(d^2\) vs \(d^3\) gap on the minimum sample complexity needed to learn with \(\mathcal{R}\)-norm minimization.
We’re not sure if there’s a nice way to close the gap, but we think it’s worth noting that the BLN paper itself raises an open question about the minimum-Lipschitz network of bounded width that fits a set of samples.
The construction we draw on in the proof of our negative result might be suboptimal, in which case we might be able to improve our \(d^2\) bound with a more efficient construction.</p>
<h2 id="where-does-this-leave-us">Where does this leave us?</h2>
<p>So that’s the summary of this slightly strange paper, which considers generalization and approximation of a very specific learning problem on neural networks with a very specific kind of regularization.
This work ties in tangentially to several different avenues of neural network theory research: inductive bias, approximation separations, adaptivity, parity learning, ensembling, and intrinsic dimensionality.
Our intention was to elucidate the tension between optimizing for low-width representations (which project onto single directions) and low-norm representations (which compute averages over many different directions).
We think there’s certainly more work to be done within this convex hull, and here are some questions we’d love to see answered:</p>
<ul>
<li>Parity is a special target function to consider due to its simultaneous low intrinsic dimensionality and high degree of symmetry. We’d be interested in finding ways of defining low intrinsic dimensionality for functions with various symmetries that go beyond ridge or single-index properties. Perhaps we can get similarly strong approximation and generalization properties for functions of this form.</li>
<li>How central is averaging or ensembling to the story? Our min-\(\mathcal{R}\) norm parity network averages near-orthogonal partial solutions together. Given the wealth of literature on boosting and generally improving the caliber of learning algorithms via ensembling, it’s possible that there’s some kind of benefit that can be formalized by a min-norm characterization. (This vaguely reminds me of a <a href="/2021/07/16/mvss19.html" target="_blank">benign overfitting paper</a> by Muthukumar et al that looks at how the successes of minimum-norm interpolation can be analyzed by exploring how minimizing the \(\ell_2\) norm disperses weight in the direction of orthogonal linear features that perfectly fit the data.)</li>
<li>Of course, we’d love to see our generalization gap closed.</li>
</ul>
<p>With that, thank you for taking the time to read this blog post! I’m always happy to hear comments, questions, and feedback. And hopefully there’s another post before next year.</p>

<h1 id="how-hard-is-it-to-learn-an-intersection-of-halfspaces">How hard is it to learn an intersection of halfspaces? (COLT 2022 paper with Rocco, Daniel, and Manolis)</h1>

<p><em>Clayton Sanford, 2022-07-30</em></p>

<p><em>Whoops, six months just went by without a blog post.
The first half of 2022 has been busy, and I hope to catch up on writing about some of the other theory papers I’ve worked on, the climate modeling project I’m currently involved in at the Allen Institute for AI, random outdoor adventures and hikes in Seattle, and some general reflections on finishing my third year and having a “mid-PhD crisis.”
For now though, here’s a summary of <a href="https://arxiv.org/abs/2202.05096" target="_blank">a paper</a> that appeared at COLT (Conference on Learning Theory) a couple weeks ago in London.
This is similar to my <a href="/2021/08/15/hssv21.html" target="_blank">two</a> <a href="/2021/12/07/ash21.html" target="_blank">previous</a> summary posts, which are meant to break down papers I’ve helped write into more easily-digestible chunks. As always, questions and feedback are greatly appreciated.</em></p>
<p>Last August, I attended COLT 2021 for the first time to present my first paper as a graduate student, which I wrote along with my advisors,
<a href="http://www.cs.columbia.edu/~rocco/" target="_blank">Rocco Servedio</a> and <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>, and my fellow PhD student, <a href="http://www.cs.columbia.edu/~emvlatakis/" target="_blank">Manolis Vlatakis-Gkaragkounis</a>.
COLT was one of the first conferences to be in-person, so I was fortunate enough to spend a week in Boulder, CO and get to know other ML theory researchers over numerous talks and gorgeous hikes.
This year, the same set of authors sent a second paper to <a href="http://learningtheory.org/colt2021/" target="_blank">COLT 2022</a>—in part due to our desire for funded travel to London—and I was there last week to present it in-person.</p>
<p><img src="/assets/images/2022-07-01-hssv22/talk.JPG" alt="" style="width:50%" /></p>
<p>My talk was a little silly.
Since I presented it on July 4th, I had this whole American Revolution analogy, about the four of us thwarting the attempts of British soldiers to spy on the continental army. I’ll make a few references to that in this post.</p>
<p><img src="/assets/images/2022-07-01-hssv22/rev.png" alt="" /></p>
<p>While both this paper and my COLT 2021 paper are machine learning theory papers, they differ substantially in their areas of focus.
The paper last year (which I’ll call HSSV21) was about the approximation capabilities and limitations of shallow neural networks.
This year’s paper (HSSV22) is completely detached from neural networks and instead focuses on a “classical learning theory” question about how resource-intensive an algorithm must be to learn a seemingly-simple family of functions.
Indeed, this is different from all of the other papers I’ve worked on in grad school, as they all focus on neural networks and relevant learning theory in various ways: <a href="https://arxiv.org/abs/2102.02336">approximation</a>, <a href="https://proceedings.neurips.cc/paper/2021/hash/26d4b4313a7e5828856bc0791fca39a2-Abstract.html">over-parameterization</a>, <a href="https://arxiv.org/abs/2110.10295">approximation again</a>, and <a href="https://arxiv.org/abs/2206.05317">implicit biases</a>.</p>
<h1 id="what-does-this-paper-actually-do">What does this paper actually do?</h1>
<p>The easiest way to explain what the paper does is by breaking down its title–<strong>Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals</strong>–piece by piece and discussing what each piece means.</p>
<blockquote>
<p>Near-Optimal Statistical Query Lower Bounds for <strong>Agnostically Learning</strong> Intersections of Halfspaces with Gaussian Marginals</p>
</blockquote>
<p>To formalize a classification learning problem, we draw labeled samples \((x, y)\) with input \(x \in \mathbb{R}^n\) and label \(y \in \{\pm 1\}\) from some distribution \(\mathcal{D}\).
We are given a <em>training set</em> of \(m\) independent samples \((x_1, y_1), \dots, (x_m, y_m)\) from that distribution, and our goal is to infer some <em>hypothesis function</em> \(f: \mathbb{R}^n \to \{\pm1\}\) that not only correctly categorizes most (if not all) of the training sample but also <em>generalizes</em> to new data by having a loss \(L(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\mathbf{1}\{f(x) \neq y\}]\) that is close to zero.</p>
<p><strong>Realizable learning</strong> refers to a category of classification problems in which there is <em>guaranteed</em> to exist a hypothesis in some set \(\mathcal{H}\) that perfectly classifies all data. That is, there exists \(h \in \mathcal{H}\) such that \(y = h(x)\) always. We consider a realizable learning algorithm successful if the loss of its returned predictor \(f\) is not much larger than zero, that is \(L(f) \leq \epsilon\) for some small \(\epsilon > 0\).</p>
<p><em>Note that \(f\) is <strong>not</strong> necessarily contained in \(\mathcal{H}\). If we require \(f\) to be in \(\mathcal{H}\), then the problem is known as <strong>proper</strong>, but that is a separate type of learning problem that we don’t consider in the paper or this blog post.</em></p>
<p><strong>Agnostic learning</strong> is a more difficult regime where there may not exist such an \(h\) that perfectly classifies every sample. As a result, the best we can do is to obtain a loss that is not much larger than <em>the optimal loss</em> among all classifiers in \(\mathcal{H}\), which may be much larger than 0. That is, \(L(f) \leq \epsilon + \mathrm{OPT}\) where \(\mathrm{OPT} = \min_{h \in \mathcal{H}} L(h)\).</p>
<p>For simplicity in this blog post, we’re going to let \(\epsilon = 0.01\) throughout. We can prove everything we need for general \(\epsilon\); check out the paper if you’re interested in that.</p>
<p>In this paper, we focus exclusively on the agnostic setting.
Agnostic learning is a “harder” problem in that it strictly generalizes the realizable setting, and some hypothesis classes (such as parities) have efficient algorithms for the realizable case but none known for the agnostic case.</p>
<p>In particular, at least as many samples and as much time are needed to agnostically learn a hypothesis family \(\mathcal{H}\) as are needed to realizably learn it.</p>
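<p><em>To make the two regimes concrete, here’s a small numerical sketch (my own toy example, not from the paper) using a 1-D threshold class as a stand-in for halfspaces: with noiseless labels, some hypothesis in the class achieves zero training loss, while with 10% random label flips the best achievable loss is roughly the noise rate.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hypothesis class: 1-D thresholds h_t(x) = sign(x - t),
# a stand-in for the halfspace classes discussed below.
def h(t, x):
    return np.where(x >= t, 1, -1)

n_samples = 2000
x = rng.normal(size=n_samples)

# Realizable data: labels come from a hypothesis in the class.
y_realizable = h(0.3, x)

# Agnostic data: the same labels with 10% flipped at random.
flip = rng.random(n_samples) < 0.10
y_agnostic = np.where(flip, -y_realizable, y_realizable)

def empirical_loss(t, x, y):
    return np.mean(h(t, x) != y)

# ERM over a grid of thresholds (this learner happens to be proper).
grid = np.linspace(-3, 3, 601)
opt_realizable = min(empirical_loss(t, x, y_realizable) for t in grid)
opt_agnostic = min(empirical_loss(t, x, y_agnostic) for t in grid)
```

<p><em>Here <code>opt_realizable</code> is exactly zero, while <code>opt_agnostic</code> hovers around the 10% noise rate: the agnostic learner can only hope to compete with \(\mathrm{OPT}\), not with zero.</em></p>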
<p>For our themed talk, we recast the problem of learning as the challenge of spying on an army to determine where troops are concealed. In this case, the spy receives information about specific locations at random (e.g. “there is a soldier at this location; there is not a soldier at that location”) and must infer a good estimate about where the rest of the army is likely to be based on these samples.</p>
<blockquote>
<p>Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of <strong>Halfspaces</strong> with Gaussian Marginals</p>
</blockquote>
<p>A <strong>halfspace</strong> is a function that is positive on one side of a separating hyperplane and negative on the other. That is, \(h(x) = \mathrm{sign}(w^T x - b)\) for direction \(w \in \mathbb{R}^n\) and bias \(b \in \mathbb{R}\). The family of all halfspaces \(\mathcal{H}_1 = \{x \mapsto \mathrm{sign}(w^T x - b) : w, b\}\) is well-established in ML theory via linear separability.</p>
<p><img src="/assets/images/2022-07-01-hssv22/halfspace.jpeg" alt="" /></p>
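<p><em>In code, a halfspace is one line. The sketch below (my own, with arbitrary choices of \(w\) and \(b\)) also checks the standard Gaussian fact that \(\Pr[w^T x \geq b] = 1 - \Phi(b / \|w\|)\) when \(x \sim \mathcal{N}(0, I_n)\).</em></p>

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n = 5
w = rng.normal(size=n)   # direction (arbitrary choice)
b = 0.5                  # bias (arbitrary choice)

def halfspace(w, b, x):
    """h(x) = sign(w^T x - b), returning +1 or -1 for each row of x."""
    return np.where(x @ w - b >= 0, 1, -1)

x = rng.normal(size=(100000, n))
labels = halfspace(w, b, x)
frac_positive = (labels == 1).mean()

# For Gaussian x, Pr[w^T x >= b] = 1 - Phi(b / ||w||),
# since w^T x ~ N(0, ||w||^2).
z = b / np.linalg.norm(w)
expected = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
```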
<p>By combining the definition of a halfspace with those of the two learning regimes discussed above, we can see clearly how they differ.
Because the realizable setting requires the existence of some \(h \in \mathcal{H}_1\) that perfectly classifies the data, the problem of realizably learning halfspaces always involves having <em>linearly separable</em> data.
An algorithm realizably learns halfspaces if it returns a function (not necessarily a halfspace!) with loss at most 0.01.
This can be done with a maximum margin support vector machine (SVM) with \(\mathrm{poly}(n)\) sample and time complexity.</p>
<p>On the other hand, <em>agnostically</em> learning halfspaces introduces no dataset restrictions and asks the learner to pick a hypothesis with loss no more than 0.01 plus that of the best possible halfspace.</p>
<p><img src="/assets/images/2022-07-01-hssv22/realag.jpeg" alt="" /></p>
<blockquote>
<p>Near-Optimal Statistical Query Lower Bounds for Agnostically Learning <strong>Intersections of Halfspaces</strong> with Gaussian Marginals</p>
</blockquote>
<p>Likewise, an <strong>intersection of halfspaces</strong> is a function that evaluates to 1 if and only if \(k\) different halfspaces all are 1, i.e. \(h(x) = \min_{i \in [k]} \mathrm{sign}(w_i^T x - b_i)\) and \(\mathcal{H}_k = \{x \mapsto \min_{i \in [k]} \mathrm{sign}(w_i^T x - b_i): w_1, \dots, w_k, b_1, \dots, b_k\}\).</p>
<p><img src="/assets/images/2022-07-01-hssv22/int.jpeg" alt="" /></p>
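<p><em>The corresponding sketch for an intersection of \(k\) halfspaces (again my own toy code): the classifier outputs \(+1\) only when every halfspace does, so its positive region can be at most as large as that of any single halfspace.</em></p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 3
W = rng.normal(size=(k, n))   # one row per halfspace (arbitrary choice)
b = np.zeros(k)

def intersection_of_halfspaces(W, b, x):
    """h(x) = min_i sign(w_i^T x - b_i): +1 iff every halfspace says +1."""
    signs = np.where(x @ W.T - b >= 0, 1, -1)   # shape (num points, k)
    return signs.min(axis=1)

x = rng.normal(size=(100000, n))
y = intersection_of_halfspaces(W, b, x)
frac_positive = (y == 1).mean()
```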
<p>As will be discussed soon, the problem of learning intersections of halfspaces is also well-studied, in part due to the idea of the <strong>credit allocation problem.</strong>
The idea of this problem is that a common classification regime might be one where a positive classification is <em>only</em> obtained if a large number of Boolean conditions are all satisfied.
For instance, risk assessment decisions may rule out individuals who display <em>any</em> of a number of risk factors without notifying them of which condition they failed to meet.
If each of those decision rules is a simple (in this case, a halfspace) combination of one’s information, then the task of learning intersections of halfspaces is relevant to learning in this very specific case.</p>
<p>In addition, for our <em>extremely applicable</em> American Revolution application, the continental army is primarily capable of arranging its troops in \(\mathbb{R}^n\) space behind a collection of battle lines, a.k.a. within an intersection of halfspaces.
As such, it’s the goal of the British spies to learn the troop formations by trying to have an estimate that is nearly as good as the intersection of halfspaces that most accurately categorizes where soldiers are and where they are not.</p>
<blockquote>
<p>Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces <strong>with Gaussian Marginals</strong></p>
</blockquote>
<p>In the definition of agnostic and realizable learning, we did not restrict the data distribution \(\mathcal{D}\).
Now, we’re going to further simplify things by considering only one of the easiest cases, where the probability distribution is the \(n\)-dimensional multivariate standard Gaussian, \(\mathcal{N}(0, I_n)\).</p>
<p>Why do we focus on this simple case? Two quick things:</p>
<ol>
<li>Recall that when we’re proving lower bounds (or showing that solving a problem is highly resource-intensive), a result for an “easier” distribution is stronger. This means that there doesn’t exist an efficient algorithm even for this very natural data distribution, which is more interesting than showing the same for some very obscure distribution.</li>
<li>The problem of agnostically learning halfspaces is already known to be very difficult to solve in the case when distributional assumptions are not made. See <a href="http://web.cs.ucla.edu/~sherstov/pdf/opthshs.pdf" target="_blank">this paper</a> by Sherstov, which shows that even learning \(\mathcal{H}_2\) requires \(\exp(n)\) time (under cryptographic assumptions).</li>
</ol>
<blockquote>
<p>Near-Optimal <strong>Statistical Query</strong> Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals</p>
</blockquote>
<p>I’m going to tweak the learning model once again, this time replacing the reliance on explicit samples drawn from a probability distribution with the <strong>statistical query (SQ)</strong> model where the learning algorithm instead requests information about the learning problem in the form of queries and receives information with some accuracy guarantee.</p>
<p>That is, an SQ learning algorithm is one that makes \(M\) queries of the form \((q, \tau)\) where \(q: \mathbb{R}^n \times \{\pm1\} \to [-1, 1]\) is a bounded <em>query function</em> and \(\tau > 0\) is the <em>allowable error</em> or <em>tolerance</em>.
We say that an SQ algorithm is <em>efficient</em> if its query count \(M\) and inverse tolerance \(1 / \tau\) are polynomial in \(n\).</p>
<p>Our British intelligence system then—rather than obtaining individual data points about whether a soldier is present at a given location—asks its comprehensive network of spies a question about some aggregate property of the entire army and receives an accurate response.</p>
<p>To be clear, no one <em>actually</em> operates within the statistical query model, where we ask questions and obtain (potentially adversarial) answers; the whole point of ML is that predictions are purely based on data.
But it’s a useful model for <em>understanding</em> ML algorithms and their limitations.</p>
<p><strong>Why do we use the SQ model?</strong>
We analyze it because it’s similar in functionality to the standard sample-based model, and its limitations are much easier to understand mathematically.</p>
<p><strong>Most sample-based learning algorithms are SQ learning algorithms.</strong> Many learning algorithms (such as stochastic gradient descent and ordinary least squares (OLS) regression) depend on using a sufficiently large number of samples to estimate certain expectations and moments.
For instance, gradient descent involves using a <em>batch</em> of \(m'\) samples to estimate the expected gradient \(\mathbb{E}_{(x, y)}[\nabla_\theta \ell(y, h_\theta(x))]\) of the loss function \(\ell\) for a parameterized function \(h_\theta\) (typically a neural network) and using this to iteratively update the parameter \(\theta\).
The gradient estimate is essentially the empirical average, \(\frac{1}{m'}\sum_{i=1}^{m'} \nabla_\theta \ell(y_i, h_\theta(x_i))\) (it may also include other regularization terms or a momentum term, but we’ll ignore that now).</p>
<p>We can implement a similar algorithm in the SQ model by replacing each gradient computation with the query \(q_j(x, y)= \nabla_{\theta_j} \ell(y, h_\theta(x))\) for each \(j \in [n]\) and a small tolerance \(\tau\).
Because both algorithms provide similar estimates of the same quantity, we can use the SQ model to have the same effect as the sample-based model.
There are a few notable exceptions of sample-based learning algorithms that <em>cannot</em> be implemented in the SQ model, such as the Gaussian elimination algorithm for learning parities; indeed <em>no</em> SQ algorithm can learn parities.
However, Gaussian elimination does not work when the data are noisy (i.e. a small fraction of samples are labeled incorrectly), and cryptographic hardness results suggest that there is no efficient sample-based algorithm for learning noisy parities.
<!-- As a result, one can observe that an SQ algorithm works when there already exists a sample-based algorithm that is somewhat noise tolerant. --></p>
<p><strong>Every SQ learning algorithm can be implemented in the sample-based model.</strong>
How? Suppose we have an SQ learning algorithm with \(M\) queries of tolerance \(\tau\).
We can simulate the \(i\)th query \(q_i\) with \(m_i = 2 \ln(100M) / \tau^2\) samples \((x_1, y_1), \dots, (x_{m_i}, y_{m_i})\) by outputting \(\hat{Q}_i := \frac{1}{m_i} \sum_{j=1}^{m_i} q_i(x_j, y_j)\).
By Hoeffding’s inequality, with probability at least \(1 - \frac{1}{100M}\),</p>
\[|\mathbb{E}[q_i(x, y)] - \hat{Q}_i| \leq \sqrt{2 \ln (100M) / m_i} = \tau.\]
<p>As a result, the outcome \(\hat{Q}_i\) of every query \(q_i\) is within the desired tolerance \(\tau\) with probability at least \(0.99\), so the sample-based algorithm successfully simulates the SQ algorithm with \(m := M \cdot m_i = 2M \ln(100M) / \tau^2\) samples.
If \(M\) and \(\frac{1}\tau\) are polynomial, then so is the total number of samples \(m\).
The run-time of the algorithm is additionally polynomial.</p>
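<p><em>Here’s a quick numerical sketch of this reduction (my own toy setup: a noisy 1-D sign function and a single bounded query). The empirical average of the query over \(m_i = 2\ln(100M)/\tau^2\) samples lands within \(\tau\) of the true expectation, just as Hoeffding promises.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sq_query(q, sampler, tau, M):
    """Answer one statistical query E[q(x, y)] to within tolerance tau
    using m_i = 2 ln(100 M) / tau^2 samples, as in the reduction above."""
    m_i = int(np.ceil(2 * np.log(100 * M) / tau**2))
    x, y = sampler(m_i)
    return q(x, y).mean()

# Toy distribution: x ~ N(0, 1), y = sign(x) with 10% label noise.
def sampler(m):
    x = rng.normal(size=m)
    y = np.where(x >= 0, 1, -1)
    flip = rng.random(m) < 0.10
    return x, np.where(flip, -y, y)

# Query functions must be bounded in [-1, 1], so clip x before correlating.
# True value is 0.8 * E[min(|x|, 1)] ~ 0.505.
q = lambda x, y: np.clip(x, -1, 1) * y

tau, M = 0.01, 100
answer = simulate_sq_query(q, sampler, tau, M)
```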
<blockquote>
<p>Near-Optimal Statistical Query <strong>Lower Bounds</strong> for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals</p>
</blockquote>
<p>In our analogy, we’re more interested in helping the revolutionaries defeat the British by outwitting their master spy web.
That is, we want to understand the limitations of their ability to query their spy network for information about where American troops are located.
In doing so, the Americans can understand what kind of troop formations to consider in order to make it impossible for the Brits to detect their soldiers.</p>
<p>As mentioned before, an SQ algorithm is efficient if \(\max(M, 1/\tau) = \mathrm{poly}(n)\).
Hence, we can show hardness results in the SQ model by showing that any algorithm that solves a problem (in our case, the problem of agnostically learning intersections of halfspaces under Gaussian marginals) requires that either \(M\) or \(1 / \tau\) grows super-polynomially in \(n\).</p>
<p>The SQ model is particularly useful for hardness results because it’s easier to prove limitations of time and sample complexity than in the sample-based model.
Consider the above construction, where the sample-based model is used to implement an SQ algorithm.
Because the algorithm simulates every query, if \(M\) is exponential, then the time complexity of the sample-based algorithm must also be exponential.
Likewise, if \(1 / \tau\) is exponential, then the number of samples \(m_i\) needed for each query is exponential, which corresponds to exponential sample complexity.
Thus, hardness results in the SQ model roughly imply that there won’t exist any other “reasonable” learning algorithm that solves the problem with polynomial time or samples.</p>
<p>This is nice because it’s typically very hard to prove lower bounds on runtime for learning algorithms in the sample-based model.
It’s mathematically simpler to prove <em>information-theoretic bounds</em> on sample complexity for certain cases, but these only apply to very hard learning problems where no algorithm with a polynomial number of samples exists, let alone one with a polynomial runtime.
However, there are conjectured problems with <em>computational-statistical gaps</em>, where the problem can be solved with a polynomial number of samples at a very large time complexity cost.
Proving these time complexity limitations is considerably more difficult in the sample-based model; the main known approach is to use <em>cryptographic hardness</em> assumptions (like the <em>unique shortest vector problem</em> used for intersections of halfspaces in <a href="https://www.cs.utexas.edu/~klivans/crypto-hs.pdf" target="_blank">this Klivans and Sherstov paper</a>), but these are complicated to employ and rely on assumptions that are likely but not certainly true.</p>
<p>SQ-based lower bounds provide a good way to suggest that a learning problem will not be able to be learned without a large runtime.
Since the model is not a perfect correspondence to the sample-based model, SQ-hardness is not a guarantee that there will be <em>no</em> efficient algorithm, but it does promise that most nice and robust algorithms will be unable to solve the problem with an efficient runtime.</p>
<blockquote>
<p><strong>Near-Optimal</strong> Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals</p>
</blockquote>
<p>Putting together all of the past snippets, we conclude that the theorem this paper proves will be of the following form:</p>
<p><em><strong>Theorem:</strong> For sufficiently large \(n\) and \(k \leq ???\), any statistical query algorithm with \(M\) queries of tolerance \(\tau\) that agnostically learns \(\mathcal{H}_k\) over samples with marginal distribution \(\mathcal{N}(0, I_n)\) requires either \(M \geq ???\) or \(\tau < ???\).</em></p>
<p>But what should replace the \(???\)’s? To get some intuition, we ask what it means to be “nearly optimal.”</p>
<p>Let’s first start with a <em>positive result</em>, or an algorithm showing that something is possible for this problem.
In 2008, <a href="https://www.cs.cmu.edu/~odonnell/papers/perimeter.pdf" target="_blank">Klivans, O’Donnell, and Servedio</a> published a sample-based algorithm that agnostically learns intersections of \(k\) halfspaces in \(n\) dimensions with \(n^{O(\log k)}\) time and sample complexity.
We’ll refer to this as “KOS08.”
In the talk, I made the three of them former redcoats from the Seven Years War who developed British espionage techniques for use against their French enemies.</p>
<p><img src="/assets/images/2022-07-01-hssv22/kos.png" alt="" /></p>
<p>Critically, their algorithm can be implemented in the SQ model with \(M = n^{O(\log k)}\) and \(\tau = n^{-O(\log k)}\), which means the SQ lower bounds to be discussed limit improvement on this algorithm.</p>
<p>How does their approach work? They proceed in roughly three steps:</p>
<ol>
<li>They show that any intersection of \(k\) halfspaces has a Gaussian surface area of at most \(O(\sqrt{\log k})\). (Because an intersection of \(k\)-halfspaces is a convex set, we can think of the function as an \(n\)-dimensional polytope with an \((n-1)\)-dimensional surface. The <em>Gaussian surface area</em> weights the surface according to the multivariate Gaussian probability distribution; surfaces closer to the origin then represent more “area” than surfaces further away.)</li>
<li>They show that any Boolean function \(f: \mathbb{R}^n \to \{\pm 1\}\) with Gaussian surface area at most \(s\) can be \(L_1\)-approximated by a polynomial \(p\) of degree \(O(s^2)\). (That is,
\(\|f - p\|_1 = \mathbb{E}_{x \sim \mathcal{N}(0, I_n)}[|f(x) - p(x)|] \leq 0.01\).)</li>
<li>The actual algorithm consists of performing \(L_1\) polynomial regression, which finds the degree-\(d\) polynomial that best fits the training data with \(n^{O(d)}\) samples and time.</li>
</ol>
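<p><em>The regression in step 3 can be sketched numerically. The toy below (my own construction) fits a low-degree polynomial to labels from an intersection of two halfspaces in \(\mathbb{R}^2\) and classifies by its sign. Two caveats: KOS08 use \(L_1\) regression, while this sketch substitutes ordinary least squares for simplicity, and the dimensions here are tiny. The number of monomial features is \(n^{O(d)}\), which is where the complexity bound comes from.</em></p>

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

def poly_features(x, d):
    """All monomials of degree <= d in the columns of x."""
    m, n = x.shape
    feats = [np.ones(m)]
    for deg in range(1, d + 1):
        for idx in itertools.combinations_with_replacement(range(n), deg):
            feats.append(np.prod(x[:, idx], axis=1))
    return np.column_stack(feats)

# Target: intersection of 2 halfspaces in R^2 (the positive quadrant).
def target(x):
    return np.where((x[:, 0] >= 0) & (x[:, 1] >= 0), 1, -1)

x_train = rng.normal(size=(20000, 2))
y_train = target(x_train)

# Degree-d polynomial regression (least squares as a stand-in for the
# L1 regression of KOS08), then classify by the sign of the fit.
d = 4
A = poly_features(x_train, d)
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

x_test = rng.normal(size=(20000, 2))
pred = np.where(poly_features(x_test, d) @ coef >= 0, 1, -1)
accuracy = (pred == target(x_test)).mean()
```

<p><em>The sign of the fitted polynomial classifies the bulk of the Gaussian correctly; the errors concentrate near the boundary of the quadrant.</em></p>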
<p>Hence, an intersection of \(k\) halfspaces can be approximated by a polynomial of degree \(O(\log(k))\), so \(L_1\) polynomial regression with \(n^{O(\log k)}\) samples will find a target function that satisfies the agnostic learning problem.</p>
<p>In addition, there is a known lower bound that was presented at COLT 2021 by Diakonikolas, Kane, Pittas, and Zarifis (<a href="https://arxiv.org/pdf/2102.04401.pdf" target="_blank">DKPZ21</a>), which shows that any SQ algorithm learning this problem for \(k = O(n^{0.1})\) requires either \(M \geq 2^{n^{0.1}}\) or \(\tau \leq n^{-\tilde\Omega(\sqrt{\log k})}\).
In the talk, these were French soldiers, who helped the Americans fight the British and put their understanding of British espionage to good use.</p>
<p><img src="/assets/images/2022-07-01-hssv22/dkpz.png" alt="" /></p>
<p>How did they do it? We’ll discuss this more later on, but here are the brief steps:</p>
<ol>
<li>[Theorem 3.5] They show that some <strong>\(k\)-dimensional</strong> intersection of \(k\) halfspaces \(f:\mathbb{R}^k \to \{\pm1\}\) cannot be weakly approximated by any polynomial of degree \(O(\sqrt{\log k})\), which makes it <strong>approximately resilient</strong>.</li>
<li>[Theorem 1.4] They consider different projections of that function from \(n\)-dimensional space, \(F_W(x) = f(W x)\) for \(W \in \mathbb{R}^{k \times n}\) with orthonormal rows.
Two randomly selected such matrices \(W_1\) and \(W_2\) will yield \(F_{W_1}\) and \(F_{W_2}\) that are <em>nearly orthogonal</em> (see <a href="/2021/07/16/orthogonality.html" target="_blank">my notes on orthogonality</a> for an overview), i.e. \(\mathbb{E}_{x \sim \mathcal{N}(0, I_n)}[F_{W_1}(x) F_{W_2}(x)] \approx 0\).
Then, there exists a collection of \(n\)-dimensional functions with a high <em>SQ dimension</em> (<a href="http://vtaly.net/papers/Kearns93-2017.pdf" target="_blank">reference</a>, pg 4). By standard results about SQ dimension, these functions are hard for any SQ algorithm to distinguish without either many queries or extremely accurate queries.</li>
</ol>
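<p><em>The near-orthogonality in step 2 is easy to see numerically. In this sketch (my own; I use a simple mean-zero function of \(k = 2\) variables rather than DKPZ21’s hard instance), two independent random rotations give functions whose correlation under the Gaussian is close to zero.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2

def random_rotation(n, k, rng):
    """A k x n matrix with orthonormal rows."""
    q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return q.T

# A mean-zero function of k = 2 variables; a stand-in for the hard f.
def f(z):
    return np.sign(z[:, 0]) * np.sign(z[:, 1])

W1 = random_rotation(n, k, rng)
W2 = random_rotation(n, k, rng)

# Monte Carlo estimate of E[F_{W1}(x) F_{W2}(x)] over x ~ N(0, I_n).
x = rng.normal(size=(40000, n))
correlation = np.mean(f(x @ W1.T) * f(x @ W2.T))
```

<p><em>Each function perfectly correlates with itself, yet two independent rotations are nearly uncorrelated: the hallmark of a class with high SQ dimension.</em></p>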
<p>By comparing the DKPZ21 result with the KOS08 result, one can observe a substantial gap between the two with respect to \(k\): KOS08 asserts that it’s possible to learn intersections of \(k\) halfspaces with tolerance \(n^{-O(\log k)}\), while DKPZ21 asserts that learning is impossible unless the tolerance is at most \(n^{-\tilde\Omega(\sqrt{\log k})}\).
This leaves open the question: What is the correct tolerance? Can the algorithm of KOS08 be improved to yield a better one that doesn’t require queries to be quite so accurate (and the resulting sample-based algorithm to require fewer samples)? Or is there a stronger lower bound than DKPZ21’s that indicates that the KOS08 algorithm is indeed optimal?</p>
<p><img src="/assets/images/2022-07-01-hssv22/sep1.jpeg" alt="" /></p>
<h2 id="so-what-do-we-actually-do">So what do we actually do?</h2>
<p>As one expects by looking at the title of our paper, we give a stronger lower bound and prove the near-optimality of the KOS08 algorithm.</p>
<p><em><strong>Theorem:</strong> For sufficiently large \(n\) and \(k \leq 2^{O(n^{0.24})}\), any statistical query algorithm with \(M\) queries of tolerance \(\tau\) that agnostically learns \(\mathcal{H}_k\) over samples with marginal distribution \(\mathcal{N}(0, I_n)\) requires either \(M \geq 2^{\Omega(n^{0.1})}\) or \(\tau \leq n^{-\tilde\Omega(\log k)}\).</em></p>
<p>This result is almost identical to that of DKPZ21, save for the different dependence on \(\tau\).
As a result, the tolerance of the KOS08 algorithm is optimal up to \(\log\log k\) factors in the exponent.</p>
<p><img src="/assets/images/2022-07-01-hssv22/sep2.jpeg" alt="" /></p>
<p>A few notes and caveats:</p>
<ul>
<li>The similarity to the theorem of DKPZ21 is no coincidence; we use nearly the same method as they do, except we have a stronger bound on the approximate resilience of some intersection of \(O(k)\) halfspaces.</li>
<li>Our results are actually the combination of two theorems: one which gives an explicit construction of the “hard” intersection of halfspaces to learn but requires \(k \leq O(n^{0.49})\) and the other with a randomized construction. We’ll focus primarily on the former for the rest of this blog post… but you should read the paper to learn about how we adapted a construction from a COLT 2021 paper by <a href="http://proceedings.mlr.press/v134/de21a/de21a.pdf" target="_blank">De and Servedio</a> for the second!</li>
<li>We don’t include the dependence of the target accuracy \(\epsilon\) in the above theorem, but it can be added by using a slight modification to the family of intersections of halfspaces that we consider. This involves a result from a 2002 paper by Ganzburg that establishes the \(L^1\) polynomial approximation properties of a single halfspace.</li>
</ul>
<h3 id="okay-but-who-actually-cares-about-this">Okay, but who actually cares about this?</h3>
<p>So yeah, this is a pretty abstract and theoretical result that is very far from neural networks or modern ML practice.
But there are a few nice things about it that ideally should make this interesting to theoreticians, practitioners, and historians alike. (Okay, mostly just theoreticians.)</p>
<ul>
<li>As mentioned before, there’s a connection to this <em>credit allocation problem</em>, where a large number of simple factors may individually be responsible for informing us about the outcome of the prediction.
For instance, one might be denied a loan for failing one of many risk factors without being notified of the correct reason why.
If one wanted to learn the model purely from labels—a collection of acceptances and rejections without rationales—this paper suggests that the problem is very hard if there may be a large number of aggregated factors, even if the factors are linear thresholds and the data follows a simple Gaussian distribution.</li>
<li>In general, proofs of optimality are nice because they tell us (1) that it’s not worth investing further intellectual resources in trying to improve a solution and (2) that there aren’t “hard instances” that the current algorithm fails to handle well.</li>
<li>Our proof technique (which you’ll see shortly) seems not to be widely used in this space: it uses functional analysis to gradually transform one function into a similar one that has certain desirable properties.</li>
</ul>
<h2 id="how-do-we-prove-the-result">How do we prove the result?</h2>
<p>To discuss how our proof works, we compare the analysis of DKPZ21 to that of a paper that does something similar by Dachman-Soled, Feldman, Tang, Wan, and Wimmer from 2014 (<a href="https://arxiv.org/pdf/1405.5268.pdf" target="_blank">DFTWW14</a>).
In the talk, these folks were expert Prussian soldiers, like Baron von Steuben, who trained the fledgling American army and passed along modern techniques in war/boolean function analysis.</p>
<p><img src="/assets/images/2022-07-01-hssv22/prussia.png" alt="" /></p>
<p>Like DKPZ21 and HSSV22 (our paper), they prove a lower bound against agnostically learning a family of functions in the SQ model.
Unlike us, they learn functions over the boolean cube, of the form \(\{\pm 1\}^n \to \{\pm1\}\).
And instead of showing that the family of rotations of a single function \(f\) is hard to learn (\(\mathcal{H} = \{f(W x): W \in \mathbb{R}^{k \times n}\}\)), they consider <em>juntas</em> of \(f\), or functions that act on only a subset of the variables (\(\mathcal{H} = \{f(x_S): S \subset [n], |S| = k\}\), where \(x_S = (x_{s_1}, x_{s_2}, \dots, x_{s_k})\) for \(S = \{s_1, s_2, \dots, s_k\}\)).</p>
<p><em>Note: Our work fits into a cottage industry of results translating facts about monotone functions on the boolean cube to convex sets in Gaussian space. There’s a pretty sophisticated analogy between the two that is summarized well in the intro of this <a href="https://arxiv.org/pdf/2109.03107.pdf" target="_blank">“Convex Influencers” paper</a> by Rocco, Anindya, and Shivam Nadimpalli (<a href="https://www.instagram.com/bagel_simp/" target="_blank">bagel_simp</a>).</em></p>
<p>Their result proceeds in analogous steps to that of DKPZ21:</p>
<ol>
<li>[Theorem 1.6] They show that a particular \(k\)-dimensional boolean function \(f: \{\pm1\}^k \to \{\pm1\}\)—in this case, a read-once DNF called \(\mathsf{Tribes}\), an OR-of-ANDs with no repeated variables in clauses, like \(f(x) = (x_1 \wedge x_2) \vee (x_3 \wedge \neg x_5) \vee x_7\)—cannot be approximated accurately by any polynomial of degree \(\tilde{O}(\log k)\), or that \(f\) is approximately \(\tilde{O}(\log k)\)-resilient.</li>
<li>[Theorem 2.3] They consider the family of juntas of that function \(\mathcal{H} = \{f(x_S): S \subset [n], \lvert S\rvert = k\}\) and show that this family contains a large number of nearly orthonormal functions. This means the class has a high SQ dimension and is hence hard to learn without SQ queries of tolerance \(n^{-\tilde\Omega(\log k)}\).</li>
</ol>
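<p><em>For concreteness, here’s the example read-once DNF from step 1 in code, along with a junta embedding it into \(n = 20\) variables on an arbitrary (made-up) coordinate subset \(S\).</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)

# The example read-once DNF from above, over {+1, -1} inputs
# (+1 = True, -1 = False): f(x) = (x1 AND x2) OR (x3 AND NOT x5) OR x7.
def tribes_example(x):
    c1 = (x[:, 0] == 1) & (x[:, 1] == 1)
    c2 = (x[:, 2] == 1) & (x[:, 4] == -1)
    c3 = x[:, 6] == 1
    return np.where(c1 | c2 | c3, 1, -1)

# A junta applies f to a fixed subset S of the n coordinates.
def junta(f, S):
    return lambda x: f(x[:, S])

n = 20
S = np.array([3, 7, 8, 11, 14, 15, 19])   # |S| = 7, arbitrary choice
g = junta(tribes_example, S)

# The three clauses use disjoint variables, so
# Pr[f = 1] = 1 - (3/4)(3/4)(1/2) = 23/32 and E[f] = 7/16.
x = rng.choice([-1, 1], size=(100000, n))
mean_value = g(x).mean()
```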
<p>If we contrast these steps with those of DKPZ21, a few things stand out:</p>
<ul>
<li>The approximate resilience statement in (1) of DFTWW14 shows that \(\mathsf{Tribes}\) is inapproximable by polynomials of higher degree than (1) of DKPZ21 shows for the intersection of halfspaces (\(\tilde{O}(\log k)\) versus \(O(\sqrt{\log k})\)). One way to improve DKPZ21 is therefore to prove that their function is even harder to approximate with polynomials than they proved.</li>
<li>(2) of DKPZ21 is a more involved proof due to the continuity of the Gaussian setting, which involves some tricky maneuvers like the use of an infinite linear program.</li>
</ul>
<p>Our result works by picking and choosing the best of each: we draw inspiration from the methods of DFTWW14 to improve the resilience bound of DKPZ21 to \(\tilde\Omega(\log k)\), while using (2) of DKPZ21 right out of the box.
For the rest of this proof description, I’ll outline the basics of how we did that by defining approximate resilience formally, introducing our target intersection of halfspaces \(f\), and showing how we prove that \(f\) is approximately resilient.</p>
<h3 id="what-is-approximate-resilience">What is approximate resilience?</h3>
<p>A function \(f\) is <strong>approximately \(d\)-resilient</strong> if it is similar to another bounded function \(g\) that is orthogonal to all polynomials of degree at most \(d\). Put more concretely:</p>
<p><em><strong>Definition:</strong> \(f: \mathbb{R}^k \to \{\pm1\}\) is \(\alpha\)-approximately \(d\)-resilient if there exists some \(g: \mathbb{R}^k \to [-1, 1]\) such that \(\|f - g\|_1 = \mathbb{E}_{x \sim \mathcal{N}(0, I_k)}[\lvert f(x) - g(x)\rvert] \leq \alpha\) and \(\langle g, p\rangle = \mathbb{E}[g(x) p(x)] = 0\) for any polynomial \(p\) of degree at most \(d\).</em></p>
<p>For simplicity, we’ll consider the case where \(\alpha = 0.01\) for the remainder of this post.</p>
<p>Intuitively, one can think of \(f\) being approximately resilient to a degree \(d\) if no \(d\)-degree polynomial can correlate with it by more than a negligible amount.
This connection is made formal in Proposition 2.1 of DKPZ21 and is critical for our results.</p>
<p>The definition suggests a relatively simple way of proving the approximate resilience of a function \(f\): Construct some \(g\) that is bounded, well-approximates \(f\), and is completely uncorrelated with all low-degree polynomials.</p>
<h3 id="which-function-do-we-consider">Which function do we consider?</h3>
<p>Our argument considers a specific intersection of \(O(k)\)-halfspaces over \(\mathbb{R}^k\), shows that it’s approximately resilient, and concludes by (2) of DKPZ21 that rotations of this function in \(\mathbb{R}^n\) comprise a family of functions that are hard to learn/hard to distinguish.
The particular function we consider is the <em>cube</em> function \(\mathsf{Cube}(x) = \mathrm{sign}(\theta - \max_{i \in [k]} \lvert x_i\rvert)\).
Put simply, this denotes a hypercube of width \(2\theta\) centered at the origin; the function evaluates to 1 inside the cube and -1 outside.
This cube can be written as an intersection of \(2k\) different axis-aligned halfspaces.</p>
<p><img src="/assets/images/2022-07-01-hssv22/cube.jpeg" alt="" /></p>
<p>What is \(\theta\)? We set \(\theta\) to ensure that \(\mathbb{E}[\mathsf{Cube}(x)] = 0\), which ends up meaning that \(\theta = \Theta(\sqrt{\log k})\).
(Why do we need this expectation to be zero? If it were not close to zero, then \(\mathsf{Cube}\) could be weakly approximated by some constant function, which would immediately make it impossible for the function to be approximately \(d\)-resilient for any \(d\).)</p>
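<p><em>We can compute \(\theta\) numerically (a sketch of my own; the paper only needs the asymptotic \(\theta = \Theta(\sqrt{\log k})\)). Setting \(\mathbb{E}[\mathsf{Cube}] = 0\) means \(\Pr[\max_i \lvert x_i\rvert \leq \theta] = 1/2\), i.e. \((2\Phi(\theta) - 1)^k = 1/2\), which we can solve by bisection.</em></p>

```python
import math

import numpy as np

def cube_mean(theta, k):
    """E[Cube(x)] for x ~ N(0, I_k): 2 * P(max_i |x_i| <= theta) - 1."""
    phi = 0.5 * (1 + math.erf(theta / math.sqrt(2)))   # standard normal CDF
    return 2 * (2 * phi - 1) ** k - 1

def calibrate_theta(k, lo=0.0, hi=20.0, iters=60):
    """Bisect for the theta with E[Cube] = 0."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cube_mean(mid, k) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# theta grows slowly with k, consistent with theta = Theta(sqrt(log k)).
thetas = {k: calibrate_theta(k) for k in (10, 100, 1000, 10000)}

# Sanity check by Monte Carlo for k = 100: the cube contains half the mass.
rng = np.random.default_rng(0)
k = 100
x = rng.normal(size=(50000, k))
cube_values = np.where(np.abs(x).max(axis=1) <= thetas[k], 1, -1)
```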
<h3 id="how-do-we-establish-the-approximate-resilience-of-cube">How do we establish the approximate resilience of Cube?</h3>
<p>In this part of the post, let \(f^{\leq d}\) denote the projection of \(f\) onto the span of polynomials of degree at most \(d\) (its low-degree Hermite components), and let \(f^{> d}\) represent the rest, so \(f(x) = f^{\leq d}(x) + f^{> d}(x)\).</p>
<p>The approximate resilience bound happens in two stages:</p>
<ul>
<li>[Lemma 9] We show that \(\mathsf{Cube}\) has small low-degree Hermite coefficients. Put differently, the total correlation between the function and all low-degree polynomials is small. We concretely show that \(\|\mathsf{Cube}^{\leq d}\|^2 \leq \frac{d}{k} O(\log k)^d\). If we let \(d = c \log(k) / \log\log k\) for sufficiently small \(c\), then \(\|\mathsf{Cube}^{\leq d}\|^2 \leq \frac{1}{k^{0.99}}\).</li>
<li>[Lemma 10] We show that any \(f\) with small low-degree Hermite coefficients is approximately resilient. Concretely, if \(\|f^{\leq d}\|^2 \leq \frac{1}{k^{0.99}}\), then \(f\) is approximately \(\Omega(d)\)-resilient.</li>
</ul>
<p>Put together, the two immediately give us a resilience guarantee for \(\mathsf{Cube}\) that mirrors that of \(\mathsf{Tribes}\) from (1) of DFTWW14. This then provides the desired result from (2) of DKPZ21.</p>
<p><img src="/assets/images/2022-07-01-hssv22/tree.jpeg" alt="" /></p>
<p>The proof of Lemma 9 involves some meticulous analysis of the polynomial coefficients of \(\mathsf{Cube}\), courtesy mainly of Daniel. I’ll refer you to the paper, but this part rests on (1) considering \(\mathsf{Cube}\) as a product of \(k\) interval functions, (2) exactly computing the Hermite coefficients of each interval, and (3) bounding the combined coefficients by some dense summations and applications of Stirling’s inequality.</p>
<p>The proof of Lemma 10 has a little bit of intuition that can likely be conveyed here.
We can think about the problem from the lens of function approximation: Can we use \(f\) to create some function \(g\) that is bounded, approximates \(f\), and is uncorrelated with all low-degree polynomials?
We’ll make several attempts to convert \(f\) to some \(g\).</p>
<p><img src="/assets/images/2022-07-01-hssv22/p1.jpeg" alt="" /></p>
<h4 id="attempt-1-drop-low-degree-polynomials">Attempt #1: Drop low-degree polynomials</h4>
<p>Let \(g := f^{> d}\). Because \(f\) has small low-degree coefficients, \(g\) closely approximates \(f\). By definition, \(g\) is orthogonal to all polynomials of degree at most \(d\). Yay! But, \(g\) is not a bounded function; as the below image indicates, subtracting off \(f\)’s correlation with a linear function causes \(g\) to approach \(\infty\) (or \(-\infty\)) away from the origin.</p>
<p><img src="/assets/images/2022-07-01-hssv22/p2.jpeg" alt="" /></p>
<h4 id="attempt-2-drop-and-threshold">Attempt #2: Drop and threshold</h4>
<p>After dropping the low-degree terms, we can re-impose boundedness by setting the function to zero whenever it grows too large: \(g(x) = f^{>d}(x) 1\{\lvert f^{\leq d}(x)\rvert \leq \eta\}\), for some threshold \(\eta > 0\).
If \(\eta\) is large, then \(g\) is more similar to \(f\), but could take much larger values.
If \(\eta\) is small, then \(g\) cannot be guaranteed to be a good approximation of \(f\), despite its boundedness.</p>
<p>What’s the problem? \(f^{>d}\) may be orthogonal to low-degree polynomials, but multiplying it by this threshold may kill that orthogonality. We gained boundedness, but we lost orthogonality.</p>
<p><img src="/assets/images/2022-07-01-hssv22/p3.jpeg" alt="" /></p>
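<p>This failure is easy to see numerically. In the toy example below (my own, one-dimensional), \(\mathrm{He}_2(x) = x^2 - 1\) is exactly orthogonal to constants under the standard Gaussian, but multiplying it by a threshold indicator reintroduces a nonzero degree-0 component:</p>

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# A tiny 1-D check of this failure mode (my own toy example, not the
# paper's): He_2(x) = x^2 - 1 is orthogonal to constants under N(0,1),
# but multiplying by a threshold indicator reintroduces a degree-0 part.
x, w = hermegauss(200)                     # integrates against exp(-x^2/2)
w = w / np.sqrt(2 * np.pi)                 # ...now an expectation over N(0,1)

he2 = hermeval(x, [0.0, 0.0, 1.0])         # values of x^2 - 1 at the nodes
before = float(np.sum(w * he2))            # correlation with He_0 = 1
after = float(np.sum(w * he2 * (np.abs(x) <= 2)))
print(before)                              # essentially 0
print(after)                               # clearly nonzero (around -0.2)
```
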
<h4 id="attempt-3-drop-and-threshold-and-drop">Attempt #3: Drop and threshold and drop</h4>
<p>What if we just dropped the low-degree terms once again? \(g = [f^{>d} 1\{\lvert f^{\leq d}\rvert \leq \eta\}]^{> d}\).
We can use this to restore orthogonality to the previous function.
This is precisely what DFTWW14 uses; they can get a reasonable bound on the maximum value of \(g\) because their functions are supported on a finite domain.
However, we’re once again faced with the issue that \(g\) loses its boundedness by this additional dropping.</p>
<p><img src="/assets/images/2022-07-01-hssv22/p4.jpeg" alt="" /></p>
<h4 id="attempt-4-drop-and-thresholdinfty">Attempt #4: (Drop and threshold)\(^\infty\)</h4>
<p>The previous attempts indicate that if we keep dropping and thresholding the function for carefully chosen thresholds \(\eta\), then we’ll gradually approach an idealized \(g\) that satisfies all of the desired conditions: boundedness, orthogonality, and similarity to \(f\). We do so by defining \(f_0 := f\) and \(f_{i+1} = f_i^{>d} 1\{\lvert f_i^{\leq d}\rvert \leq \eta_i\}\), and letting \(g = \lim_{i \to \infty} f_i\).</p>
<p><img src="/assets/images/2022-07-01-hssv22/p5.jpeg" alt="" /></p>
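<p>In a discretized one-dimensional toy model (my own sketch; the paper’s construction over Gaussian space is far more delicate), a single drop-and-threshold step looks like this:</p>

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermevander

# Represent functions by their values on Gauss-Hermite nodes, so that
# expectations over N(0,1) are weighted sums and the degree-<=d projection
# is a small linear map. This is a toy sketch, not the paper's construction.
d = 0                                       # degree cutoff for the demo
x, w = hermegauss(120)
w = w / np.sqrt(2 * np.pi)                  # weights now sum to 1
V = hermevander(x, d)                       # He_0..He_d at the nodes
norms = np.array([float(factorial(j)) for j in range(d + 1)])  # E[He_j^2] = j!

def low(f_vals):
    """The low-degree part f^{<=d}, evaluated back on the nodes."""
    coeffs = (V * (w * f_vals)[:, None]).sum(axis=0) / norms
    return V @ coeffs

def drop_and_threshold(f_vals, eta):
    """One step f_{i+1} = f_i^{>d} * 1{|f_i^{<=d}| <= eta_i}."""
    f_low = low(f_vals)
    return (f_vals - f_low) * (np.abs(f_low) <= eta)

g = x ** 2                                  # low-degree part is the constant 1
print(drop_and_threshold(g, 0.5).max())     # -> 0.0 (eta too small: all cut)
print(np.abs(low(drop_and_threshold(g, 2.0))).max())  # -> ~0 (orthogonal)
```

<p>In this simple demo, a generous \(\eta\) means the indicator never fires, so one step reduces to dropping the low-degree part; with \(\eta\) too small, everything is cut away. The real difficulty, as described above, is choosing a schedule of thresholds between these extremes.</p>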
<p>This proof is delicate for two reasons:</p>
<ul>
<li>We need to be careful with our choices of \(\eta_i\), which decrease as \(i\) grows. If they decay too rapidly, then \(g\) might be a very poor approximation of \(f\). Allow \(\eta_i\) to decay too slowly and \(g\) may not end up being bounded.
We had to try out a wide range of schedules of \(\eta_i\) before finally finding the right one.</li>
<li>Limits over functions can be tricky, and we ran into several issues when we weren’t precise enough. Fortunately, Manolis is very skilled with this kind of math and we figured out the right way to formalize it.</li>
</ul>
<p>With this, we got everything we needed: a function \(g\) that validates that \(f\) is approximately resilient as long as its low-degree Hermite coefficients are small.
Thus, \(\mathsf{Cube}\) is approximately \(\tilde\Omega(\log k)\)-resilient, and \(\mathcal{H}_k\) cannot be learned with statistical queries of tolerance coarser than \(n^{-\tilde\Omega(\log k)}\).
This allows us to conclude the optimality of Rocco’s 14-year-old algorithm for learning intersections of halfspaces.
<p><em>Thank you for making it through this monster of a blog post! (Or for scrolling to the bottom of the page.) I really do hope to write more of these, and as always, I’d love any feedback, questions, or ideas.</em></p>
<h2 id="books-of-2021">Books of 2021</h2>
<p>Clayton Sanford, January 3, 2022.</p>
<p>Happy new year!
One of my primary goals of 2021 was to create this blog, which I actually managed to achieve.
In 2022, I intend to continue writing on this blog and to also post content that can be read by anyone, not just people who do research on machine learning theory.
As a first attempt to write something that’ll appeal to non-computer scientists, here’s a quick post listing and commenting on the non-technical books I read last year.</p>
<p>Stars [*] indicate a book was chosen by a book club.
I thoroughly enjoyed nearly every book on the list, but those marked with a dagger [\(^\dagger\)] are the ones I’d particularly recommend.
If you have any thoughts, questions, or book recommendations, feel free to comment below or email me.</p>
<p>I read the early books on the list months ago, so some of my comments on them are a bit rusty.</p>
<ol>
<li><a href="#diamond-age-neal-stephenson"><em>Diamond Age</em>, Neal Stephenson</a></li>
<li><a href="#antifragile-nassim-nicholas-taleb"><em>Antifragile</em>, Nassim Nicholas Taleb</a></li>
<li><a href="#the-death-and-life-of-great-american-cities-jane-jacobs"><em>The Death and Life of Great American Cities</em>, Jane Jacobs</a></li>
<li><a href="#the-new-me-halle-butler"><em>The New Me</em>*, Halle Butler</a></li>
<li><a href="#the-midnight-library-matt-haig"><em>The Midnight Library</em>*, Matt Haig</a></li>
<li><a href="#pachinkodagger-min-jin-lee"><em>Pachinko</em>\(^\dagger\), Min Jin Lee</a></li>
<li><a href="#a-burning-megha-majumdar"><em>A Burning</em>, Megha Majumdar</a></li>
<li><a href="#new-york-2140-kim-stanley-robinson"><em>New York 2140</em>*, Kim Stanley Robinson</a></li>
<li><a href="#being-mortaldagger-atul-gawande"><em>Being Mortal</em>\(^\dagger\), Atul Gawande</a></li>
<li><a href="#the-man-who-mistook-his-wife-for-a-hat-oliver-sacks"><em>The Man Who Mistook His Wife for a Hat</em>*, Oliver Sacks</a></li>
<li><a href="#guns-germs-and-steel-jared-diamond"><em>Guns, Germs, and Steel</em>, Jared Diamond</a></li>
<li><a href="#drive-your-plow-over-the-bones-of-the-deaddagger-olga-tokarczuk-reread"><em>Drive Your Plow over the Bones of the Dead</em>\(^\dagger\), Olga Tokarczuk (reread) </a></li>
<li><a href="#a-brief-history-of-seven-killings-marlon-james"><em>A Brief History of Seven Killings</em>*, Marlon James</a></li>
<li><a href="#red-white-and-royal-blue-casey-mcquiston"><em>Red, White, and Royal Blue</em>, Casey McQuiston</a></li>
<li><a href="#the-smallest-light-in-the-universedagger-sarah-seager"><em>The Smallest Light in the Universe</em>\(^\dagger\), Sarah Seager </a></li>
<li><a href="#exhalationdagger-ted-chiang"><em>Exhalation</em>*\(^\dagger\), Ted Chiang</a></li>
<li><a href="#fun-homedagger-alison-bechdel"><em>Fun Home</em>*\(^\dagger\), Alison Bechdel</a></li>
<li><a href="#the-vegetarian-han-kang"><em>The Vegetarian</em>*, Han Kang</a></li>
<li><a href="#the-plague-albert-camus"><em>The Plague</em>, Albert Camus</a></li>
<li><a href="#the-remains-of-the-daydagger-kazuo-ishiguro"><em>The Remains of the Day</em>\(^\dagger\), Kazuo Ishiguro</a></li>
<li><a href="#conversations-with-friends-sally-rooney"><em>Conversations with Friends</em>, Sally Rooney</a></li>
</ol>
<h2 id="diamond-age-neal-stephenson"><em>Diamond Age</em>, Neal Stephenson</h2>
<p><img src="/assets/images/2022-01-03-books/diamond.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 499 pages, published in 2000.</p>
<p>Recommended by a friend, <em>Diamond Age</em> is a sci-fi novel set in a future world with advanced nanotechnology, where society has fragmented into discrete warring cultures within cities.
Stephenson’s world-building is fascinating; he explores the implications of ubiquitous nanotechnology, where every household has a “matter compiler” that creates household objects as needed from “the Feed” of molecules distributed from a central source.
His world is sharply divided into “phyles,” global cultures that are sovereign in certain districts of different cities across the world and impose their own sets of rules.
Some of the main characters belong to a “Neo-Victorian” phyle that controls a section of Shanghai, alongside the nearby “Han” and “Nippon” phyles, and the book comments on losses in translation between cultures and the benefits and challenges of more centralized and decentralized ways of organizing societies. (He also follows a hierarchical faction of hackers that aims to restructure the world as it is.)</p>
<p>The book follows several plotlines and a collection of loosely-connected characters.
My favorite chapters were those that followed Nell (a young girl from a very poor background) as she interacts with the <em>Young Lady’s Illustrated Primer</em>, an interactive story-book that adapts to her real-world challenges to teach her independent thinking in a world where most people have become extremely passive.
There are plenty of strange asides and tangents (and a few plotlines at the end of the book involve unnecessary sexual violence), and some of the characters are less interesting than the ideas they represent; nonetheless, I found it to be a great read.</p>
<!-- There are a ton of ideas here, and I'd love to reread the book to catch more details about the world he creates.
At times, the ideas Stephenson discusses are more interesting than the characters he creates, but it's still a great read.
-->
<p><a href="#">[top of page]</a></p>
<h2 id="antifragile-nassim-nicholas-taleb"><em>Antifragile</em>, Nassim Nicholas Taleb</h2>
<p><img src="/assets/images/2022-01-03-books/antifragile.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 519 pages, published in 2012.</p>
<p>In 2020, I read Taleb’s <em>The Black Swan</em>, which focuses on the idea that we (academics, world leaders, everyday people) repeatedly fail to consider “black swan” events: low-probability, high-significance events (like a terrorist attack or a pandemic) that are difficult to model, yet occur often enough to repeatedly catch us unprepared.
He excoriates and ridicules so-called experts caught unaware by unexpected disasters because their models assume that risks roughly follow a <a href="https://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a>, where deviation from some mean value is uncommon; such models are good for considering natural phenomena (like heights of people and the number of people who die from heart disease every year), but perform terribly for phenomena where the interesting behavior exists far from the mean (such as the winner-take-all dynamics of wealth accumulation and the death tolls of wars).</p>
<p><em>Antifragile</em> is his follow-up to <em>The Black Swan</em>; it focuses less on diagnosing the problem and more on constructing strategies that best handle these rare events.
He argues that systems (e.g. investment strategies, research agendas, foreign policies, health regimens) should be designed to not only handle uncertainty but grow stronger in the face of rare events and volatility.
They do so by designing responses that are nonlinear; for instance, he suggests a “barbell” investment strategy that invests most of a portfolio in extremely stable investments (think: bonds) and a small amount in ventures with low probabilities of astronomical success.
Doing so creates a “convex” response that bounds the amount of harm one suffers from failure while being open to extreme amounts of success. (Notably, Taleb has employed such strategies to great success, making lots of money by betting against the housing bubble that preceded the Great Recession and all the “suckers” who believed in it.)</p>
<p>Taleb regards antifragility as a broad life philosophy that extends well beyond just investment, frequently invoking classical scholars and focusing on a wide range of applications.
He ruthlessly criticizes those whose value systems are fragile to uncertainty, especially those who lack skin in the game and who are not the ones directly harmed by the failures of their models.
He reserves intense criticism for academics, most of whose research he deems at best ineffectual (since he believes innovation to come from tinkering and dealing with uncertainty directly) and at worst evil (from being held unaccountable when their models are deployed at a large scale and fail to work).
He made points that on occasion made me feel directly under attack.</p>
<p>Personally, I think his pugilistic style worked much better in <em>The Black Swan</em>, where his aim is to criticize, than in <em>Antifragile</em>, where his goals are more constructive.
I found his antifragility framework compelling in the areas where he has expertise (such as investment), but less so when he discusses topics like nutrition and politics, where his expertise is less apparent and he relies more heavily on ad hominem attacks on those who disagree with him.
His frequent assumptions of bad faith and incompetence grated on me intensely; at the same time, I certainly grew as a thinker by reading both of his books, and I intend to read more of his works later on.
<p>If you can stand Taleb’s writing style, I’d certainly recommend the book.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-death-and-life-of-great-american-cities-jane-jacobs"><em>The Death and Life of Great American Cities</em>, Jane Jacobs</h2>
<p><img src="/assets/images/2022-01-03-books/death.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 458 pages, published in 1961.</p>
<p>As a relatively new New Yorker who geeks out on trains, loves walking long distances around the city, and has opinions on housing and density in cities, Jane Jacobs was a must-read, and I really enjoyed her polemic on urban planning.
Set in <a href="https://en.wikipedia.org/wiki/Robert_Moses">Robert Moses’s</a> NYC, where neighborhoods were routinely razed without the consent of their inhabitants to build freeways, Jacobs argues that centralized urban planning fails to create lively and safe neighborhoods and argues for mixed-use development and public input.
She argues that cities and neighborhoods are dynamic and a lack of respect for their social fabric (by, say, replacing a dense immigrant neighborhood with an active street life and a mix of apartments and shops with a sterile apartment building surrounded by lawn) harms residents and raises crime.
She criticizes central planners like Moses for failing to understand the dynamics of the neighborhoods they interrupt, and suggests that mixed-use development (where houses, storefronts, schools, and restaurants coexist) keeps neighborhoods safe by having a variety of people passing through at different times of the day for different purposes.</p>
<p>It was particularly fun to listen to this as an audiobook while walking around the city. She frequently singles out particular blocks, parks, and neighborhoods in New York for praise or criticism, and often they’re similar enough 60 years later for me to see her thoughts with my own eyes.
Her arguments on mixed-use development are very prescient, and many modern developments in NYC and other cities obey the principles she lays out.
At the same time, her arguments for local control and for the voices of neighbors to be accounted for arguably have been too successful; many of the key limitations faced by cities when planning are vocal neighborhood groups that weaponize hearings and public comment windows that prevent cities from building projects that improve public transportation and increase affordability.</p>
<p>I thought there was an interesting overlap between Jacobs and Taleb’s books.
Jacobs’ arguments against projects that serve only a single function and splinter neighborhoods can be framed as a criticism of Moses-style urban planning as being too “fragile.”
Her advocacy for multi-use projects claims that having housing and schools and stores and restaurants in the same neighborhoods prevents any blind-spots in safety that occur when only one function is present; this seems to mesh nicely with Taleb’s strategies for avoiding susceptibility to uncommon events.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-new-me-halle-butler"><em>The New Me</em>*, Halle Butler</h2>
<p><img src="/assets/images/2022-01-03-books/new-me.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2020.</p>
<p>This was the first book club book I read this year, and it’s a bit of an odd one.
It follows Millie, a depressed 30-something Millennial woman living in a city working at a temp job.
She suffers from extreme burnout, has terrible hygiene, and hates her friends.
The book documents the sense of unfulfilled expectations and directionlessness faced by privileged young adults who were told they were special.
Some of the best scenes involved Millie extrapolating how her coworkers live based on a few observations.
It also does a good job documenting the cyclic process of seeking out life changes that will make one happier, succeeding temporarily in making those improvements, and then losing hold of them and sinking back to a depression.
If you can accept that Millie will be exhausting and aggravating at times, it’s a nice read.</p>
<p>I haven’t read <em>My Year of Rest and Relaxation</em>, but apparently this is very similar.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-midnight-library-matt-haig"><em>The Midnight Library</em>*, Matt Haig</h2>
<p><img src="/assets/images/2022-01-03-books/midnight.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 193 pages, published in 2019.</p>
<p>A young woman finds nothing in her life to be going her way–her cat has died; her family is estranged; she is fired; her love life is non-existent; and her dreams of being a rockstar, a glaciologist, or a swimmer are unmet–and she decides to take her own life.
Instead of dying, however, she finds herself in a library where she has the opportunity to visit alternate versions of her life had she made different decisions at different points in her life.
The result is a series of anecdotes of her experiencing different versions of herself and observing what changes and what stays the same.
The book is a little predictable and can be saccharine at times, but it’s a very nice exploration of the main character, and it has a nice ending.</p>
<p>Personally, I was annoyed by the parts of the book that tried to talk about multiverse theory as an explanation for her ability to explore the different timelines; multiverses seem so overused at this point, and I think the author could have just presented her ability to transfer between lifetimes as fact without needing to get all pseudo-scientific. But that’s just a silly thing that bugs me; I overall really liked the book.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="pachinkodagger-min-jin-lee"><em>Pachinko</em>\(^\dagger\), Min Jin Lee</h2>
<p><img src="/assets/images/2022-01-03-books/pachinko.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 496 pages, published in 2017.</p>
<p><em>Pachinko</em> follows, over several decades, four generations of a Korean family who immigrate to Japan around the time of World War II.
The book was historically informative to me; I’d had no idea of the scale of the atrocities Japan committed against Korean people, nor of the extent of the racism ethnically Korean people living in Japan face.
The book primarily follows the life of Sunja, who as a teenager has an affair with a wealthy older man and becomes pregnant, but refuses to go with him when she learns he is married.
She instead marries a poor Christian missionary and emigrates with him to Japan; the novel follows her struggles to make a life there and the divergent paths of her two sons.
Lee’s characters are very compelling and she tells a great epic about coming of age and power dynamics in a land where one does not belong.
The title refers to a casino game in Japan, whose parlors are often owned and operated by Koreans; pachinko goes on to represent both how Koreans living in Japan can rise in a foreign society, but also the cultural barriers that prevent any real form of equality.</p>
<p>I was completely immersed by this book, and I’d strongly recommend it.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="a-burning-megha-majumdar"><em>A Burning</em>, Megha Majumdar</h2>
<p><img src="/assets/images/2022-01-03-books/burning.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2020.</p>
<p>The novel is set in India amidst rising Hindu nationalism (as there is today under Modi) and follows three characters enmeshed in different ways in a terrorist attack that destroyed a train and killed numerous people.
Jivan is a Muslim girl from a poor background struggling to get ahead who is baselessly framed as an accomplice of the attack.
PT Sir is a P.E. teacher who once taught Jivan and sees a pathway to a more notable life for himself by working in a Hindu nationalist political party that aims to capitalize politically on Jivan’s case.
Lovely is a singer who is a <a href="https://en.wikipedia.org/wiki/Hijra_(South_Asia)">hijra</a> and has proof of Jivan’s innocence.
The book is an interesting exploration of their intersecting stories and what happens when getting ahead requires moral compromises.
I found the book to be a good read; the politics of it were a bit blunt and in-your-face, but the characterizations were good.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="new-york-2140-kim-stanley-robinson"><em>New York 2140</em>*, Kim Stanley Robinson</h2>
<p><img src="/assets/images/2022-01-03-books/new-york.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 624 pages, published in 2017.</p>
<p>The title here is pretty literal; the book imagines New York City in 2140 after being ravaged by climate change and having sea levels rise fifty feet.
After neglecting climate change for years, humans in Robinson’s book are shocked by the rapidness of sea level rise and are forced to rapidly adapt when it happens.
The main characters live in the famous MetLife Building in Midtown, which is partially submerged.
People navigate primarily by boat and buildings frequently collapse due to structural damage, while the parts of the city that are above water (like my own neighborhood Morningside Heights) have become enclaves for extremely rich real estate developments.
As with <em>Diamond Age</em> and many other sci-fi books, the concepts developed are often much more interesting than the characters, who often seem like mouthpieces for ideologies rather than three-dimensional people.
However, the concepts explored are fascinating (amphibious people living mostly underwater, the consequence of climate refugees, reality stars trying to save polar bears from extinction), and Robinson seems to have done his homework on the science.</p>
<p>The book’s message is a broader critique of capitalism that goes beyond its failure to slow climate change.
Robinson’s world is one where everything is financialized (a character makes his living betting on the level of sea level rise in different locations), where the division of wealth is even more extreme, where private security forces protecting assets carry more power than the police, and where there are very few jobs in the “real” economy but tons of lawyers and investors.
He focuses heavily on the 2008 financial crisis (which breaks the immersion a bit, since it seems unrealistic for characters to fixate on something that happened 132 years earlier), and the characters pursue their goals by organizing a tenants’ union and, ultimately, organizing electorally.
All in all, an interesting (albeit, dense and occasionally dry) book that’s more a critique of finance than of fossil fuels.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="being-mortaldagger-atul-gawande"><em>Being Mortal</em>\(^\dagger\), Atul Gawande</h2>
<p><img src="/assets/images/2022-01-03-books/being.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 282 pages, published in 2014.</p>
<p>Twice, I’d told a friend in medical school that I’d read <a href="https://www.goodreads.com/en/book/show/25899336-when-breath-becomes-air"><em>When Breath Becomes Air</em></a>, and twice, they’d told me that <em>Being Mortal</em> provides a better meditation on medicine and mortality.
I finally decided to read it, and I found it fascinating.
Gawande focuses on the challenges posed by aging and criticizes how many of our life decisions focus on extending life at the cost of independence, comfort, and individuality.
He is particularly critical of how many decisions about aging and death have shifted from being nuanced cultural discussions to purely medical choices focused on a patient’s survival over their own desires; fortunately, he believes the tide to be turning and finds promise in recent changes to elder care and hospice care.
He makes his case by presenting a wide range of anecdotes about aging and dying people alongside well-researched arguments.
While most of the book is not a memoir, the book closes with a very compelling discussion on his own father’s process of dying and the delicate decisions that were made by his father, his family, and his doctors.</p>
<p>Quotes from books rarely stick with me, but one did from this one that encapsulates the challenges we often face with aging and death: “We want autonomy for ourselves, but safety for those we love.”
I appreciated the book because it grapples with the complexity of the questions posed by aging and dying without claiming to have easy answers to them.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-man-who-mistook-his-wife-for-a-hat-oliver-sacks"><em>The Man Who Mistook His Wife for a Hat</em>*, Oliver Sacks</h2>
<p><img src="/assets/images/2022-01-03-books/man.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 243 pages, published in 1985.</p>
<p>The book consists of a series of essays about patients the author has treated as a neurologist.
In one essay, a patient can no longer create new memories, and he remembers none of the previous several decades; his memory of his life during World War II remains sharp, and he does not realize that he and his family have aged.
In another, a musical teacher has visual agnosia and fails to recognize objects, despite having working vision (and hence mistakes his wife for a hat); the essay is about finding a way to focus on the music that brings him joy despite the loss of visual capabilities.
Most of the stories are more about the humanity of those who have these rare conditions, rather than a rigorous medical discussion.</p>
<p>The book club meeting focused a lot on the dissonance we felt while reading the book; for the time period when Sacks published the book, the essays are a far cry from many other works that would ridicule or dehumanize these people.
However, Sacks’s language doesn’t hold up for the modern world, and certain words and descriptions come across as overly harsh.
While his book as written seems crass at times to our group, we recognized that it’s still a strong step forward for its time.</p>
<p>I was particularly affected by one of the later essays, called “The Twins,” which focuses on two autistic twins with extremely sophisticated mathematical intuitions.
The two could memorize enormous numbers and factor them in their heads.
They often communicated by sharing prime numbers to one another, which conveyed actual messages and emotions.
They appeared to have a deep intuitive understanding of prime numbers that almost everyone else lacks.
As a mathematician (sorta) myself, it made me reflect upon how rudimentary my mathematical intuition is, and how poorly trained my brain is to do the job I have.
I can keep maybe 4 or 5 variables in my head at once and visualize no more than three dimensions; these twins he profiles clearly had a richer intuition that could not be explained, and I found myself thinking of how limited my own perception of math is by the way my brain is configured.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="guns-germs-and-steel-jared-diamond"><em>Guns, Germs, and Steel</em>, Jared Diamond</h2>
<p><img src="/assets/images/2022-01-03-books/guns.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 498 pages, published in 1997.</p>
<p>Okay, so this is a “big history” book (like <em>Sapiens</em>) where an expert in one field comes up with a big unifying theory that claims to explain the main trends in history from <em>homo habilis</em> to 2022.
Despite the pitfalls of the genre (the oversimplification of complex events to fit a narrative that can be presented in 500 pages), I think the book’s core argument is a good one.
Diamond argues that geography is central to the nature and extent of civilizational growth in different regions and helps explain why certain civilizations dominated others historically.
I found the most interesting parts of the book to be his discussions on agriculture: that annual grasses are the easiest to domesticate, that there are relatively few animals that can be domesticated at all, and that the selection of domesticable plants in the Americas and in Australia were too nutrient-poor to make it possible to abandon the hunter-gatherer way of life.
Given Diamond’s background in evolutionary biology, geography, and physiology, I’m most persuaded by his arguments about the role nearby plant and animal species played in the development of early agricultural cultures and their ability to form empires.</p>
<p>Parts of the book seemed rather obvious to me, but perhaps that speaks to its success.
My high school history teachers often assigned this book to classes, so its ideas about geographic influence on history were likely already indirectly incorporated by my teachers into lessons.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="drive-your-plow-over-the-bones-of-the-deaddagger-olga-tokarczuk-reread"><em>Drive Your Plow over the Bones of the Dead</em>\(^\dagger\), Olga Tokarczuk (reread)</h2>
<p><img src="/assets/images/2022-01-03-books/drive.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 318 pages, published in 2009.</p>
<p>This was the first book I read for my book club in 2019, and I revisited it while traveling in Colorado for a conference this summer.
<em>Drive Your Plow</em> follows Janina, an older Polish woman who maintains houses in a small village, cares for wild animals, translates William Blake, studies astrology, teaches English to children, cooks vegetarian food, and builds up an entourage of misfit adults.
She’s a bit of an unreliable narrator who periodically omits key details, but she’s easy to love nonetheless.</p>
<p>The book reflects on militant vegetarianism and the invisibility of older women.
It does a great job at highlighting the anguish one must feel when all others are seemingly blind to what is transparently unjust.
The novel defies genre and lurches between cozy moments and grisly murders (a character is eaten alive by beetles).
I spent my first read trying to figure out whether the book was magical realism or not, and the ambiguity made it a more gripping read.
The translation from Polish is also incredible; there’s a particular scene where Janina is attempting to figure out the right Polish phrase for an English passage, which means the translator must have come up with multiple plausible English translations of Polish translations of an English text.</p>
<p>I liked it well enough my first time through, but I thoroughly enjoyed it the second.
Knowing the general contours of the plot made it possible for me to catch more of her biting and witty observations of those around her and to better understand the nature and intensity of Janina’s rage.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="a-brief-history-of-seven-killings-marlon-james"><em>A Brief History of Seven Killings</em>*, Marlon James</h2>
<p><img src="/assets/images/2022-01-03-books/killings.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 688 pages, published in 2014.</p>
<p>Before performing a peace concert in Jamaica in 1976, Bob Marley was shot by Jamaican gangsters in his house.
He survived the gunshots and performed the concert two days later.
The novel is divided into five sections, each covering the events of a single day between 1976 and 1991 in Jamaica and NYC, told from the perspectives of the gangsters involved, a lover of Marley’s, a CIA agent, and an American reporter.</p>
<p>The book is seriously gruesome.
People are beheaded, shot in broad daylight, and die of overdoses.
A man is buried alive, and the chapter is narrated from the perspective of the man buried.
Some of the plotlines meander away from the core plot: one of the gangsters has anonymous sex with men while denying his homosexuality; a Jamaican-born caregiver exchanges raunchy jokes with the old white man she works for.
It’s at times hard to keep track of all of the characters, but it’s a pretty neat tapestry of the far-reaching effects of a single day once you can follow it.
It’s a commitment to read, but the book has fascinating characters and touches on a wide range of themes if you can stomach it.</p>
<p>As someone who didn’t know much of anything about Jamaica prior to reading the book, I found it highly educational, especially with respect to Jamaican slang; you’ll encounter words like “bomboclat” frequently, and I think it’s easier to listen to the audiobook, where most chapters are narrated by people with Jamaican accents.
Also, this book introduced me to <a href="https://en.wikipedia.org/wiki/Griselda_Blanco">Griselda Blanco</a>, who has one of the crazier Wikipedia pages I’ve read.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="red-white-and-royal-blue-casey-mcquiston"><em>Red, White, and Royal Blue</em>, Casey McQuiston</h2>
<p><img src="/assets/images/2022-01-03-books/red.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 421 pages, published in 2019.</p>
<p>After reading 700 pages of Jamaican gangsters murdering people in cold blood, I needed something light and cute to devour and bring my mood back.
Enter <em>Red, White, and Royal Blue</em>, a cute gay romance book about the biracial bisexual son of the first female American president falling in love with a prince of England.
Sure, it was peak Trump-era liberal what-if escapism; sure, it was cheesy and overly cutesy at times; and yeah, some of the politics seemed a little over-simplified. But who cares?
It served its purpose, the characters were fun, and it was a nice escape from the brutality of <em>Seven Killings</em> and from the stress of the real world.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-smallest-light-in-the-universedagger-sarah-seager"><em>The Smallest Lights in the Universe</em>\(^\dagger\), Sara Seager</h2>
<p><img src="/assets/images/2022-01-03-books/smallest.jpg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 308 pages, published in 2020.</p>
<p>One of my high school teachers recommended this memoir about an MIT physicist’s search for exoplanets and her grief over her husband’s death from cancer.
Both components of the book and their intersections are excellently written.
As someone who knows extremely little about physics, but who’s always been vaguely interested in space, I enjoyed Seager’s descriptions of the science used to evaluate whether an exoplanet could feasibly support life and of the intensity of her drive to pursue that work.</p>
<p>Her discussion of her husband’s death and the grief of Seager and her two young sons was remarkable in its vulnerability and openness.
I found myself gratefully admiring Seager for her resilience, but also for her ability to be honest about times when she was less resilient.
She captures moments that many of us might be too uncomfortable to bare to our peers, let alone the whole world: her coping mechanisms, her over-reliance on others to help her complete simple tasks, the internal turmoil of falling in love again, the stresses of traveling as a single parent.
I appreciate that she shared her journey and has the humility, which many other academics lack, to show her shortcomings.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="exhalationdagger-ted-chiang"><em>Exhalation</em>*\(^\dagger\), Ted Chiang</h2>
<p><img src="/assets/images/2022-01-03-books/exhalation.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 352 pages, published in 2019.</p>
<p>I’d been itching to read something by Ted Chiang for a while, and I finally got the chance with a book club book.
<em>Exhalation</em> is a collection of sci-fi short stories that hits on a variety of themes on technology.</p>
<p>In “The Lifecycle of Software Objects,” a woman is employed by a software company to train and socialize “digients,” AI animals with a high capacity for intelligence.
While reading it, I kept thinking it was going to fall for the standard tropes of books on AI: When will the digients go rogue and kill their handlers? Are they going to suddenly be declared “conscious” by some fuzzy definition?
Instead, the story focuses on the relationships among handlers and between handlers and digients, and their adaptation to less global issues: What happens when everyone else wants to migrate to a new digital platform that is incompatible with the digients? Is it acceptable to make copies of digients and use the copies for less savory purposes (e.g., sex work)?</p>
<p>In other stories, he considers machines that provide windows between parallel universes, juxtaposes the development of retinal recordings with the introduction of the written word to an illiterate tribe, explores time travel in a medieval Islamic context (where those who pass through the gates are more accepting of fate and avoid meddling), and makes the case that we ignore amazing forms of life at home (like parrots) while aspiring to find extraterrestrial life.
Unlike the two other works of sci-fi on this list, Chiang succeeds in marrying thought-provoking concepts with interesting characters.
I’m looking forward to reading his other anthology, <em>Stories of Your Life and Others.</em></p>
<p><a href="#">[top of page]</a></p>
<h2 id="fun-homedagger-alison-bechdel"><em>Fun Home</em>*\(^\dagger\), Alison Bechdel</h2>
<p><img src="/assets/images/2022-01-03-books/fun.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 240 pages, published in 2006.</p>
<p>In this graphic novel, Alison Bechdel (as in the <a href="https://en.wikipedia.org/wiki/Bechdel_test">Bechdel test</a>) examines her childhood and the nuances of her relationship with her troubled father.
Both Bechdel and her father are gay, but Bechdel’s father remained closeted and had sexual relationships with teenage boys before taking his life not long after she came out to her parents.
She traces the generational differences between gay people who came of age in the 50s and the 70s: she wonders whether her father’s fate would have been different had he been born later, and expresses sympathy without softening her criticism of his illicit actions.</p>
<p>I appreciated her thoughtful exploration of her fragmented family and found myself thinking more about the questions she asked about generations. I’d unconsciously grouped LGBTQ people into only two groups: those who were of age during the AIDS epidemic and those who were not; this book made me think more about the intense repression faced by those in my grandparents’ generation.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-vegetarian-han-kang"><em>The Vegetarian</em>*, Han Kang</h2>
<p><img src="/assets/images/2022-01-03-books/vegetarian.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 188 pages, published in 2016.</p>
<p>This might just be the most bizarre novel I’ve ever read.
It starts with a decision by a Korean woman, Yeong-hye, to give up meat and become a vegetarian.
Her husband and father respond to her rebellion in a shockingly harsh manner, given how widespread vegetarianism is.
The first section of the book is narrated by her husband, the second by her brother-in-law, and the third by her sister.
It’s about passivity, it’s about abuse, it’s about a lack of respect for female autonomy.
It’s also about becoming a plant and sex scenes where both participants are covered in painted flowers.
I won’t pretend to claim I fully understood this one, but it’s an interesting exploration of Yeong-hye’s strange ways of taking control of herself.
It’s really not about vegetarianism at all.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-plague-albert-camus"><em>The Plague</em>, Albert Camus</h2>
<p><img src="/assets/images/2022-01-03-books/plague.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 308 pages, published in 1947.</p>
<p>I waited long enough into the pandemic before reading <em>The Plague</em>, with the hope that I could pick up on its prescience for COVID without it hitting too close to home.
It tells the story of a town in North Africa beset by plague and follows Rieux, a doctor whose commitment to treating the infected is so strong that adopting a fatalist attitude is unimaginable to him.
Much of the book is about individual powerlessness in the face of an epidemic that acts according to its own whims; while Rieux’s choices to treat others are obvious to him, others struggle more to find meaning amidst the suffering and the arbitrariness of the plague’s killing.
Camus is critical of meeting forces outside our control (like a plague) with resignation to their absurdity; rather, one should fight them even against long odds of failure.</p>
<p>Naturally, much of it felt very familiar from the perspective of COVID. The journalist’s desperation for an excuse to leave the city rings true for those of us who had voices in our head trying to rationalize why the lockdown ought not apply to us.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-remains-of-the-daydagger-kazuo-ishiguro"><em>The Remains of the Day</em>\(^\dagger\), Kazuo Ishiguro</h2>
<p><img src="/assets/images/2022-01-03-books/remains.jpg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 258 pages, published in 1989.</p>
<p>This may have been my favorite book of the year.
Ishiguro chronicles the reflections of an aging English butler in the 1950s as the social order he spent his life serving decays and his recently-deceased lord’s name is besmirched.
The events of the book unfold over a drive through the countryside, during which the butler ponders his past and the morality of his beloved Lord Darlington.
Throughout, he grapples with the question of dignity: Does dignity as a butler require serving a virtuous master? And if so, is it a core duty of a great butler to frequently assess the morals of the lord he serves?
His ruminations on what it means to be a great butler and to have dignity cause him to revisit memories of his father’s own career as a butler, of Lord Darlington’s association with German leaders before WWII, and of his own relationship with the housekeeper.
His reflections are clear, thoughtful, and at times humorous, and I often found myself jarred by the gap between their clarity and his awkwardness in actual conversation.</p>
<p>I thoroughly enjoyed the book. It was a pleasure to read, and it fueled questions for my own reflection as I think about what dignity means in the context of my own career.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="conversations-with-friends-sally-rooney"><em>Conversations with Friends</em>, Sally Rooney</h2>
<p><img src="/assets/images/2022-01-03-books/conversations.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2017.</p>
<p>Rooney’s novel follows four main characters: Frances, an introspective and sharp writer and college student; Bobbi, her more bombastic and radical friend and former lover; Melissa, a socialite and older journalist who is intrigued by the girls; and Nick, Melissa’s passive husband and an actor.
Frances and Nick’s love affair is the main focus of the plot, but it’s also the only book I’ve read where there’s actually a “love square” with four edges.
The characters were interesting (although irritating at times), and it functioned as a coming-of-age book of sorts for Frances and Bobbi, as they come to better understand the power dynamics of their complex relationships with one another and with the older couple.
The malaise and insecurity it detailed was similar in some ways to that of <em>The New Me</em>, without the extreme dysfunction.</p>
<p>Overall, I’d say it wasn’t a perfect fit for me, but I’m still glad to have read it.</p>
<p><a href="#">[top of page]</a></p>
<p>…</p>
<p><em>Thanks for making it all the way down!
This took a while for me to write. (For 2022, I think I’ll try to write these little summaries as I go, so I’ll remember them better, and so I won’t have to do them all in one day.)
Once again, happy new year, and I look forward to having more content soon (ML or otherwise).</em></p>Clayton SanfordHappy new year! One of my primary goals of 2021 was to create this blog, which I actually managed to achieve. In 2022, I intend to continue writing on this blog and to also post content that can be read by anyone, not just people who do research on machine learning theory. As a first attempt to write something that’ll appeal to non-computer scientists, here’s a quick post listing and commenting on the non-technical books I read last year.How do SVMs and least-squares regression behave in high-dimensional settings? (NeurIPS 2021 paper with Navid and Daniel)2021-12-07T00:00:00+00:002021-12-07T00:00:00+00:00http://blog.claytonsanford.com/2021/12/07/ash21<p>Hello, it’s been a few weeks since I finished my candidacy exam, and I’m looking forward to getting back to blogging on a regular basis.
I’m planning on focusing primarily on summarizing others’ works and discussing what I find interesting in the literature, but I periodically want to share my own papers and explain them less formally.
I did this a few months ago for my first grad student paper on the approximation capabilities of depth-2 random-bottom-layer neural networks <a href="/2021/08/15/hssv21.html" target="_blank">HSSV21</a>.</p>
<p>This post does the same for <a href="https://proceedings.neurips.cc/paper/2021/hash/26d4b4313a7e5828856bc0791fca39a2-Abstract.html" target="_blank">my second paper</a>, which is on support vector machines (SVMs) and ordinary least-squares regression (OLS) in high-dimensional settings.
I wrote this paper in collaboration with Navid Ardeshir, another third-year PhD student at Columbia studying Statistics, and our advisor, <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>.
It appears at NeurIPS 2021 this week: a talk recorded by Navid is <a href="https://neurips.cc/virtual/2021/poster/27524" target="_blank">here</a>, our paper reviews are <a href="https://openreview.net/forum?id=9bqxRuRwBlu" target="_blank">here</a>, and our poster will be virtually presented on Thursday 12/9 between 8:30am and 10am Pacific time.</p>
<p>I’d love to talk with anyone about this paper, so if you have any questions, comments, or rants, please comment on this post or send me an email.</p>
<h2 id="what-are-ols-and-svms">What are OLS and SVMs?</h2>
<p>The key result of our paper is that two linear machine learning models coincide in the high-dimensional setting.
That is, when the dimension \(d\) is much larger than the number of samples \(n\), the solutions of the two models on the same samples have the same parameters.
This is notable because the models have different structures and appear at first glance to incentivize different kinds of solutions.
It’s also perplexing because the models do not seem analogous: OLS is a regression algorithm, while the SVM is a classification algorithm.
We’ll briefly explain what the two models are below and what they mean in the high-dimensional setting.</p>
<p>Both of these models were discussed extensively in <a href="/2021/07/04/candidacy-overview.html" target="_blank">my survey</a> on over-parameterized ML models, and I’ll periodically refer back to some of those paper summaries (and occasionally steal visuals from my past self).</p>
<h3 id="ols-regression-and-minimum-norm-interpolation">OLS regression and minimum-norm interpolation</h3>
<p>The task of ordinary least-squares (OLS) regression is simple: find the linear function (or hyperplane) that best fits some data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}\).
To do so, we learn the function \(x \mapsto w_{OLS}^T x\), where \(w_{OLS}\) solves the following optimization problem, minimizing the mean-squared error between training labels \(y_i\) and each prediction \(w_{OLS}^T x_i\):</p>
\[w_{OLS} \in \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (y_i - w^T x_i)^2.\]
<p>For the “classical” learning regime, where \(d \ll n\), \(w_{OLS}\) can be explicitly computed as \(w_{OLS} = X^{\dagger} y = (X^T X)^{-1} X^T y\), where \(X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}\) and \(y = (y_1, \dots, y_n) \in \mathbb{R}^n\) collect all of the training inputs and labels into a single matrix and vector, and where \(X^{\dagger}\) is the pseudoinverse of \(X\).
In the event that \(X^T X \in \mathbb{R}^{d \times d}\) is invertible (which is typically true when \(d \ll n\), although it may fail when there is a lot of redundancy in the features and the columns of \(X\) are collinear), this corresponds to the unique minimizer of the above optimization problem.
Intuitively, this chooses the linear function that most closely approximates the labels of the samples, though it typically will not fit them perfectly.</p>
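As a quick sanity check (a minimal numpy sketch of my own, not from the paper), the closed-form expression matches what a generic least-squares solver returns in the classical regime \(d \ll n\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                       # classical regime: many more samples than features
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Closed form: w_OLS = (X^T X)^{-1} X^T y  (X^T X is invertible here w.h.p.)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_closed, w_lstsq)  # both give the unique OLS minimizer
```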
<p>As discussed in my blog posts on <a href="/2021/07/05/bhx19.html" target="_blank">BHX19</a> and <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, this works well in the under-parameterized regime, but it’s not directly obvious how one should choose the best parameter vector \(w_{OLS}\) in the over-parameterized regime \(d \gg n\), since there are many parameter vectors that result in zero training error.
These papers choose the vector by considering <em>minimum-norm interpolation</em> as the high-dimensional analogue of OLS. This entails solving the following optimization problem, which relies on \(XX^T \in \mathbb{R}^{n \times n}\) being invertible (typically the case when \(d \gg n\)):
\[w_{OLS} \in \arg\min_{w \in \mathbb{R}^d} \|w\|_2 \ \text{such that} \ w^T x_i = y_i \ \forall i \in [n].\]
<p>In other words, it chooses the hyperplane with the smallest weight norm (which we can think of as the “smoothest” hyperplane or the hyperplane with smallest slope) that perfectly fits the data. Conveniently, this hyperplane is also found by using the pseudo-inverse of \(X\): \(w_{OLS} = X^{\dagger} y = X^T (X X^T)^{-1} y\).
As a result, we (and numerous papers that consider over-parameterized linear models) consider this minimum-norm interpolation problem to be the high-dimensional version of OLS, which allows OLS to be defined with the same pseudo-inverse solution for all choices of \(n\) and \(d\).</p>
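In code (my own sketch, not from the paper), the same pseudo-inverse formula now returns the minimum-norm interpolant when \(d \gg n\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 500                      # over-parameterized regime: d >> n
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Minimum-norm interpolation: w_OLS = X^T (X X^T)^{-1} y = X^† y
w = X.T @ np.linalg.solve(X @ X.T, y)

assert np.allclose(X @ w, y)                   # interpolates the training data exactly
assert np.allclose(w, np.linalg.pinv(X) @ y)   # agrees with the pseudo-inverse solution
# Any other interpolant w + v, with v in the null space of X, has strictly larger norm.
```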
<p>Notably, these other papers show that high-dimensional OLS can have good generalization under certain distributional assumptions, despite the fact that classical generalization bound approaches (like VC-dimension) suggest that models with more parameters than samples are likely to fail.
These results are a big part of the inspiration for this project and motivate the study of high-dimensional linear regression.</p>
<h3 id="support-vector-machines">Support vector machines</h3>
<p>SVMs solve a classification problem rather than a regression problem, which means that a training sample \((x_i, y_i)\) can be thought of as belonging to \(\mathbb{R}^d \times \{-1, 1\}\).
Here, the goal is to learn a linear classifier of the form \(x \mapsto \text{sign}(w_{SVM}^T x)\) that <em>decisively</em> classifies every training sample.
That is, we want \(y_i w_{SVM}^T x_i\) to be bounded away from zero for every sample \((x_i, y_i)\).
This follows the same motivation as the generalization bounds on <a href="/2021/10/20/boosting.html" target="_blank">boosting the margin</a>; decisively categorizing each training sample makes it hard for the chosen function to be corrupted by the variance of the training data. It also requires the assumption that the training data are linearly separable.</p>
<p>This high-level goal for a classifier (called the <em>hard-margin SVM</em>) can be encoded as the following optimization problem, which asks that \(w_{SVM}\) be the lowest-magnitude classifier that separates the samples from the decision boundary by distance at least one:</p>
\[w_{SVM} \in \arg\min_{w \in \mathbb{R}^d} \|w\|_2 \ \text{such that} \ y_i w^T x_i \geq 1 \ \forall i \in [n].\]
<p>By stealing an image from my past blog post, we can visualize the classifier that maximizes the margin.</p>
<p><img src="/assets/images/2021-10-28-cl20/margin.jpeg" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" /></p>
<p>A key feature of SVMs is that the classifier can also be defined by a subset of the training samples, the ones that lie exactly on the margin, i.e. those with \(w_{SVM}^T x_i = y_i\).
These are called the <em>support vectors</em>.
If \(x_1, \dots, x_k\) are the support vectors of \(w_{SVM}\), then \(w_{SVM} = \sum_{i=1}^k \alpha_i x_i\) for some \(\alpha \in \mathbb{R}^k\). Traditionally, bounds on the generalization powers of SVMs depend on the number of support vectors: fewer support vectors means an intrinsically “simpler” model, which indicates a higher likelihood that the model is robust and generalizes well to new data.</p>
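To make the margin picture concrete, here is a small scikit-learn sketch of my own (not from the paper). Taking the soft-margin penalty \(C\) very large makes <code>SVC</code> approximate the hard-margin SVM; note that <code>SVC</code> also fits an intercept, which the formulation above omits. The support vectors are exactly the samples whose margin constraint is tight:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters, so the data are linearly separable.
X = np.vstack([0.4 * rng.standard_normal((10, 2)) + 2.0,
               0.4 * rng.standard_normal((10, 2)) - 2.0])
y = np.array([1] * 10 + [-1] * 10)

# A very large C makes the soft-margin SVC approximate the hard-margin SVM.
svm = SVC(kernel="linear", C=1e10).fit(X, y)
w, b = svm.coef_.ravel(), svm.intercept_[0]

margins = y * (X @ w + b)
print(svm.n_support_.sum())  # number of support vectors: far fewer than n here
print(margins.min())         # ≈ 1: support vectors sit exactly on the margin
```

In low dimensions, only a handful of points end up on the margin; the next section is about the opposite extreme, where every point does.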
<h3 id="support-vector-proliferation-or-ols--svm">Support vector proliferation, or OLS = SVM</h3>
<p>By looking back at the two optimization problems for high-dimensional OLS and SVMs, the two are actually extremely similar.
In the case where the OLS problem has binary labels \(\{-1, 1\}\), the two are exactly the same, except that the SVM problem has inequality constraints and OLS has equality.
Therefore, if the optimal SVM solution \(w_{SVM}\) satisfies every inequality constraint with equality, then \(w_{SVM} = w_{OLS}\).
Because a constraint is satisfied with equality if and only if the corresponding sample is a support vector, \(w_{SVM} = w_{OLS}\) if and only if every training sample is a support vector.
We call this phenomenon <em>support vector proliferation</em> (SVP) and explore it as the primary goal of our paper.
Our contributions involve studying when SVP occurs and when it does not, which has implications for SVM generalization and the high-dimensional behavior of both models.</p>
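For a concrete look at SVP, here is a numerical sketch of my own (not from the paper): it solves the hard-margin SVM through its dual quadratic program with scipy, with no intercept term so as to match the formulations above, and compares the result to the minimum-norm interpolant. Well above the threshold, every sample becomes a support vector and the two solutions coincide.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 30, 2000                     # d far above ~2 n log n, so SVP is very likely
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Minimum-norm interpolation (high-dimensional OLS).
w_ols = X.T @ np.linalg.solve(X @ X.T, y)

# Hard-margin SVM via its dual QP (no intercept, matching the primal above):
#   max_{alpha >= 0}  sum_i alpha_i - (1/2) alpha^T Q alpha,  Q_ij = y_i y_j <x_i, x_j>
Q = (y[:, None] * X) @ (y[:, None] * X).T
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(n), jac=lambda a: Q @ a - np.ones(n),
               method="L-BFGS-B", bounds=[(0, None)] * n)
w_svm = X.T @ (res.x * y)           # primal solution: w = sum_i alpha_i y_i x_i

print(np.sum(res.x > 1e-8))         # active dual variables; equal to n under SVP
print(np.linalg.norm(w_svm - w_ols) / np.linalg.norm(w_ols))  # ≈ 0 under SVP
```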
<h3 id="why-care-about-svp-and-what-is-known">Why care about SVP and what is known?</h3>
<p>The study of support vector proliferation has previously provided bounds on generalization behavior of high-dimensional (or over-parameterized) SVMs, and our tighter understanding of the phenomenon will make future bounds easier.
In particular, the paper <a href="/2021/11/04/mnsbhs20.html" target="_blank">MNSBHS20</a> (which includes Daniel as an author) bounds the generalization of high-dimensional SVMs by (1) using SVP to relate SVMs to OLS and (2) showing that OLS with binary outputs has favorable generalization guarantees under certain distributional assumptions, similar to those of <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>.
Specifically, they show that SVP occurs roughly when \(d = \Omega(n^{3/2} \log n)\) for the case where the variances of each feature are roughly the same.</p>
<p>This paper does not answer how tight the phenomenon is, leaving open the question of when (as a function of \(d\) and \(n\)) will SVP occur and when will it not.
This question was partially addressed in a follow-up paper, <a href="https://arxiv.org/abs/2009.10670" target="_blank">HMX21</a> by Daniel, Vidya Muthukumar, and Mark Xu.
They show, roughly, that SVP occurs (for a broad family of data distributions) when \(d = \Omega(n \log n)\) and that it does not occur (for a narrow family of distributions) when \(d = O(n)\), leaving open a logarithmic gap.
Our paper closes this gap and considers a broader family of data distributions.</p>
<p>SVM generalization has also been targeted by <a href="/2021/10/28/cl20.html" target="_blank">CL20</a> and others using approaches that rely not on SVP but on the relationship between SVMs and gradient descent. Specifically, they rely on a result of <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a>, which shows that gradient descent applied to the logistic loss converges to a maximum-margin classifier.
This heightens the relevance of support vector machines, since more sophisticated models may trend towards the solutions of hard-margin SVMs when trained with gradient methods.
Thus, our exploration of SVP and how it relates minimum-norm and maximum-margin models may have insights about the high-dimensional behavior of other learning algorithms that rely on implicit regularization.</p>
<h2 id="what-do-we-prove">What do we prove?</h2>
<p>Before jumping into our results, we introduce our data model and explain what HMX21 already explained in that setting.</p>
<h3 id="data-model">Data model</h3>
<p>We consider two settings each of which have independent random features for each \(x_i\) and fixed labels \(y_i\).</p>
<p><strong>Isotropic Gaussian sample:</strong> For fixed \(y_1, \dots, y_n \in \{-1, 1\}\), each sample \(x_1, \dots, x_n \in \mathbb{R}^d\) is drawn independently from a multivariate spherical (or isotropic or standard) Gaussian \(\mathcal{N}(0, I_d)\).</p>
<p><strong>Anisotropic subgaussian sample:</strong> For fixed \(y_1, \dots, y_n \in \{-1, 1\}\), each sample \(x_i\) is defined to be \(x_i = \Sigma^{1/2} z_i\), where each \(z_i\) is drawn independently from a 1-subgaussian distribution with mean zero and \(\Sigma\) is a diagonal covariance matrix with entries \(\lambda_1 > \dots > \lambda_d\). Hence, \(\mathbb{E}[x_i] = 0\) and \(\mathbb{E}[x_i x_i^T] = \Sigma\).</p>
<p>If the latter model has a Gaussian distribution, then \(\Sigma\) can be permitted to be any positive definite covariance matrix with eigenvalues \(\lambda_1, \dots, \lambda_d\) due to the rotational symmetry of the Gaussian.</p>
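Here is a minimal sketch of the two samplers (my own illustration, not from the paper; Rademacher coordinates are one convenient 1-subgaussian choice, and the decaying spectrum is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 500
y = rng.choice([-1.0, 1.0], size=n)   # labels are arbitrary but fixed

# Isotropic Gaussian sample: x_i ~ N(0, I_d).
X_iso = rng.standard_normal((n, d))

# Anisotropic subgaussian sample: x_i = Sigma^{1/2} z_i with Rademacher z_i
# and a diagonal Sigma with decreasing variances lambda_1 > ... > lambda_d.
lam = 1.0 / np.arange(1, d + 1)
Z = rng.choice([-1.0, 1.0], size=(n, d))
X_aniso = Z * np.sqrt(lam)            # diagonal Sigma^{1/2} acts coordinate-wise

assert np.allclose(X_aniso ** 2, lam)  # coordinate j has second moment exactly lambda_j
```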
<p>We consider the regime \(d \gg n\) in order to ensure that the data are linearly separable with extremely high probability, which is acceptable because the paper is focused on the study of the over-parameterized regime.</p>
<p>The anisotropic data model requires using dimension proxies rather than \(d\) on occasion, because the rapidly decreasing variances could cause the data to have a much smaller effective dimension. (Similar notions are explored in HMX21 and over-parameterization papers like BLLT19.)
We use two notions of effective dimension: \(d_\infty = \frac{\|\lambda\|_1}{\|\lambda\|_\infty}\) and \(d_2 = \frac{\|\lambda\|_1^2}{\|\lambda\|_2^2}\). Note that \(d_\infty \leq d_2 \leq d\).</p>
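For intuition (my own sketch), both effective dimensions are easy to compute from the spectrum, and they collapse to \(d\) in the isotropic case:

```python
import numpy as np

def d_inf(lam):
    return lam.sum() / lam.max()                # ||lambda||_1 / ||lambda||_inf

def d_2(lam):
    return lam.sum() ** 2 / (lam ** 2).sum()    # ||lambda||_1^2 / ||lambda||_2^2

iso = np.ones(500)                              # isotropic spectrum
decaying = 1.0 / np.arange(1, 501)              # fast-decaying spectrum

assert d_inf(iso) == d_2(iso) == 500            # both equal d when variances are equal
assert d_inf(decaying) <= d_2(decaying) <= 500  # d_inf <= d_2 <= d in general
```

A fast-decaying spectrum can make both effective dimensions much smaller than the ambient \(d\), which is exactly why the theorems below are stated in terms of \(d_\infty\) and \(d_2\).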
<h3 id="contributions-of-hmx21">Contributions of HMX21</h3>
<p>HMX21 proves two bounds: an upper bound on the SVP threshold for an anisotropic subgaussian sample and a lower bound on the SVP threshold for an isotropic gaussian sample.</p>
<p><em><strong>Theorem 1</strong> [HMX21]: For an anisotropic subgaussian sample, if \(d_\infty = \Omega(n \log n)\), then SVP occurs with probability at least \(0.9\).</em></p>
<p><em><strong>Theorem 2</strong> [HMX21]: For an isotropic Gaussian sample, if \(d = O(n)\), then SVP occurs with probability at most \(0.1\).</em></p>
<p>This leaves open two obvious technical questions, which we resolve: closure of the \(n\) vs \(n \log n\) gap and generalization of Theorem 2 to handle the anisotropic subgaussian data model. We give these results, and a few others about more precise thresholds, in the next few sections.</p>
<h3 id="result-1-closing-the-gap-for-the-isotropic-gaussian-case">Result #1: Closing the gap for the isotropic Gaussian case</h3>
<p>We close the gap between the two HMX21 bounds by showing that the critical SVP threshold occurs at \(\Theta(n \log n)\).
The following is a simplified version of our Theorem 3, which is presented in full generality in the next section.</p>
<p><em><strong>Theorem 3</strong> [Simplified]: For an isotropic Gaussian sample, if \(d = O(n \log n)\) and \(n\) is sufficiently large, then SVP occurs with probability at most \(0.1\).</em></p>
<p>In the version given in the paper, there is also a \(\delta\) variable to represent the probability of SVP occurring; for simplicity, we leave this out of the bound in the blog post.</p>
<p>We’ll discuss key components of the proof of this theorem later on in the blog post.</p>
<h3 id="result-2-extending-the-lower-bound-to-the-anisotropic-subgaussian-case">Result #2: Extending the lower bound to the anisotropic subgaussian case</h3>
<p>Our version of Theorem 3 further extends Theorem 2 to the anisotropic subgaussian data model, at the cost of some more complexity.</p>
<p><em><strong>Theorem 3</strong> : For an anisotropic subgaussian sample, if \(d_2 = O(n \log n)\), \({d_\infty^2}/{d_2} = {\|\lambda\|_2^2}/{\|\lambda\|_\infty^2} = \Omega(n)\), and \(n\) is sufficiently large, then SVP occurs with probability at most \(0.1\).</em></p>
<p>The second condition ensures that the effective number of high-variance features is at least on the order of \(n\). If it were not, then a very small number of features would have an outsize influence on the outcome of the problem, making it effectively a low-dimensional problem where the data are unlikely even to be linearly separable.</p>
<p>The first condition is slightly loose in the event that \(d_2 \gg d_\infty\), since Theorem 1 depends on \(d_\infty\) rather than \(d_2\).</p>
<h3 id="result-3-proving-a-sharp-threshold-for-the-isotropic-gaussian-case">Result #3: Proving a sharp threshold for the isotropic Gaussian case</h3>
<p>Returning to the simple isotropic Gaussian regime, we show a clear threshold in the regime where \(n\) and \(d\) become arbitrarily large. Theorem 4 shows that the phase transition occurs precisely when \(d = 2n \log n\) in the asymptotic case. Check out the paper for a rigorous asymptotic statement and a proof that depends on the maximum of weakly dependent Gaussian variables.</p>
<p><em>Note: One nice thing about working with a statistician is that we have different flavors of bounds that we like to prove. As a computer scientist, I’m accustomed to proving Big-\(O\) and Big-\(\Omega\) bounds for finite \(n\) and \(d\) in Theorem 3, while hiding foul constants behind the asymptotic notation. On the other hand, Navid is more interested in the kinds of sharp trade-offs that occur in infinite limits, like those in Theorem 4.
Our collaboration meant we featured both!</em></p>
<h3 id="result-4-suggesting-the-threshold-extends-beyond-that-case">Result #4: Suggesting the threshold extends beyond that case</h3>
<p>While we only prove the location of the precise threshold and the convergence to that threshold for the isotropic Gaussian regime, we believe that it persists for a broad class of data distributions, including some that are not subgaussian. Our Figure 1 illustrates this universality by plotting the fraction of trials on synthetic data in which support vector proliferation occurs when the samples are drawn from each type of distribution.</p>
<p><img src="/assets/images/2021-12-07-ash21/univ.png" alt="" /></p>
<h3 id="conjecture-generalization-to-l_1-and-l_p-models">Conjecture: Generalization to \(L_1\) (and \(L_p\)) models</h3>
<p>We conclude by generalizing the SVM vs OLS problem to different norms and making a conjecture that the SVP threshold occurs when \(d\) is much larger for the \(L_1\) case. For the sake of time, that’s all I’ll say about it here, but check out the paper to see our formulation of the question and some supporting empirical results.</p>
<h2 id="proof-of-result-1">Proof of Result #1</h2>
<p>I’ll conclude the post by briefly summarizing the foundations of our proof of the simplified version of Theorem 3. This was an adaptation of the techniques employed by HMX21 to prove Theorem 2, but it required a more careful approach to handle the lack of independence among a collection of random variables.</p>
<h3 id="equivalence-lemma">Equivalence lemma</h3>
<p>Both papers rely on the same “leave-one-out” equivalence lemma for their upper and lower bounds. We prove a more general version in our paper based on geometric intuition, but I give only the simpler one here.</p>
<p>Let \(y_{\setminus i} = (y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_n) \in \mathbb{R}^{n-1}\) and \(X_{\setminus i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) \in \mathbb{R}^{(n-1) \times d}\).</p>
<p><em><strong>Lemma 1:</strong> Every training sample is a support vector (i.e. SVP occurs and OLS=SVM) if and only if \(u_i := y_i y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i} x_i < 1\) for all \(i \in [n]\).</em></p>
<p>This lemma looks a bit ugly as is, so let’s break it down and explain informally how these \(u_i\) quantities connect to whether a sample is a support vector.</p>
<ul>
<li>Noting that \(X_{\setminus i}^\dagger = X_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1}\) when \(d \gg n\) and referring back to the optimization problems from before, we can let \(w_{OLS, i}^T := y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i}\), because it is the (transposed) parameter vector obtained by running OLS on all of the training data except \((x_i, y_i)\).</li>
<li>Then, \(u_i = y_i x_i^T w_{OLS, i}\), which is the margin of the “leave-one-out” regression on the left out sample.</li>
<li>If \(u_i \geq 1\), then the OLS classifier on the other \(n-1\) samples already classifies \(x_i\) correctly by at least a unit margin. If \(w_{OLS, i} = w_{SVM, i}\), then it suffices to take \(w_{SVM} = w_{OLS, i}\) without adding a new support vector for \(x_i\) and without increasing the cost of the objective. Hence, the condition certifies that not every sample needs to be a support vector.</li>
<li>If \(u_i < 1\), then \(x_i\) is not classified to a unit margin by \(w_{OLS, i}\). Therefore, adding \((x_i, y_i)\) back into the training set requires modifying the parameter vector; the modified vector then depends on \(x_i\), making \(x_i\) a support vector of \(w_{SVM}\).</li>
</ul>
<p>The remainder of the proof involves considering \(\max_i u_i\) and asking how large it must be.</p>
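<p>To make Lemma 1 concrete, here is a small numerical sketch (my own illustration, not code from the paper): draw isotropic Gaussian data with \(d\) well above the \(2n \log n\) threshold, compute each leave-one-out quantity \(u_i\), and check that they all fall below 1, so that SVP occurs.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10_000  # d is far above 2 n log n (about 391), so SVP should occur

X = rng.standard_normal((n, d))      # isotropic Gaussian features
y = rng.choice([-1.0, 1.0], size=n)  # random labels

def leave_one_out_quantities(X, y):
    """u_i = y_i * <w_OLS_without_i, x_i>, the criterion from Lemma 1."""
    u = np.empty(len(y))
    for i in range(len(y)):
        Xi = np.delete(X, i, axis=0)
        yi = np.delete(y, i)
        # Min-norm interpolator on the data without sample i:
        #   w = Xi^T (Xi Xi^T)^{-1} yi
        w = Xi.T @ np.linalg.solve(Xi @ Xi.T, yi)
        u[i] = y[i] * (X[i] @ w)
    return u

u = leave_one_out_quantities(X, y)
print(u.max())  # well below 1, so every sample is a support vector
```

The heuristic derived below predicts \(\max_i u_i \approx \sqrt{2 n \log n / d} \approx 0.2\) for these parameters, which matches what this experiment reports.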
<h3 id="assuming-independence">Assuming independence</h3>
<p>In the Gaussian setting where \(X_{\setminus i}\) is fixed, \(u_i\) is a univariate Gaussian random variable of mean 0 and variance \(y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i} X_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1}y_{\setminus i} = y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} y_{\setminus i}\).</p>
<p>Because \(\mathbb{E}[x_j^T x_j] = d\), it follows that \(\mathbb{E}[X_{\setminus i} X_{\setminus i}^T] = d I_{n-1}\) and that the eigenvalues of \(X_{\setminus i} X_{\setminus i}^T\) are concentrated around \(d\) with high probability.
As a result, the eigenvalues of \((X_{\setminus i} X_{\setminus i}^T)^{-1}\) are concentrated around \(1/d\), and the variance of \(u_i\) is roughly \(\frac{1}{d} \|y_{\setminus i}\|_2^2 = \frac{n-1}{d}\).</p>
<p>If we assume for the sake of simplicity that \(u_i\) are all independent of one another, then the problem becomes easy to characterize.
It’s well known that the maximum of \(n\) independent Gaussians of variance \(\sigma^2\) concentrates around \(\sigma \sqrt{2 \log n}\).
Hence, \(\max_i u_i\) will be roughly \(\sqrt{2(n-1)\log(n) / d}\) with high probability.
If \(d = \Omega(n \log n)\) (with a large enough constant), then \(\max_i u_i < 1\) with high probability and SVP occurs; if \(d = O(n \log n)\) (with a small enough constant), then SVP occurs with vanishingly small probability.</p>
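<p>A quick Monte Carlo sanity check of the \(\sigma \sqrt{2 \log n}\) heuristic for the maximum of independent Gaussians (a sketch of my own, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100_000, 50

# Empirical mean of the max of n iid standard Gaussians vs the sqrt(2 log n) heuristic
maxima = rng.standard_normal((trials, n)).max(axis=1)
prediction = np.sqrt(2 * np.log(n))
print(maxima.mean(), prediction)  # the two agree to within roughly 10% at this n
```

The lower-order correction to the maximum decays like \(\log \log n / \sqrt{\log n}\), so the agreement sharpens only slowly as \(n\) grows.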
<h3 id="overcoming-dependence">Overcoming dependence</h3>
<p>The key problem with the above paragraphs is that the random variables \(u_1, \dots, u_n\) are <em>not</em> independent of one another. They all depend on all of the data \(x_1, \dots, x_n\), and the core technical challenge of this result is to tease apart this dependence.
To do so, we rely on the fact that \(X_{\setminus i} X_{\setminus i}^T \approx d I_{n-1}\) and define a subsample of \(m \ll n\) points to force an independence relationship.
Specifically, we rely on the decomposition \(u_i = u^{(1)}_i + u^{(2)}_i + u^{(3)}_i\) for \(i \in [m]\) where:</p>
<ol>
<li>\(u^{(1)}_i = y_i y_{\setminus i}^T((X_{\setminus i} X_{\setminus i}^T)^{-1} - \frac1d I_{n-1}) X_{\setminus i} x_i\) represents the gap between the gram matrix \(X_{\setminus i} X_{\setminus i}^T\) and the identity.</li>
<li>\(u^{(2)}_i = \frac1d y_i y_{[m] \setminus i}^T X_{[m] \setminus i} x_i\) is the component of the remaining term (\(\frac1d y_i y_{\setminus i}^T X_{\setminus i} x_i\)) that depends exclusively on the subsample \([m]\).</li>
<li>\(u^{(3)}_i = \frac1d y_i y_{\setminus [m]}^T X_{\setminus [m]} x_i\) is the component that depends only on \(x_i\) and on samples <em>outside</em> the subsample. Critically, \(u^{(3)}_1, \dots, u^{(3)}_m\) are independent, conditioned on the data outside the subsample, \(X_{\setminus [m]}\).</li>
</ol>
<p>To show that SVP occurs with very small probability, we must show that \(\max_i u_i \geq 1\) with high probability.
To do so, it’s sufficient to show that (1) for all \(i\), \(|u^{(1)}_i| \leq 1\); (2) for all \(i\), \(|u^{(2)}_i| \leq 1\); and (3) \(\max_i u^{(3)}_i \geq 3\). The main technical lemmas of the paper apply Gaussian concentration inequalities to prove (1) and (2), and leverage the independence of the \(u^{(3)}_i\)’s to prove that their maximum is sufficiently large.</p>
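<p>The decomposition \(u_i = u^{(1)}_i + u^{(2)}_i + u^{(3)}_i\) is an exact algebraic identity, which is easy to verify numerically (a quick sketch of my own):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 40, 5_000, 8
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

i = 0  # any index in the subsample [m]
mask = np.ones(n, dtype=bool)
mask[i] = False
Xi, yi = X[mask], y[mask]
G_inv = np.linalg.inv(Xi @ Xi.T)  # inverse Gram matrix without sample i

u = y[i] * yi @ G_inv @ (Xi @ X[i])
# u1: gap between the inverse Gram matrix and (1/d) I
u1 = y[i] * yi @ (G_inv - np.eye(n - 1) / d) @ (Xi @ X[i])
sub = mask.copy(); sub[m:] = False   # indices in [m] \ {i}
out = mask.copy(); out[:m] = False   # indices outside the subsample [m]
u2 = y[i] / d * y[sub] @ (X[sub] @ X[i])
u3 = y[i] / d * y[out] @ (X[out] @ X[i])
print(u, u1 + u2 + u3)  # identical up to floating-point error
```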
<p>This requires somewhat more advanced techniques, such as the <a href="https://en.wikipedia.org/wiki/Berry%E2%80%93Esseen_theorem" target="_blank">Berry-Esseen theorem</a>, for the subgaussian case.</p>
<h2 id="whats-next">What’s next?</h2>
<p>We think the significance of this result lies in tying together seemingly dissimilar ML models through their behavior in over-parameterized settings. Immediate follow-ups include investigations into the generalized \(L_p\) SVM and OLS models, but further research could also proceed along the lines of <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a>, by connecting “classical” ML models (like maximum-margin models) to the implicit regularization behavior of more complex models.</p>
<p>Thanks for reading this post! If you have any questions or thoughts (or ideas about what I should write about), please share them with me.</p>Clayton SanfordHello, it’s been a few weeks since I finished my candidacy exam, and I’m looking forward to getting back to blogging on a regular basis. I’m planning on focusing primarily on summarizing others’ works and discussing what I find interesting in the literature, but I periodically want to share my own papers and explain them less formally. I did this a few months ago for first grad student paper on the approximation capabilities of depth-2 random-bottom-layer neural networks HSSV21.My candidacy exam is done!2021-11-17T00:00:00+00:002021-11-17T00:00:00+00:00http://blog.claytonsanford.com/2021/11/17/candidacy<p>After working my way through all thirty papers from my <a href="/2021/07/04/candidacy-overview.html" target="_blank">list</a>, I finally took (and passed) my candidacy exam yesterday! I suppose this means that I can finally call myself a PhD candidate, rather than a PhD student.</p>
<p>Anyways, <a href="/assets/files/candidacy-slides.pdf" target="_blank">here</a> are the slides I made for the presentation. At the end, there are thirty appendix slides, each of which gives a one-slide summary of a paper from the list.</p>
<p>I’m planning on continuing to blog from here, but not in such a structured fashion as I did with my OPML series.
I’m considering having some kind of weekly newsletter that gives quick recaps of papers I’ve read and things I find interesting, along with periodic longer posts about particularly neat papers, my own work, or personal stuff.
Thanks for reading, and stay tuned!</p>Clayton SanfordAfter working my way through all thirty papers from my list, I finally took (and passed) my candidacy exam yesterday! I suppose this means that I can finally call myself a PhD candidate, rather than a PhD student.[OPML#10] MNSBHS20: Classification vs regression in overparameterized regimes: Does the loss function matter?2021-11-04T00:00:00+00:002021-11-04T00:00:00+00:00http://blog.claytonsanford.com/2021/11/04/mnsbhs20<!-- [[OPML#10]](/2021/11/04/mnsbhs20.html){:target="_blank"} -->
<p><em>This is the tenth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Once again, we discuss a paper that shows how hard-margin support vector machines (SVMs) (or maximum-margin linear classifiers) can experience benign overfitting when the learning problem is over-parameterized.
The paper, <a href="https://arxiv.org/abs/2005.08054" target="_blank">“Classification vs regression in overparameterized regimes: Does the loss function matter?”</a>, was written by a Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu (my advisor!), and Anant Sahai.</p>
<p>While the kinds of results are similar to the ones discussed in <a href="/2021/10/28/cl20.html" target="_blank">last week’s post</a>, the methodology is quite different. Rather than studying the properties of the iterates of gradient descent, this paper shows that minimum-norm linear regression and SVMs coincide in the over-parameterized regime and shows that the models behave similarly in those cases; this phenomenon is known as <em>support vector proliferation</em> and discussed in depth by <a href="https://arxiv.org/abs/2009.10670" target="_blank">a follow-up paper by Daniel, Vidya, and Ji (Mark) Xu</a> and by <a href="https://arxiv.org/abs/2105.14084" target="_blank">my NeurIPS paper with Navid Ardeshir and Daniel</a>.</p>
<p>To make the point, the paper considers a narrow regime of data distributions and categorizes those distributions to determine (1) when the outputs of OLS regression and SVM classification coincide and (2) when each of those have favorable generalization error as the number of samples \(n\) and dimension \(d\) trend towards infinity.
We introduce their <em>bilevel ensemble</em> input distribution and their <em>1-sparse linear model</em> for determining labels.
Their results show that under similar conditions to those explored in BLLT19, benign overfitting is possible for classification algorithms like SVMs.
Indeed, for their distributional assumptions, benign overfitting is more common for classification than regression.</p>
<h2 id="ols-and-svm">OLS and SVM</h2>
<p>A key part of this paper’s story relies on the coincidence of support vector machines for classification and ordinary least squares for regression.
We introduce the two models and clarify why one might expect them to have similar solutions for the high-dimensional setting.</p>
<p>From last week, we define the hard-margin SVM classifier to be \(x \mapsto \text{sign}(\langle w_{SVM}, x\rangle)\) where</p>
\[w_{SVM} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}\).
This classifier maximizes the margins of linearly separable training data.
Notably, a training sample \((x_i, y_i)\) is a <em>support vector</em> if \(\langle w_{SVM}, x_i\rangle = y_i\), which means that \(x_i\) lies exactly on the margin and is as close as possible to the linear separator.
The hypothesis \(w_{SVM}\) can be alternatively represented as a linear combination of support vectors, which means that all samples not on the margin are irrelevant to the SVM classifier vector.
Traditionally, favorable generalization properties for SVMs are shown for the cases where the number of support vectors is small, which implies some degree of “simplicity” in the model.</p>
<p>If the model is over-parameterized (i.e. \(d > n\)), we define the <em>minimum-norm ordinary least squares (OLS) regression</em> predictor to be \(x \mapsto \langle w_{OLS}, x\rangle\) where</p>
\[w_{OLS} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } \langle w, x_i\rangle = y_i, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}\).
The two are the same, except that the labels are \(\{-1, 1\}\) for SVM and \(\mathbb{R}\) for OLS and that the inequality constraints of the former are replaced by equalities in the latter.</p>
<p>Sufficient conditions for benign overfitting for OLS has been explored in past blog posts, like the ones on <a href="/2021/07/05/bhx19.html" target="_blank">BHX19</a>, <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, and <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a>.
Conditions for SVMs were explored in <a href="/2021/10/28/cl20.html" target="_blank">CL20</a>.
This paper unifies the two by showing cases where \(w_{OLS} = w_{SVM}\) and transfers the benign overfitting results from OLS to SVMs.</p>
<p>If we assume that both problems (regression and classification) have \(\{-1, 1\}\) labels, then \(w_{OLS} = w_{SVM}\) is implied by having \(\langle w_{SVM}, x_i\rangle = y_i\) for all \(i\), which means that every sample is a support vector.
This is the support vector proliferation phenomenon briefly discussed before.</p>
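<p>Here is a small experiment (my own sketch, not from the paper) that checks the coincidence directly: compute the min-norm OLS interpolator in closed form, solve the hard-margin SVM via projected gradient ascent on its dual, and confirm that the two weight vectors match when \(d \gg n \log n\):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 5_000
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Min-norm OLS interpolator: w = X^T (X X^T)^{-1} y
w_ols = X.T @ np.linalg.solve(X @ X.T, y)

# Hard-margin SVM via projected gradient ascent on its dual:
#   max_{a >= 0}  sum(a) - (1/2) a^T Q a,  with  Q = (y y^T) * (X X^T),
# then w_svm = X^T (a * y).
Q = np.outer(y, y) * (X @ X.T)
a = np.zeros(n)
step = 1.0 / np.linalg.eigvalsh(Q).max()
for _ in range(5_000):
    a = np.maximum(0.0, a + step * (1.0 - Q @ a))
w_svm = X.T @ (a * y)

rel_diff = np.linalg.norm(w_ols - w_svm) / np.linalg.norm(w_ols)
print(rel_diff, (y * (X @ w_svm)).min())  # near zero, and every margin is ~1
```

Every dual variable ends up strictly positive here, i.e. every sample is a support vector: exactly the support vector proliferation regime.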
<h2 id="data-model">Data model</h2>
<p>They prove their results over a simple data distribution, which is a special case of the distributions explored by BLLT19.
Specifically, they consider <em>bilevel Gaussian ensembles</em>: the features \(x_i\) are drawn independently from a Gaussian distribution with diagonal covariance matrix \(\Sigma\) whose diagonal entries \(\lambda_1, \dots, \lambda_d\), for \(d = n^p\), satisfy</p>
\[\lambda_j = \begin{cases}
n^{p - r - q} & j \leq n^r \\
\frac{1 - n^{-q}}{1 - n^{r - p}} & j > n^r
\end{cases}\]
<p>for \(p > 1\), \(r \in (0, 1)\), and \(q \in (0, p-r)\).
It’s called a bilevel ensemble because the first \(n^r\) coordinates are drawn from higher variance normal distributions than the remaining \(n^p - n^r\) coordinates. A few notes on this model:</p>
<ul>
<li>Because \(p > 1\), \(d = \omega(n)\) and the model is always over-parameterized.</li>
<li>\(r\) governs the number of high-importance features. Because \(r < 1\), there must always be a sublinear number of high-importance features.</li>
<li>If \(q\) were permitted to be \(p - r\), then the model would be spherical or isotropic and have \(\lambda_j = 1\) for all \(j\). On the other hand, if \(q = 0\), \(\lambda_j = 0\) for \(j > n^r\) and all of the variance would be on the first \(n^r\) features. Thus, \(q\) modulates how much more variance the high-importance features have than the low-importance features.</li>
<li>
<p>The variances are normalized to have their \(L_1\) norms always be \(d = n^p\):</p>
\[\|\lambda\|_1 = \sum_{j=1}^{n^p} \lambda_j = n^{p} \cdot n^{-q} + \frac{(n^p - n^r)(1 - n^{-q})}{1 - n^{r-p}} = n^{p} \cdot n^{-q} + n^p (1 - n^{-q}) = n^p.\]
</li>
<li>
<p>We can compute the effective dimension terms used in the BLLT19 paper:</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} = \begin{cases}
\frac{(n^r - k)n^{p-r-q} + n^p (1 - n^{-q})}{n^{p-r-q}} &= \Theta(n^{r + q}) & k < n^r \\
n^p - k & & k \geq n^r.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2}= \begin{cases}
\Theta(\min(n^p, n^{r + 2q})) & k < n^r \\
n^p - k & k \geq n^r.
\end{cases}\]
</li>
</ul>
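<p>The normalization and the effective ranks are easy to verify numerically. Below is a small sketch (the parameter values are my own choices) that builds a bilevel spectrum and checks \(\|\lambda\|_1 = n^p\) and \(r_0(\Sigma) = n^{r+q}\):</p>

```python
import numpy as np

n, p, r, q = 100, 1.5, 0.5, 0.4  # arbitrary valid choices: p > 1, r < 1, 0 < q < p - r
d, k = round(n**p), round(n**r)  # d = n^p = 1000 features, k = n^r = 10 high-variance ones

lam = np.empty(d)
lam[:k] = n**(p - r - q)                    # high-importance level
lam[k:] = (1 - n**(-q)) / (1 - n**(r - p))  # low-importance level

r0 = lam.sum() / lam.max()  # effective rank r_0(Sigma)
print(lam.sum(), d)         # ||lambda||_1 equals n^p
print(r0, n**(r + q))       # r_0(Sigma) equals n^(r+q)
```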
<p><img src="/assets/images/2021-11-04-mnsbhs20/lambdas.jpeg" alt="" /></p>
<p>The labels \(y\) are chosen with the <em>1-sparse linear model</em>, which only considers one of the coordinates. That is, for some \(t \leq n^r\), we let \(w^* = \lambda_t^{-1/2} e_t\), where \(e_t \in \mathbb{R}^d\) is the vector that is all zeroes, except for a one at index \(t\).
Note that \(\|w^*\|^2 = \lambda_t^{-1} = n^{r+q - p}\).
That is, the labels are \(\text{sign}(\langle w^*, x\rangle) = \text{sign}(x_t)\).
(For regression, we instead think of the labels as \(\langle w^*, x\rangle = \lambda_t^{-1/2} x_t\).)</p>
<p>From this data model alone, we can plug in the bounds of BLLT19 to see what they tell us. <em>Note: There actually isn’t a perfect analogue here, because BLLT includes additive label noise with variance \(\sigma^2\), while this blog post only considers the noiseless case of MNSBHS20. The purpose of these bounds is to illustrate what is known about a similar model.</em></p>
<ul>
<li>If \(r + q > 1\), then \(k^* = \min\{k \geq 0: r_k(\Sigma) = \Omega(n)\}\) is roughly \(n^p - O(n)\), which means that the \(\frac{k^*}{n}\) term of the bound makes the bound vacuous.</li>
<li>
<p>If \(r + q < 1\), then \(k^* = 0\). Then, the BLLT19 bounds yield an excess risk of at most</p>
\[O\left( \|w^*\|^2 \lambda_1 \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{\sigma^2 n}{R_{0}(\Sigma)} \right) = O\left( \sqrt{n^{r + q - 1}}+ \sigma^2 \max(n^{1-p}, n^{1-r-2q}) \right).\]
<p>For this bound to trend towards zero, it must be true that \(r + 2q > 1\) and that \(r+ q < 1\), which is already guaranteed.</p>
</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/bllt.jpeg" alt="" /></p>
<p>The bound given in the paper at hand will look slightly different (e.g., it won’t have the first requirement because label noise is modeled differently).
In addition, it will distinguish between benign overfitting in the classification and regression regimes and show that it’s easier to obtain favorable generalization error bounds for regression.</p>
<h2 id="main-results">Main results</h2>
<p>They have two types of main results: Theorem 1 shows the sufficient conditions for the coincidence of the SVM and OLS weights \(w_{SVM}\) and \(w_{OLS}\), and Theorem 2 analyzes the generalization of the excess errors of both classification and regression.</p>
<h3 id="when-does-svm--ols">When does SVM = OLS?</h3>
<p><em><strong>Theorem 1:</strong> For sufficiently large \(n\), \(w_{SVM} = w_{OLS}\) with high probability if</em></p>
<p>\(\|\lambda\|_1 = \Omega(\|\lambda\|_2 n \sqrt{\log n} + \|\lambda\|_\infty n^{3/2} \log n)\).</p>
<p>Equivalently, it must hold that \(R_0(\Sigma) = \Omega(\sqrt{n}(\log n)^{1/4})\) and \(r_0(\Sigma) = \Omega(n^{3/2} \log n)\).
The holds for the bilevel model when \(r + q > \frac{3}{2}\).</p>
<p>In the <a href="https://arxiv.org/abs/2009.10670" target="_blank">two</a> <a href="https://arxiv.org/abs/2105.14084" target="_blank">follow-ups</a>, this bound is changed to \(r_0(\Sigma) = \Omega(n \log n)\) and the phenomenon is shown to NOT occur when \(R_0(\Sigma) = O(n \log n)\).
Thus, this can actually be shown to occur for the bilevel model when \(r + q > 1\).</p>
<p>The proof of the theorem in this paper relies on applying bounds on Gaussian concentration and properties of the <a href="https://en.wikipedia.org/wiki/Inverse-Wishart_distribution" target="_blank">inverse-Wishart distribution</a>.
The future results rely on tighter concentration bounds, a leave-one-out equivalence that is true when a sample is a support vector, and a trick that relates the relevant quantities to a collection of independent random variables.</p>
<h3 id="generalization-bounds">Generalization bounds</h3>
<p>Their generalization bounds apply to the OLS solutions for two cases, (1) where the labels are real-valued and (2) where the labels are Boolean \(\{-1,1\}\).
We call the minimum norm solutions of these \(w_{OLS, real}\) and \(w_{OLS, bool}\).
Thus, when \(r\) and \(q\) are large enough for Theorem 1 to guarantee that OLS = SVM, then the bounds for Boolean labels apply.</p>
<p><em><strong>Theorem 2:</strong> For a bilevel data model that is 1-sparse without label noise, the classification error \(\lim_{n \to \infty} \mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0]\) and regression excess MSE error \(\lim_{n \to \infty} \mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2]\) satisfy the following for the given settings of \(p\), \(q\), and \(r\):</em></p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{OLS, bool}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (0, 1)\)</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>1</td>
<td>\(\frac12\)</td>
</tr>
</tbody>
</table>
<p>This table tells us several things about the differences in generalization between classification and regression.</p>
<ul>
<li>When \(\Sigma\) has a relatively even distribution of variance between the high-importance and low-importance coordinates and when there are relatively few coordinates, there tends to be favorable generalization for both classification and regression.
The reverse is true when there is a sharp cut-off between variances and when there are many high-importance features.
This fits a similar intuition to BLLT19, which forbids too sharp a decay of variances.</li>
<li>One might observe that this doesn’t have the other requirement from BLLT: that the variances do not decay too gradually, which is enforced by \(r + 2q > 1\). This is absent here because this data model does not include label noise, so the risk of a model being corrupted by overfitting noisy labels is minimized.</li>
<li>There is also a regime in between where classification generalizes well, but regression does not.</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/ols.jpeg" alt="" /></p>
<p>By combining the improved results on support vector proliferation with Theorem 2, we can obtain the following table of results for SVM vs OLS.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{SVM}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>1</td>
<td>\(\frac12\)</td>
</tr>
</tbody>
</table>
<p><img src="/assets/images/2021-11-04-mnsbhs20/svm.jpeg" alt="" /></p>
<p>How do these generalization bounds work? They’re similar to the flavor of argument given in <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, which considers signal bleed and signal contamination.
Put roughly, an interpolating model can perform poorly if either the true signal gets split up among a bunch of orthogonal aliases that each interpolate the training data (signal bleed), or too many spurious correlations are incorporated into the chosen alias (signal contamination).
They assess and bound these notions by introducing the <em>survival</em> and <em>contamination</em> terms as</p>
\[\mathsf{SU}(w, t) = \frac{w_t}{w^*_t} = \sqrt{\lambda_t} w_t \ \text{and} \ \mathsf{CN}(w, t) = \sqrt{\mathbb{E}[(\sum_{j\neq t} w_j x_j)^2]} = \sqrt{\sum_{j \neq t} \lambda_j w_j^2}\]
<p>This formulation is clean thanks to the 1-sparse assumption on the labels.
It seems like it may be possible to write something similar without this data model, but it would probably require much uglier expressions and more complex distributional assumptions to make the proof work.</p>
<p>The proof then uses Proposition 1 to relate the classification and regression errors to the survival and contamination terms and concludes by using Lemmas 11, 12, 13, 14, and 15 to place upper and lower bounds on those terms. Prop 1 shows the following relationships:</p>
\[\mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0] = \frac12 - \frac1{\pi} \tan^{-1} \left(\frac{\mathsf{SU}(w, t)}{\mathsf{CN}(w, t)} \right)\]
\[\mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2] = (1 - \mathsf{SU}(w, t))^2 + \mathsf{CN}(w, t)^2\]
<p>From looking at these terms, it should be intuitive why the classification error is more likely to go to zero than the regression error: it is sufficient for \(\mathsf{CN}(w, t)\) to become arbitrarily small for the classification error to approach zero, even if \(\mathsf{SU}(w, t)\) is a positive constant smaller than 1. On the other hand, it must be the case that \(\mathsf{CN}(w, t)\to 0\) <em>and</em> \(\mathsf{SU}(w, t)\to 1\) for the regression error to go to zero.</p>
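<p>Both identities in Proposition 1 are easy to check by Monte Carlo for Gaussian inputs with diagonal covariance (a sketch of my own, with arbitrary parameter choices):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
d, t = 50, 0
lam = np.linspace(2.0, 0.1, d)  # arbitrary diagonal covariance
w = rng.standard_normal(d)      # an arbitrary fixed predictor

su = np.sqrt(lam[t]) * w[t]                  # survival SU(w, t)
cn = np.sqrt(lam @ w**2 - lam[t] * w[t]**2)  # contamination CN(w, t)

w_star = np.zeros(d)
w_star[t] = 1.0 / np.sqrt(lam[t])  # the 1-sparse ground truth

X = rng.standard_normal((100_000, d)) * np.sqrt(lam)  # x ~ N(0, diag(lam))
err_mc = np.mean((X @ w) * (X @ w_star) < 0)          # classification error
mse_mc = np.mean((X @ (w_star - w))**2)               # regression excess MSE

err_cf = 0.5 - np.arctan(su / cn) / np.pi
mse_cf = (1.0 - su)**2 + cn**2
print(err_mc, err_cf)  # agree to Monte Carlo accuracy
print(mse_mc, mse_cf)
```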
<p>The concentration bounds in the lemmas are gory and I don’t plan to go into them here. They rely on a slew of concentration bounds that are made possible by the Gaussianity of the inputs and the tight control of their variances.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>This was another really interesting paper for me, although I wasn’t quite brave enough to venture through all of the proofs of this one.
It’s primarily interesting as a proof of concept; the assumptions are prohibitively restrictive (only one relevant coordinate, Gaussian inputs), but the proofs would have been sickening to the point of being unreadable if many of these assumptions were dropped. This paper was an inspiration for my collaborators and me to investigate support vector proliferation in more depth, and these are a nice complement to CL20, which proves bounds for more restricted values of \(d\) and without relying on limits.</p>
<p>Thanks for joining me once again! The next entry–and possibly the last entry of this series–will be posted next week. When the actual exam occurs in two weeks, I might have one last recap post of what’s been discussed so far.</p>Clayton Sanford[OPML#9] CL20: Finite-sample analysis of interpolating linear classifiers in the overparameterized regime2021-10-28T00:00:00+00:002021-10-28T00:00:00+00:00http://blog.claytonsanford.com/2021/10/28/cl20<!-- [[OPML#9]](/2021/10/28/cl20.html){:target="_blank"} -->
<p><em>This is the ninth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Like <a href="/2021/10/20/boosting.html" target="_blank">last week’s post</a>, we’ll step away from linear regression and discuss how over-parameterized <em>classification</em> models can achieve good generalization performance.
Unlike last week’s post, we focus on <em>maximum-margin classifiers</em> (or <em>support vector machines</em>) that interpolate the data in high-dimensional settings.
The paper is called <a href="https://arxiv.org/abs/2004.12019" target="_blank">“Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime”</a> and was written by Niladri Chatterji and Philip Long.</p>
<h2 id="maximum-margin-classifier">Maximum-margin classifier</h2>
<p>Suppose we have some linearly separable training data.
There are many different strategies of choosing a linear separator for those data, and it’s unclear off the bat which ones will generalize best to novel samples.
To sketch the issue, the below visualization shows how two linearly separable classes have many valid hypotheses that interpolate the training data and have zero training error.</p>
<p><img src="/assets/images/2021-10-28-cl20/separators.jpeg" alt="" /></p>
<p>The <em>maximum-margin classifier</em> chooses the separating hyperplane that, well, maximizes the margins between the separator and the two classes.
In the below visualization, the yellow separator is the hyperplane orthogonal to the vector \(w\) that most decisively classifies every positive and negative sample correctly.
That is, none of the sample are close to the separator, and \(w\) is chosen to have the largest <em>margin</em>, or gap between the data and the separator.
The space between the solid separator and the two dashed lines is the margin, a sort of demilitarized zone between the two classes of samples.</p>
<p><img src="/assets/images/2021-10-28-cl20/margin.jpeg" alt="" /></p>
<p>In order to quantify the margin, we require that \(w\) satisfies \(y_i \langle w, x_i\rangle \geq 1\) for all \(i\), where \(y_i \in \{-1,1\}\).
The width of the margin can be computed to be at least \(\frac1{\|w\|}\) if we enforce this requirement.
Therefore, the maximum-margin classifier is</p>
\[\mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^p} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>where \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}\) are the training samples.</p>
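<p>For intuition, here’s a tiny worked example (my own sketch) that solves this optimization problem directly with <code>scipy.optimize.minimize</code> on four 2-D points; the resulting \(w\) is the maximum-margin separator, with margin \(1/\|w\|\):</p>

```python
import numpy as np
from scipy.optimize import minimize

# Four linearly separable 2-D training points
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize ||w||^2 subject to y_i <w, x_i> >= 1 for all i
res = minimize(
    lambda w: w @ w,
    x0=np.array([0.1, 0.1]),
    constraints=[{"type": "ineq", "fun": lambda w, i=i: y[i] * (X[i] @ w) - 1.0}
                 for i in range(len(y))],
)
w = res.x
print(w, 1.0 / np.linalg.norm(w))  # w is approximately (0.4, 0.4); margin ~ 1.77
```

Two of the four points end up exactly on the margin; those are the support vectors.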
<p>Last week’s blog post discussed in detail why maximum-margin classifiers can lead to good generalization. Primarily, having large margins means that the classifier is robust and will correctly classify points drawn near any of the training samples.
This is great, and provided a ton of insight into why overfitting is not always a bad thing. However, those results were limited in their applicability:</p>
<ul>
<li>They only apply to voting-based classifiers with margins, and this maximum-margin classifier does not <em>directly</em> aggregate together multiple weak classifiers. (One could think of a linear combination of features as being a combination of other classifiers, but those are not explicitly spelled out in the maximum-margin classifier.)</li>
<li>Their bounds only apply to perfectly clean training data; if an \(\eta\)-fraction of the samples have incorrect labels, then their bounds fall apart.</li>
</ul>
<p>This paper suggests that these kinds of bounds are possible for the maximum margin classifier when the dimension is much larger than the number of samples.</p>
<p><em>Aside: Their formulation of the maximum-margin classifier is identical to that of the</em> support vector machine (SVM)<em>. The samples that lie on the margin (in our case, two red samples and two blue samples on the dotted lines) are</em> support vectors, <em>which the separator can be written in terms of. Classical capacity-based generalization approaches for SVMs rely on having few support vectors, but <a href="https://arxiv.org/abs/2005.08054" target="_blank">some</a> <a href="https://arxiv.org/abs/2011.09148" target="_blank">recent</a> <a href="https://arxiv.org/abs/2104.13628" target="_blank">works</a> have shown that generalization bounds can be proved in a setting with many support vectors. <a href="https://arxiv.org/abs/2105.14084" target="_blank">One of my papers</a>, which will appear at NeurIPS 2021 (and which I’ll discuss in a forthcoming blog post) proves when</em> support vector proliferation<em>, a phenomenon in which every sample is a support vector, occurs.</em></p>
<h2 id="data-model">Data model</h2>
<p>Like the linear regression papers we’ve discussed, this paper exhibits the phenomenon of benign overfitting under strict distributional assumptions. We present a simplified version of their data model below.</p>
<ul>
<li>A label \(\tilde{y} \in \{-1,1\}\) is chosen by a coin flip. With probability \(\eta\) (which can be no larger than some constant less than 1), the label is <em>corrupted</em> and \(y = - \tilde{y}\). Otherwise, \(y = \tilde{y}\).</li>
<li>For some <em>mean vector</em> \(\mu \in \mathbb{R}^p\) and some \(q\) drawn from a \(p\)-dimensional subgaussian distribution with a lower-bound on expected norm, the input \(x\) is chosen to be \(q + \tilde{y} \mu\).</li>
</ul>
<p>That is, the inputs cluster in one of two regions: around \(\mu\) if \(\tilde{y} = 1\) and around \(-\mu\) if \(\tilde{y} = -1\).
Intuitively, this means the learning problem is much easier if \(\mu\) is large, because the clusters will be more sharply separated.</p>
<p><img src="/assets/images/2021-10-28-cl20/data.jpeg" alt="" /></p>
<p>The data model is limited by the fact that they assume this kind of two-cluster structure. However, it’s intended as a proof of concept of sorts, and the setup allows one to explore how changing the number of samples \(n\), the dimension \(p\), and the distinctiveness of the classes \(\|\mu\|^2\) shapes which bounds are possible.</p>
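Here’s a minimal numpy sketch of this data model, with Gaussian noise standing in for the paper’s general subgaussian \(q\) (the specific parameter values are my own choices). Even the best-possible linear rule, classifying by the sign of \(\langle \mu, x\rangle\), errs on roughly the \(\eta\) fraction of corrupted labels, which is the unavoidable \(\eta\) term in Theorem 4 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster_model(n, p, mu, eta, rng):
    """x = q + y_tilde * mu, with the observed label y flipped w.p. eta."""
    y_tilde = rng.choice([-1, 1], size=n)                  # clean label
    y = np.where(rng.random(n) < eta, -y_tilde, y_tilde)   # possibly corrupted
    x = rng.standard_normal((n, p)) + y_tilde[:, None] * mu
    return x, y

p, eta = 500, 0.1
mu = np.zeros(p)
mu[0] = 6.0                      # a large mean vector: well-separated clusters
x, y = sample_cluster_model(20000, p, mu, eta, rng)

# sign(<mu, x>) recovers the clean label almost surely here, so its error is
# essentially the eta fraction of corrupted labels.
err = np.mean(np.sign(x @ mu) != y)
print(f"error of sign(<mu, x>): {err:.3f}")   # close to eta = 0.1
```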
<p>They give several examples of this data model, and I’ll recount their Example 3, which they call the <em>Boolean noisy rare-weak model.</em>
They sample \(y\) and \(\tilde{y}\) as above.
\(x\) is drawn from a distribution over \(\{-1,1\}^p\), where \(x_1, \dots, x_s\) independently equal \(\tilde{y}\) with probability \(\frac12 + \gamma\) and \(-\tilde{y}\) otherwise, for some \(s \leq p\) and \(\gamma \in (0, \frac12)\). \(x_{s+1}, \dots, x_p\) are the results of independent fair coin tosses.</p>
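A quick sketch of sampling from this model (parameter choices are mine). It also checks that this is the two-cluster model in disguise: \(\mathbb{E}[x \mid \tilde{y}] = 2\gamma \tilde{y}\) on the \(s\) signal coordinates, so \(\mu = 2\gamma \cdot (1, \dots, 1, 0, \dots, 0)\) and \(\|\mu\|^2 = 4\gamma^2 s\), which is why \(\gamma^2 s\) plays the role of \(\|\mu\|^2\) in Corollary 6 below.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_rare_weak(n, p, s, gamma, eta, rng):
    """Boolean noisy rare-weak model: the first s coordinates agree with the
    clean label w.p. 1/2 + gamma; the rest are fair coin flips."""
    y_tilde = rng.choice([-1, 1], size=n)                    # clean label
    y = np.where(rng.random(n) < eta, -y_tilde, y_tilde)     # observed label
    x = np.where(rng.random((n, p)) < 0.5 + gamma, 1.0, -1.0) * y_tilde[:, None]
    x[:, s:] = rng.choice([-1.0, 1.0], size=(n, p - s))      # pure-noise coords
    return x, y, y_tilde

n, p, s, gamma = 50000, 200, 20, 0.25
x, y, y_tilde = sample_rare_weak(n, p, s, gamma, eta=0.05, rng=rng)

signal_corr = np.mean(x[:, :s] * y_tilde[:, None])   # ~ 2 * gamma = 0.5
noise_corr = np.mean(x[:, s:] * y_tilde[:, None])    # ~ 0
print(signal_corr, noise_corr)
```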
<h2 id="main-result">Main result</h2>
<p>Their main result is a generalization bound for this two-cluster data model.
The result relies on several assumptions about \(n\), \(d\), and \(\mu\).</p>
<p><em><strong>Theorem 4:</strong> Suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\|\mu\|^2n, n^2 \log n))\), (3) \(\|\mu\|^2 = \Omega(\log n)\), and (4) \(p = O(\|\mu\|^4 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>The main inequality is a bound on the generalization error of the classifier \(w\) because it deals with new samples, rather than the ones used to train the classifier.
The \(\eta\) term in the error is unavoidable, because any sample will be corrupted with probability \(\eta\).
The \(\epsilon\) term is the more interesting one, which governs the excess error.</p>
<p>The requirement that \(p = \Omega(n^2 \log n)\) means the model must be in a <em>very</em> high-dimensional regime. Recall that papers like <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a> consider a regime where \(p = \Theta(n)\); here, this paper only says anything about generalization when \(p\) is much larger than \(n\). We also require pretty specific conditions about \(\mu\).</p>
<p>To make life easier, let \(\mu = (q, 0, \dots, 0) \in \mathbb{R}^p\). The excess error can then only be small if \(q \gg p^{1/4}\). Since it must also be the case that \(q \ll \sqrt{p/ n}\), this gives a relatively narrow interval that \(q\) can belong to.</p>
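A couple of numerical instantiations of that window (constants suppressed; the example values are mine) show that it is nonempty only when \(p\) is much larger than \(n^2\), consistent with assumption (2):

```python
import math

# Window for q = ||mu||: the excess error needs q >> p**(1/4), while
# assumption (2), p = Omega(||mu||^2 n), forces q << sqrt(p/n).
def q_window(p, n):
    return p ** 0.25, math.sqrt(p / n)

for p, n in [(10**4, 10), (10**4, 100), (10**6, 100)]:
    lo, hi = q_window(p, n)
    print(f"p={p}, n={n}: q roughly in ({lo:.1f}, {hi:.1f})")
```

For \(p = 10^4\) and \(n = 100\), the two endpoints coincide at 10, so the window closes exactly when \(p = n^2\).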
<p>They formulate the theorem specifically for the example we consider as well.</p>
<p><em><strong>Corollary 6:</strong> For the Boolean noisy rare-weak model, suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\gamma^2 s n, n^2 \log n))\), (3) \(\gamma^2 s = \Omega(\log n)\), and (4) \(p = O(\gamma^4 s^2 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>This means that if \(\gamma\) is some constant like \(0.25\), it must be true that \(s \gg \sqrt{p}\) and \(s \ll p/n\).
Therefore, only a small fraction of the dimensions of \(x\) can be indicative of the label \(y\), and most of the input is just noise.
Or, if \(s = p\) and every feature is significant, then \(\gamma\) must satisfy \(\gamma \ll 1/\sqrt{n}\) and \(\gamma \gg 1 / p^{1/4}\), which means that each feature will only have a minute amount of signal.
This closely resembles the kinds of settings that we showed have good generalization for linear regression in <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a> long ago.</p>
<h2 id="proof-overview">Proof overview</h2>
<p>The proof relies on a proof by <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a> that using gradient descent to optimize logistic regression for separable data gives a separating hyperplane that maximizes margins.
That is, gradient descent with a logistic loss function has an implicit bias that leads to the same solution as that of an SVM.</p>
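This SHNGS18-style implicit bias can be sanity-checked numerically. Below is a sketch (the toy dataset, step size, and iteration count are my own choices): gradient descent on the logistic loss of a separable dataset produces an iterate whose direction matches the max-margin separator, which by symmetry here is \((1,1)/\sqrt{2}\).

```python
import numpy as np

# Four separable points, symmetric under swapping the two coordinates.
x = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

v = np.zeros(2)
lr = 0.5
for _ in range(50000):
    margins = y * (x @ v)
    # gradient of the average logistic loss (1/m) sum_i log(1 + exp(-margin_i))
    grad = -(x * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    v -= lr * grad

# ||v|| keeps growing, but its direction converges to the max-margin direction.
direction = v / np.linalg.norm(v)
print(direction)                   # ~ [0.7071, 0.7071]
print(min(y * (x @ direction)))    # ~ 2.121 = 3/sqrt(2), the maximum margin
```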
<p>In Lemma 9, they use a simple concentration bound to show that the generalization error is small if \(\langle w, \mu\rangle\) is large relative to \(\|w\|\), where \(\mu\) is the mean vector and \(w\) is the learned classifier.
They relate this to the classifiers obtained in each step of gradient descent \(v^{(t)}\) and bound \(\langle v^{(t)}, \mu\rangle\) by expanding the gradient step to write \(v^{(t)}\) in terms of all previous risks.
Taking a limit of \(t \to \infty\) relates this to the maximum-margin classifier.</p>
<p>Lemma 10 lower-bounds the target inner product. A key component of the proof of that is Lemma 14, which shows that the loss caused by any one sample cannot be much more than that of any other sample with high probability.
This is important because it means that the noisy samples (those with flipped labels) cannot have an outsized impact on the result, and that the analysis is robust to those errors.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>This paper was neat, since it showed something similar to what a variety of previously surveyed papers uncovered about minimum-norm linear regression.
It can also be seen as strengthening the margin-based boosting results discussed last week, since these results handle samples with noisy labels and apply to non-voting margin-based classifiers.</p>
<p>However, the results are limited by the degree of over-parameterization (that is, the size of the dimension) they require; \(p = \Omega(n^2 \log n)\) is a pretty steep requirement, especially since results like my <a href="https://arxiv.org/abs/2105.14084" target="_blank">OLS=SVM paper</a> suggest that minimum-norm regression (with labels drawn in \(\{-1,1\}\)) and maximum-margin classifiers coincide when \(p = \Omega(n \log n)\).
They specifically identify improving the dependence on \(p\) as motivation for future work, and I hope to see that tackled at some point.</p>
<p><em>Thanks for reading this week’s entry! The actual exam is coming up on November 16th, and you should expect at least two more posts about papers before then!</em></p>
Clayton Sanford
[OPML#8] FS97 & BFLS98: Benign overfitting in boosting (2021-10-20, http://blog.claytonsanford.com/2021/10/20/boosting)<!-- [[OPML#8]](/2021/10/20/boosting.html){:target="_blank"} -->
<p><em>This is the eighth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p><em>In other news, there’s <a href="https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011/" target="_blank">a cool Quanta article</a> that touches on over-parameterization and the analogy between neural networks & kernel machines that just came out. Give it a read!</em></p>
<p>When conducting research on the theoretical study of neural networks, it’s common to joke that one’s work was “scooped” by a paper in the 1990s.
There’s a lot of classic ML theory work that was published well before the deep learning boom of the last decade.
As a result, it’s common for researchers to ignore it and unknowingly repackage old ideas as novel.</p>
<p>This week, I finally escape my pattern of discussing papers from the ’10s and ’20s by presenting a pair of seminal papers from the late ’90s: <a href="https://www.sciencedirect.com/science/article/pii/S002200009791504X" target="_blank">FS97</a> and <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-26/issue-5/Boosting-the-margin--a-new-explanation-for-the-effectiveness/10.1214/aos/1024691352.full" target="_blank">BFLS98</a>.
Both of these papers cover <em>boosting</em>, a learning algorithm that aggregates many <em>weak learners</em> (heuristics that perform just better than chance) into a much better prediction rule.</p>
<ul>
<li>FS97 introduces the <em>AdaBoost</em> algorithm, proves that it can combine weak learners to perfectly fit a training dataset, and gives generalization bounds based on VC-dimension.
The authors note that empirically, the algorithm performs much better than these capacity-based bounds and exhibits some form of <em>benign overfitting</em> (which has been extensively discussed in posts like <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="/2021/09/11/xh19.html" target="_blank">[OPML#6]</a>).</li>
<li>BFLS98 addresses that mystery and resolves it by giving a different type of generalization bound, a <em>margin-based bound</em>, which explains why the generalization performance of AdaBoost continues to improve after it correctly classifies the training data.</li>
</ul>
<p>These papers fit into the series because they exhibit a very similar phenomenon to the one we frequently encounter with over-parameterized linear regression and in deep neural networks:
A learning algorithm is trained to zero training error and has small generalization error, despite capacity-based generalization bounds suggesting that this should not occur.
Moreover, the generalization error continues to decrease as the model becomes “more over-parameterized” and continues to train beyond zero training error.
These papers highlight the significance of <em>margin bounds</em>, which have been studied in papers <a href="https://arxiv.org/abs/1909.12292" target="_blank">like</a> <a href="https://arxiv.org/abs/1706.08498" target="_blank">these</a> in the context of neural network generalization.</p>
<p>We’ll jump in by explaining boosting, before discussing capacity-based and margin-based generalization bounds and the connection to benign overfitting.</p>
<h2 id="boosting">Boosting</h2>
<p>We motivate and discuss the boosting algorithm presented in FS97.</p>
<h3 id="population-training-and-generalization-errors">Population, training, and generalization errors</h3>
<p>To motivate the problem, consider a setting where the goal is to learn a classifier from training data.
That is, you (the learner) have \(m\) samples \(S = \{(x_1, y_1), \dots, (x_m, y_m)\} \subset X \times \{-1,1\}\) drawn independently from some distribution \(\mathcal{D}\).
The goal is to learn some <em>hypothesis</em> \(h: X \to \{-1,1\}\) with low population error, that is</p>
\[\text{err}_{\mathcal{D}}(h) = \text{Pr}_{(x, y) \sim \mathcal{D}}[h(x) \neq y].\]
<p>To do so, we follow the strategy of <em>empirical risk minimization</em>, that is choosing the \(h\) that minimizes <em>training error</em>:</p>
\[\text{err}_S(h) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{h(x_i) \neq y_i\}.\]
<p>Often, the goal is to obtain a <em>PAC learning</em> (Probably Approximately Correct learning) guarantee, which entails showing that there exists some learning algorithm that, with probability \(1 - \delta\), returns a hypothesis \(h\) such that \(\text{err}_{\mathcal{D}}(h) \leq \epsilon\) in time polynomial in \(\frac{1}{\epsilon}\) and \(\frac1\delta\), for any small \(\epsilon, \delta > 0\).</p>
<p>We can decompose the population error into two terms and analyze when algorithms succeed and fail based on the two:</p>
\[\text{err}_{\mathcal{D}}(h) = \underbrace{\text{err}_{\mathcal{D}}(h)-\text{err}_S(h)}_{\text{generalization error}} + \underbrace{\text{err}_S(h).}_{\text{training error}}\]
<p>This framing implies two very different types of failure modes.</p>
<ol>
<li>If the training error is large when \(h\) is an empirical risk minimizing hypothesis, then there is a problem with expressivity. In other words, no hypothesis closely fits the training data, which makes it very likely that no hypothesis will succeed on random samples drawn from \(\mathcal{D}\) either.</li>
<li>If the generalization error is large, then the sample \(S\) is not representative of the distribution \(\mathcal{D}\). <em>Overfitting</em> refers to the issue where the training error is small and the generalization error is large; the hypothesis does a good job memorizing the training data, but it learns little of the actual underlying learning rule because there aren’t enough samples. This typically occurs when \(h\) comes from a family of hypotheses that are <em>too complex.</em></li>
</ol>
<p>We can visualize these trade-offs with respect to the model complexity below, as they’re understood by traditional capacity-based ML theory. (There’s a very similar image in the introductory post of this blog series.)</p>
<p><img src="/assets/images/2021-10-20-boosting/descent.jpeg" alt="" /></p>
<p>While these blog posts focus on problematizing this picture by exhibiting cases where there is <em>both</em> overfitting and low generalization error, we introduce boosting in the context of solving the opposite problem: What do you do when the model complexity is too low, and no hypotheses do a good job of even fitting the training data?</p>
<h3 id="limitations-of-linear-classifiers">Limitations of linear classifiers</h3>
<p>Consider the following picture:</p>
<p><img src="/assets/images/2021-10-20-boosting/redblue.jpeg" alt="" /></p>
<p>Suppose our goal is to find the best linear classifier that separates the red data (+1) from the blue data (-1) and (ideally) will also separate new red data from new blue data.
However, there’s an immediate problem: no linear classifier achieves training error below \(\frac13\) on this data. For instance, the following separator (which labels everything with \(\langle w, x\rangle > 0\) red and everything else blue, for some vector \(w \in \mathbb{R}^2\)) performs poorly on the upper “slice” of red points and the lower slice of blue points.</p>
<p><img src="/assets/images/2021-10-20-boosting/line1.jpeg" alt="" /></p>
<p>Neither of these are any good either.</p>
<p><img src="/assets/images/2021-10-20-boosting/line23.jpeg" alt="" /></p>
<p>All three of the above linear separators have roughly a \(\frac23\) probability of classifying a sample correctly, but they each miss a different slice of the data.
A natural question to ask is: Can these three separators be combined in some way to improve the training error of the classifier?</p>
<p>The answer is yes. By taking a <em>majority vote</em> of the three, one can correctly classify all of the data. That is, if at least two of the three linear classifiers think the point is red, then the final classifier predicts that the point is red.
The following is a visualization of how this voting scheme works. (Maroon regions have 2 separators saying “red” and are classified as red. Purple regions have 2 separators saying “blue” and are classified as blue.)</p>
<p><img src="/assets/images/2021-10-20-boosting/vote.jpeg" alt="" /></p>
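As a numerical sanity check of this voting picture, here is a hypothetical construction (mine, not the figure’s actual data): three halfplane classifiers with normal vectors spaced 120 degrees apart, where the “true” label is defined as their majority vote. Each individual classifier is then right about two thirds of the time, while the vote is right everywhere by construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three halfplane classifiers whose normal vectors are 120 degrees apart.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # three unit normals

x = rng.standard_normal((30000, 2))
votes = np.sign(x @ W.T)            # (n, 3): each classifier's call on each point
y = np.sign(votes.sum(axis=1))      # majority label; never a tie with 3 voters

for j in range(3):
    print(f"classifier {j} accuracy: {np.mean(votes[:, j] == y):.3f}")  # ~0.667
print("majority accuracy:", np.mean(np.sign(votes.sum(axis=1)) == y))   # 1.0
```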
<p>We increase the complexity of the model (by aggregating together three different classifiers), which gets us down to zero training error in this case.
This helps solve the approximation issue, but it presents a new one for generalization. Can we expect this new “voting” classifier to perform well, given that it’s more complex than a single linear classifier?</p>
<p><em>Boosting</em> is an algorithm that formalizes this voting logic in order to string together a bunch of weak classifiers into one that performs well on all of the training data. In the last two sections of the blog post, we give two takes on generalization of boosting approaches, to answer the aforementioned question about whether we expect this kind of overfitting to hurt or not.</p>
<h3 id="weak-learners">Weak Learners</h3>
<p>The linear classifiers above are examples of <em>weak learners</em>, which perform slightly better than chance on the training data and which we combine together to make a stronger learner.</p>
<p>To formalize that concept, we say that a learning algorithm is a <em>weak learning algorithm</em> or a <em>weak learner</em> if it can PAC-learn a family of functions \(\mathcal{C}\) with error \(\epsilon = \frac12 - \eta\) for some advantage \(\eta > 0\), with probability \(1- \delta\), where samples are drawn from some distribution \(\mathcal{D}\).</p>
<p>The idea with weak learning in the context of boosting is that you use the weak learning algorithm to obtain a classifier \(h\) that weak-learns the family over some weighted distribution of the samples.
Then, the distribution can be modified accordingly, in order to ensure that the next weak learner performs well on the samples that the original hypothesis performed poorly on.
In doing so, we gradually find a cohort of weak classifiers, such that each sample is correctly classified by a large number of weak learners in the cohort.</p>
<p><img src="/assets/images/2021-10-20-boosting/wl.jpeg" alt="" /></p>
<p>The graphic visualizes this flow.
The top-right image represents the first weak classifier found on the distribution that samples evenly from the training data. It performs well on at least \(\frac23\) of the samples.
Then, we want the weak learning algorithm to give another weak classifier, but we want it to be different and ensure that other samples are correctly classified, particularly the ones misclassified by the first one.
Therefore, we amplify those misclassified samples in the distribution (bottom-left) and learn a new learning rule on that reweighted distribution.
For that learning rule to qualify as a weak learner, it must classify \(\frac23\) of the <em>weighted</em> samples correctly. To do so, it’s essential that it correctly classifies the previously-misclassified samples.
Hence, it chooses a different rule.
Continuing to iterate this will give a wide variety of weak learners.</p>
<p>This intuition is formalized in the AdaBoost algorithm.</p>
<h3 id="adaboost">AdaBoost</h3>
<p>Here’s how the algorithm works, as stolen from FS97.</p>
<ul>
<li>Input: some input set of samples \((x_1, y_1), \dots, (x_m, y_m)\), a number of rounds \(T\), and a procedure <strong>WeakLearn</strong> that outputs a weak learner given a distribution over samples.</li>
<li>Initialize \(w^1 = \frac{1}{m} \vec{1} \in [0,1]^m\) to be a uniform starting distribution over training samples. (Note: the algorithm in the paper works for a general starting distribution, but we stick to the uniform distribution for simplicity.)</li>
<li>For round \(t \in [T]\), do the following:
<ol>
<li>Update the probability distribution by normalizing the current weight vector: \(p^t = \frac{1}{\|w^t\|_1} w^t.\)</li>
<li>Use <strong>WeakLearn</strong> to obtain a weak learner \(h_t: X \to [-1,1]\).</li>
<li>Calculate the error of \(h_t\) on the <em>weighted</em> training samples: \(\epsilon_t = \frac12 \sum_{i=1}^m p_i^t \lvert h_t(x_i) - y_i\rvert\). (Note: this differs by a factor of \(\frac12\) from the version presented in the paper because we assume the output of the functions to be \([-1,1]\) rather than \([0,1]\).)</li>
<li>Let \(\beta_t = \frac{\epsilon_t}{1 - \epsilon_t} \in (0,1)\) represent, inversely, roughly how much weight should be assigned to \(h_t\) in the final classifier. (If \(h_t\) has small error, then it’s a “helpful” classifier that should be given more priority.)</li>
<li>Adjust the weight vector by de-emphasizing samples that were accurately classified by \(h_t\). For all \(i \in [m]\), let</li>
</ol>
\[w_i^{t+1} = w_i^t \beta_t^{1 - \frac12 |h_t(x_i) - y_i|}.\]
</li>
<li>
<p>Output the final classifier, a weighted majority vote of the weak learners:</p>
\[h_f(x) = \text{sign}\left(\sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t} \right).\]
<p>(This also differs from the final hypothesis in the paper because of the difference in output.)</p>
</li>
</ul>
<p>This formalizes the process illustrated above, where we rely on <strong>WeakLearn</strong> to produce learning rules that perform well on samples that have been misclassified frequently in the past.</p>
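To make the loop concrete, here’s a minimal runnable sketch. The pool of three fixed halfplane classifiers and the synthetic labels (their majority vote) are my own toy construction echoing the earlier voting figures, not something from FS97; since every sample is correctly classified by at least two of the three pool members, the best pool member always has weighted error at most \(\frac13\), so this <strong>WeakLearn</strong> always has an edge. The numbered comments map to the steps above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Pool of three halfplane classifiers; ground truth is their majority vote.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
pool = np.stack([np.cos(angles), np.sin(angles)], axis=1)

x = rng.standard_normal((300, 2))
votes = np.sign(x @ pool.T)          # (m, 3): each pool member's prediction
y = np.sign(votes.sum(axis=1))       # ground truth = majority of the pool

def adaboost(votes, y, T):
    m = len(y)
    w = np.full(m, 1.0 / m)                          # uniform starting weights
    alphas = np.zeros(votes.shape[1])
    for _ in range(T):
        p = w / w.sum()                              # step 1: normalize
        errs = p @ (votes != y[:, None])             # weighted error per member
        t = int(np.argmin(errs))                     # step 2: WeakLearn
        eps = max(errs[t], 1e-12)                    # step 3: epsilon_t
        beta = eps / (1 - eps)                       # step 4: beta_t
        w = w * beta ** (votes[:, t] == y)           # step 5: de-emphasize hits
        alphas[t] += np.log(1 / beta)                # accumulate log(1/beta_t)
    return alphas

alphas = adaboost(votes, y, T=120)
h_f = np.sign(votes @ alphas)        # the weighted majority vote
print("training error:", np.mean(h_f != y))
```

With \(\epsilon_t \leq \frac13\) each round, Theorem 1 below bounds the training error by \((2\sqrt{2}/3)^{120} < \frac{1}{300}\), which forces it to be exactly zero.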
<p>Why is it called <strong>Ada</strong>Boost?
Unlike previous (less famous) boosting algorithms, it doesn’t require that the weak learners achieve some minimum accuracy known to the algorithm in advance.
Rather, it can work with all errors \(\epsilon_t\) and hence <em>adapt</em> to the samples given.</p>
<p>It’s natural to ask about the theoretical properties of the algorithm.
Specifically, can AdaBoost successfully aggregate a bunch of weak learners into a “strong learner” that classifies all but an \(\epsilon\) fraction of the training samples for any \(\epsilon\)?
And if so, how many rounds \(T\) are needed?
And how small must we expect \(\epsilon_t\) (the accuracy of each weak learner) to be?
This leads us to the main AdaBoost theorem.</p>
<p><em><strong>Theorem 1</strong> [Performance of AdaBoost on training data, Theorem 6 of FS97]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the error of the final hypothesis \(h_f\) is bounded by</em></p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)}.\]
<p>From this, one can naturally ask: How long will it take to correctly classify all of the training data? For that, it suffices to have \(\epsilon < \frac1m\): the training error is always a multiple of \(\frac1m\), so any error below \(\frac1m\) must be exactly zero.</p>
<p>For the sake of simplicity, we calculate the \(T\) necessary for \(\epsilon_t \leq 0.4\). (That is, each weak learner has advantage at least 0.1.)</p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)} \leq 2^T (0.24)^{T/2} = (2 \sqrt{0.24})^T < \frac{1}{m},\]
<p>which occurs when</p>
\[T > \frac{\log m}{\log (1 / (2 \sqrt{0.24}))} \approx 49 \ln m \approx 113 \log_{10} m.\]
<p>This is a really nice bound to have! It tells us that the training error can be rapidly bounded, despite only having the ability to aggregate classifiers that perform slightly better than chance.</p>
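The arithmetic behind that constant is quick to check; note that its value depends on the base of the logarithm (roughly 49 with natural logs, roughly 113 with base-10 logs):

```python
import math

# Per-round factor in Theorem 1's bound when every epsilon_t <= 0.4:
# each round multiplies the bound by 2*sqrt(0.4 * 0.6) = 2*sqrt(0.24).
rate = 2 * math.sqrt(0.24)
print(rate)                       # ~0.9798
print(1 / math.log(1 / rate))     # ~49.0 (natural log)
print(1 / math.log10(1 / rate))   # ~112.8 (base-10 log)
```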
<p>The proof is simple and elegant, and I’m not going into it much.
It’s well-explained by the paper, but much of it boils down to the intuition that if a training sample is neglected by many weak learners, then its emphasis continues to increase until it can no longer be ignored without meeting the weak learnability error guarantees.</p>
<p>Despite all of these nice things, this theorem is limited. It only covers the performance of the weighted majority classifier on the training data and says nothing about generalization.
Indeed, it’s reasonable to fret about the generalization performance of this aggregate classifier.
If we substantially increase the expressivity of the weak classifiers by combining them, then wouldn’t capacity-based generalization theory tell us that generalization must suffer in exchange?
And isn’t it further compromised by the fact that training for a relatively small number of rounds leads to an aggregate hypothesis that perfectly fits the training data?</p>
<p>We focus for the remainder of the post on generalization, first examining it through the lens of classical capacity-based generalization theory, as done by FS97.</p>
<h2 id="capacity-based-generalization">Capacity-based generalization</h2>
<p>Looking back on the first visual of this post, classical learning theory has a simple narrative for what boosting does:</p>
<ul>
<li>The individual weak classifiers provided by <strong>WeakLearn</strong> lie on the left side of the curve (low generalization error, high training error) because they have a poor training error. Thus, they cannot fit complex patterns and are likely intuitively “simple,” which could translate to a low VC-dimension and hence a low generalization error.</li>
<li>As each stage of the boosting algorithm runs, the aggregate classifier moves further to the right, improving training error at the cost of generalization error. After sufficiently many rounds \(T\) have occurred to drive the training error to zero, the generalization error will be so large as to make any bound on the population error vacuous.</li>
</ul>
<p>This intuition is made explicit by the generalization bound presented by FS97, which bounds the VC-dimension of a majority vote of classifiers with individual VC-dimension at most \(d\) and applies the standard VC-dimension bound on generalization.</p>
<p>They get the following bound, which combines their Theorem 7 and Theorem 8.</p>
<p><em><strong>Theorem 2</strong> [Capacity-based generalization bound] Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Suppose <strong>WeakLearn</strong> outputs hypotheses from a class \(\mathcal{H}\) having \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), the following inequality holds for all final hypotheses \(h_f\) that can be returned by AdaBoost:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) \leq \underbrace{\text{err}_{S}(h_f)}_{\text{training error}} + \underbrace{O\left(\sqrt{\frac{dT\log(T)\log(m/dT) + \ln\frac1{\delta}}{m}}\right).}_{\text{generalization error}}\]
<p>This bound fits cleanly into the intuition described above.
To keep the generalization small, \(T\) and \(d\) must be kept small relative to the number of samples. Doing so forces the training error to be large, because Theorem 1 suggests that \(h_f\) will have small training error when (1) AdaBoost runs for many iterates (large \(T\)) or (2) <strong>WeakLearn</strong> produces accurate classifiers, which requires an expressive family of weak learners (large \(d\)).
Hence, we’re necessarily trading off the two types of error.</p>
<p>However, this isn’t the full story.
When running experiments, they confirmed that after many rounds, the training error approached zero (as expected by Theorem 1).
But they also found that the test error dropped along with the training error <em>and</em> that the test error continued to drop even after the training error went to zero.
To explain this phenomenon, we turn to BFLS98, where the authors explain this low generalization error using <em>margin-based</em> bounds rather than capacity-based bounds.</p>
<p><img src="/assets/images/2021-10-20-boosting/general.jpeg" alt="" /></p>
<h2 id="margin-based-generalization">Margin-based generalization</h2>
<p>A key idea in the story about margin-based generalization is that a classifier that correctly and <em>decisively</em> categorizes all the training data is more robust (and more likely to generalize) than one that nearly categorizes samples incorrectly.
Roughly, slightly perturbing the samples in the first case will lead to samples that receive the same labels, while the same may not hold in the second case.</p>
<p>Analyzing this requires considering some notion of <em>margin</em>, which quantifies the decisiveness of the classification.
For now, consider a modified version of the weighted majority classifier derived from AdaBoost:</p>
\[h_f(x) = \frac{\sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t}}{\sum_{t=1}^T \log\frac{1}{\beta_t}}.\]
<p>The differences here are that we dropped the \(\text{sign}\) function and normalized the vote weights to sum to one, which means the output lies in \([-1,1]\).
\(h_f\) categorizes the sample \((x,y)\) correctly if \(yh_f(x) > 0\), because the sign of \(h_f\) will then match \(y\).
We say that \(h_f\) categorizes a sample correctly <em>with margin \(\theta > 0\)</em> if \(yh_f(x) \geq \theta\).
This means that, if \(h_f\) is an aggregation of a large number of weak classifiers, then a small number of those classifiers changing their outputs will not change the overall outcome of \(h_f\).</p>
<p>There are two key steps that lead to new generalization bounds by BFLS98 for AdaBoost.</p>
<ol>
<li>AdaBoost (after sufficiently many rounds \(T\) and with sufficiently small weak learner errors \(\epsilon_t\)) will classify the sample \(S\) correctly with some margin \(\theta\).</li>
<li>Any linear combination of \(N\) classifiers (each of which has bounded VC dimension) with margin \(\theta\) on the training data has a generalization bound that depends on \(\theta\) and <em>not</em> on \(N\).</li>
</ol>
<p>They accomplish (1) by proving a theorem that is very similar in flavor and proof to the Theorem 1 we gave earlier.</p>
<p><em><strong>Theorem 3</strong> [Margins of AdaBoost on training data, Theorem 5 of BFLS98]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the final hypothesis \(h_f: X \to [-1,1]\) satisfies the following margin bound on the training set \((x_1, y_1), \dots, (x_m, y_m)\) for any \(\theta \in [0,1)\):</em></p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i) \leq \theta \}\leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t^{1-\theta}(1 - \epsilon_t)^{1 + \theta}}.\]
<p>To make matters more concrete once again, consider the case where \(\epsilon_t \leq 0.4\) as before.
Then, the bound gives</p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} \leq 2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2}.\]
<p>If we want all training samples to obey the condition, we enforce that the margin term is less than \(\frac1{m}\).
Consider two cases:</p>
<ul>
<li>By some calculations (with the help of WolframAlpha), if \(\theta = 0.1\), then \(y_i h_f(x_i) \geq \theta\) for all \(i \in [m]\) if \(T > 7260 \ln m\). This is very similar to our application of Theorem 1, albeit with bigger constants.</li>
<li>
<p>If \(\theta = 0.2\), then</p>
\[2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2} = 2^T (0.4)^{0.4T}(0.6)^{0.6T} \approx 1.02^T,\]
<p>which means that the bounds can never guarantee that the margins will be that large with time.</p>
</li>
</ul>
<p>These bounds provide a way of finding a margin \(\theta\) dependent on \(T\) and errors \(\epsilon_1, \dots, \epsilon_T\), which will be useful in the second part.</p>
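The per-round factor in these calculations is worth tabulating as a function of \(\theta\), under the same \(\epsilon_t \leq 0.4\) assumption as above (a small script of my own):

```python
import math

# Per-round factor of Theorem 3's bound with epsilon_t <= 0.4, as a
# function of the target margin theta: the bound shrinks geometrically
# while the factor is below 1 and is vacuous once it exceeds 1.
def rate(theta):
    return 2 * math.sqrt(0.4 ** (1 - theta) * 0.6 ** (1 + theta))

print(rate(0.0))                     # ~0.9798: recovers Theorem 1's rate
print(rate(0.1))                     # ~0.99986: shrinks, but very slowly
print(1 / math.log(1 / rate(0.1)))   # ~7260: the constant in the text above
print(rate(0.2))                     # ~1.0203: the bound is vacuous here
```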
<p>To get (2), they prove a bound on the combination of weak learners with margin bounds.</p>
<p><em><strong>Theorem 4</strong> [Margin-based generalization; Theorem 2 of BFLS98]: Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Let \(\mathcal{H}\) be a family of “base classifiers” (weak learners) with \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), any weighted average \(h_f(x) = \sum_{j=1}^T p_j h^j(x)\) for \(p_j \in [0,1]\), \(\sum_j p_j = 1\), and \(h^j \in \mathcal{H}\) satisfies the following inequality:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) = \text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} + O\left(\sqrt{\frac{d \log^2(m/d)}{m\theta^2} + \frac{\log(1/\delta)}{m}}\right).\]
<p>This is fantastic compared to Theorem 2 because the generalization bound does not worsen as \(T\) increases.
The opposite effect actually occurs: as AdaBoost continues to run, Theorem 3 shows that the margin increases (up to a point), which strengthens the bound without trade-off!</p>
<p>We can instantiate the bound in the setting described above to show what a nice generalization bound can look like for boosting. If, once again, \(\epsilon_t \leq 0.4\), then taking \(\theta = 0.1\) and \(T = 7260 \ln m\) gives</p>
\[\text{err}_{\mathcal{D}}(h_f) = O\left(\sqrt{\frac{d \log^2(m/d) + \log(1/\delta)}{m}} \right).\]
<p>In this case, we can have our cake and eat it too; we increase the model complexity and expressivity by increasing \(T\), but we don’t sustain the basic trade-offs between training and generalization error discussed at the beginning of the post.</p>
<p>To illustrate why, we give a high-level overview of the proof and show how the rough intuition that “decisive classification leads to robustness, leads to generalization” holds up.</p>
<ul>
<li>The proof uses an approximation of \(h_f = \sum_{j=1}^T p_j h^j\) by sampling \(N\) classifiers \(\hat{h}_1, \dots, \hat{h}_N\) independently from \(h^1, \dots, h^T\) weighted by \(p_1, \dots, p_T\). It averages them together to obtain \(g = \frac1{N} \sum_{k=1}^N \hat{h}_k.\)</li>
<li>
<p>The proof decomposes the population error term into other quantities by using properties of conditional probability:</p>
\[\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \text{Pr}_{\mathcal{D}}\left[y g(x) \leq \frac{\theta}{2}\right] + \text{Pr}_{\mathcal{D}}\left[y g(x) > \frac{\theta}{2}, y h_f(x) \leq 0\right].\]
</li>
<li>The second term can be shown to be small with high probability over \(g\) via a Chernoff bound when \(N\) and \(\theta\) are large. Since \(h_f = \mathbb{E}[g] = \mathbb{E}[\hat{h}_k]\), it’s unlikely that \(yg(x)\) and \(yh_f(x)\) will differ greatly from one another.</li>
<li>By principles of VC dimension, the <a href="https://en.wikipedia.org/wiki/Sauer%E2%80%93Shelah_lemma" target="_blank">Sauer-Shelah lemma</a>, and concentration bounds (this time over the <em>sample</em>) for large \(m\), the first term will be roughly the same as \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \}.\)</li>
<li>
<p>Using the same conditional probability argument as before, that same term can be decomposed into</p>
\[\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \} \leq \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \} + \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 , y_i h_f(x_i) > \theta\}.\]
</li>
<li>Using Chernoff bounds shows the second term of the expression is small with high probability over \(g\). Thus, \(\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0]\) is approximately bounded by \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \}\), plus an error term that accumulates as a result of the concentration bounds.</li>
<li>Having a large \(\theta\) means that we have plenty of room for the Chernoff bounds over \(g\) to be strong, which corresponds to the <em>robustness</em> discussed before. If \(\theta\) were small, then it would be very easy to have \(y h_f(x) \leq 0\) and \(y g(x) > \theta/2\) simultaneously, which would make the argument impossible.</li>
</ul>
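<p>The first Chernoff step can be seen in a small simulation (my own illustration, not from the paper): fix a point with true margin \(y h_f(x) = \theta\), sample \(N\) base classifiers from the vote, and check how often the sampled average \(g\) falls below \(\theta/2\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose a point has true margin y * h_f(x) = theta under the weighted vote.
# Each of the N base classifiers sampled from the vote then agrees with y
# independently with probability (1 + theta) / 2, and g is their average.
theta, N, trials = 0.2, 400, 20_000
votes = rng.choice([1.0, -1.0], size=(trials, N), p=[(1 + theta) / 2, (1 - theta) / 2])
g = votes.mean(axis=1)  # realizations of y * g(x)

# How often does the sampled average fall below half the true margin?
bad = float(np.mean(g <= theta / 2))
# A Hoeffding-style bound for i.i.d. variables in [-1, 1].
hoeffding = float(np.exp(-N * theta ** 2 / 8))
print(f"empirical P[y*g <= theta/2] = {bad:.4f}  (Hoeffding-style bound: {hoeffding:.4f})")
```

<p>With a true margin of \(\theta = 0.2\) and \(N = 400\) sampled classifiers, the sampled vote almost never loses half its margin, which is exactly the slack the proof exploits.</p>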
<h2 id="last-thoughts">Last thoughts</h2>
<p>I read these boosting papers in 2017 while taking my first graduate seminar, which surveyed a variety of papers in ML theory.
I enjoyed the papers then, but the remarkability of this generalization result was lost on me at the time.
Now, I find this much more exciting because it gives a setting where a model can obtain provably great generalization error despite overfitting the data and being “over-parameterized.” (If we count the number of parameters used in all of the classifiers that vote, there can be many more parameters than samples \(m\).)
The proof is elegant and does not require strange and adversarial distributions over training data.
Granted, the assumption that there exists a weak learner that always returns a classifier with error at most (say) 0.4 is a strong one, but the result is remarkable nonetheless.</p>
<p>Thanks for reading! Leave a comment if you have any thoughts or questions. (As long as the comments system isn’t buggy on your end–I’m still sorting out some issues.) See you next time!</p>Clayton Sanford[OPML#7] BLN20 & BS21: Smoothness and robustness of neural net interpolators2021-09-22T00:00:00+00:002021-09-22T00:00:00+00:00http://blog.claytonsanford.com/2021/09/22/bubeck<p><em>This is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>This post discusses two papers by Sébastien Bubeck and his collaborators that are of interest to the study of over-parameterized neural networks. The first, <a href="https://arxiv.org/abs/2009.14444" target="_blank">“A law of robustness for two-layers neural networks” (BLN20)</a> with Li and Nagaraj, gives a conjecture about the “robustness” of a two-layer neural network that interpolates all of the training data. The second, <a href="https://arxiv.org/abs/2105.12806" target="_blank">“A universal law of robustness via isoperimetry” (BS21)</a> with Sellke, proves part of the conjecture and extends that part of the conjecture to deeper neural networks.
The other part of the conjecture remains open for future work to tackle.</p>
<p>Both papers consider a setting where there are \(n\) training samples \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}\) drawn from some distribution that are fit by a neural network with \(k\) neurons.
For the two-layer case (which we’ll focus on in this writeup), they consider neural networks of the form</p>
\[f(x) = \sum_{j=1}^k u_j \sigma(w_j^T x + b_j),\]
<p>where \(\sigma(t) = \max(0, t)\) is the ReLU activation function and \(w_j \in \mathbb{R}^d\) and \(b_j, u_j \in \mathbb{R}\) are the parameters.
Roughly, they ask whether there exists a “smooth” neural network \(f\) such that \(f(x_i) \approx y_i\) for all \(i \in [n]\); this makes \(f\) an approximate interpolator.</p>
<p><em>How does this relate to the rest of this blog series?</em>
All of the other posts so far have been about cases where over-parameterized linear regression leads to favorable generalization performance.
These generalization results occur due to the smoothness of the linear prediction rule.
That is, if we have some prediction rule \(x \mapsto \beta^T x\) for \(x, \beta \in \mathbb{R}^d\) with \(d \gg n\), we might have good generalization if \(\|\beta\|_2\) is small, which is enabled when \(d\) is very large.
The same observation holds up with neural networks (over-parameterized models lead to benign overfitting), but it’s harder to prove why this leads to a small generalization error.
Understanding the smoothness of interpolating neural networks might make it easier to prove generalization bounds on the neural networks that perfectly fit the training data.</p>
<p><em>How do they measure smoothness?</em>
For linear regression, it’s natural to think of the smoothness of the prediction rule \(f_{\text{lin}}(x) = \beta^T x\) as \(\|\beta\|_2\), since that is the magnitude of the gradient \(\|\nabla f_{\text{lin}}(x)\|_2\) at every sample \(x\).
For two-layer neural networks—which are non-linear functions—it’s natural instead to consider the maximum norm of the gradient of \(f\), which is represented by the Lipschitz constant of \(f\): the minimum \(L\) such that \(|f(x) - f(x')| \leq L \|x - x'\|_2\) for all \(x, x'\). (Lipschitzness also comes up frequently in my <a href="/2021/08/15/hssv21.html" target="_blank">COLT paper about the approximation capabilities of shallow neural networks</a>.)</p>
<p><em>What does it have to do with robustness?</em>
Typically, robustness is discussed in the context of adversarial examples.
If you’ve hung around the ML community, you’ve probably seen this issue featured in images like this:</p>
<p><img src="/assets/images/2021-09-22-bubeck/panda.png" alt="" /></p>
<p>Here, an image of a panda is provided that a trained image classification neural network clearly identifies as such.
However, a small amount of noise can be added to the image that leads to the network being tricked into thinking that it’s a gibbon instead.
Put roughly, it means that the network outputs \(f(x) = \text{"panda"}\) and \(f(x + \epsilon \tilde{x}) = \text{"gibbon"}\) for some \(x\) and \(\tilde{x}\), which means that the output of \(f\) changes greatly near \(x\).
By mandating that \(f\) have a small Lipschitz constant, these kinds of fluctuations are impossible.
This makes the network \(f\) <em>robust</em>.
Thus, enforcing smoothness conditions is a way to ensure that a predictor is robust to these kinds of adversarial examples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/smooth.jpeg" alt="" /></p>
<p>As a result, Bubeck and his collaborators want to characterize the availability of interpolating networks \(f\) that are also robust, with the hopes of understanding how over-parameterization can be used to avoid having adversarial examples.</p>
<p>One important caveat: Unlike the previous papers discussed in this series, this one focuses only on approximation and not optimization.
It asks whether <em>there exists</em> an interpolating prediction rule that is smooth, but it does not ask whether this rule can be easily obtained from stochastic gradient descent.</p>
<p>For the rest of the post, I’ll discuss the conjecture made by BLN20, share the support for the conjecture that was provided by BLN20 and BS21, and discuss what remains to be studied in this space.</p>
<h2 id="the-conjecture">The conjecture</h2>
<p>For simplicity, BLN20 considers only samples drawn uniformly from the unit sphere: \(x \in \mathbb{S}^{d-1}= \{x \in \mathbb{R}^d: \|x\|_2=1\}\) with iid labels \(y_i \sim \text{Unif}(\{-1,1\})\).
The conjecture of BLN20, which combines their Conjectures 1 and 2 is as follows:</p>
<p><em>Consider some \(k \in [\frac{cn}{d}, Cn]\) for constants \(c\) and \(C\). With high probability over \(n\) random samples from some distribution, there exists a 2-layer neural network \(f\) of width \(k\) that perfectly fits the data such that \(f\) is \(O(\sqrt{n/k})\)-Lipschitz.
Furthermore, any neural network that fits the data must be \(\Omega(\sqrt{n/k})\)-Lipschitz with high probability.</em></p>
<p>If true, the conjecture suggests there can only be an \(O(1)\)-Lipschitz interpolating neural network \(f\) if the model is highly over-parameterized, or \(k = \Omega(n)\).
Note that \(k\) is the number of neurons, and not the number of parameters.
In the case of a 2-layer neural network, the number of parameters is \(p = kd\), so there must be at least \(p = \Omega(nd)\) parameters for the interpolating network to be smooth.</p>
<p>The conditions with constants \(c\) and \(C\) are necessary for the question to be well-posed.</p>
<ul>
<li>Without the \(k \leq Cn\) constraint, the conjecture would imply the existence of neural networks that fit the data and are \(o(1)\)-Lipschitz. However, this is not possible unless all training samples have the same label \(y_i\); otherwise, there are at least two different samples \(x_i\) and \(x_j\) that are at most distance 2 apart (since both lie on \(\mathbb{S}^{d-1}\)) and have opposite labels. This implies that any function fitting both samples must be at least 1-Lipschitz.</li>
<li>Without the \(k \geq \frac{cn}{d}\) constraint, there is unlikely to exist any neural network with \(k\) neurons that can fit the \(n\) samples. Since the number of parameters \(p\) is roughly \(kd\), letting \(k \ll \frac{n}{d}\) would ensure that \(p \ll n\) and there are fewer parameters than samples. Intuitively, it’s difficult to fit a large number of points with random labels when there are fewer parameters than samples. This suggests that the model must be over-parameterized for interpolation to even occur in the first place, let alone be smooth.</li>
</ul>
<p>BLN20 shows that the conjecture holds up empirically on toy data.
For many values of \(n\) and \(k\), they train several neural networks to fit the \(n\) samples with 2-layer neural networks of width \(k\) and randomly sample gradients to find the one with the largest magnitude.
When plotted, they note a nice linear relationship between the norms of the largest random gradient and \(\sqrt{n/k}\).
Of course, the maximum random gradient is not the same as the Lipschitz constant, since it’s impossible to check the gradient for all values of \(x\) simultaneously, but this suggests that it’s likely that the conjecture is correct.</p>
<p><img src="/assets/images/2021-09-22-bubeck/plot.png" alt="" /></p>
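<p>Since the gradient of a two-layer ReLU network has the closed form \(\nabla f(x) = \sum_j u_j \mathbb{1}\{w_j^T x + b_j &gt; 0\} w_j\), this sampling procedure is easy to sketch (a toy version of my own with a random network; the actual experiments use trained networks):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 20, 50

# A random two-layer ReLU network f(x) = sum_j u_j * relu(w_j^T x + b_j).
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
u = rng.normal(size=k)

def grad_f(x: np.ndarray) -> np.ndarray:
    """Gradient of f at x: sum over active neurons of u_j * w_j."""
    active = (W @ x + b > 0).astype(float)
    return (u * active) @ W

# Estimate the Lipschitz constant by the largest gradient norm over random points
# on the unit sphere; this is only a lower bound on the true Lipschitz constant,
# since we cannot check every x.
xs = rng.normal(size=(1000, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
lip_estimate = max(np.linalg.norm(grad_f(x)) for x in xs)
print(f"estimated Lipschitz constant: {lip_estimate:.2f}")
```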
<h2 id="partial-upper-bounds-from-bln20">Partial upper bounds from BLN20</h2>
<p>The BLN20 paper focuses on presenting the conjecture and giving a series of partial results that suggest it may be true. In this section, we give a brief summary of each of the partial solutions.</p>
<p>The following are all partial solutions to the upper bound. That is, they show weaker versions of the claim that there exists a neural network \(f\) with Lipschitz constant \(O(\sqrt{n/ k})\) by showing either larger bounds on the Lipschitz constant or more restrictive parameter regimes.</p>
<ul>
<li><strong>The high-dimensional case (3.1).</strong> If \(d \gg n\), then a ReLU network with a single neuron \(k = 1\) can be used to perfectly fit the data.
This is because a single \(d\)-dimensional hyperplane will be able to fit the \(n\) samples, so one can just choose the hyperplane with the lowest magnitude that fits the data and use a ReLU that corresponds to that hyperplane. By similar analysis to that of linear regression, the Lipschitz constant of this network will be \(O(\sqrt{n})\) with high probability, which is the same as \(O(\sqrt{n/ k})\). This can’t be improved without using more neurons.
<img src="/assets/images/2021-09-22-bubeck/single.jpeg" alt="" /></li>
<li><strong>The wide (“optimal size”) regime: \(k = n\) (3.2).</strong> With high probability, a \(10\)-Lipschitz network \(f\) can be provided by using a ReLU for every sample. Each ReLU is treated as a “cap” that gives a sample the correct label. With high probability, the points will be sufficiently spread apart in \(\mathbb{S}^{d-1}\) to ensure that none of the caps overlap. This makes the norm of the gradient never more than \(10\), if each cap is offset by \(\frac{1}{10}\).
<img src="/assets/images/2021-09-22-bubeck/cap.jpeg" alt="" /></li>
<li><strong>The compromise case (3.3).</strong> The two previous approaches can be combined for a broader choice of \(k\) and \(n\) by instead having each ReLU perfectly fit \(m := n/k \leq d\) samples in a cap. However, since these are bigger and more complex caps than before, we need to be more concerned about the caps overlapping. They show that \(O(m \log d)\) caps will overlap at any given point, which means that the Lipschitz constant will be \(O(n\log (d) / k)\). Even disregarding the logarithmic factor, this is still much weaker than the \(O(\sqrt{n/k})\) factor that the conjecture desires.
<img src="/assets/images/2021-09-22-bubeck/combo.jpeg" alt="" /></li>
<li><strong>The very low-dimensional case with a weird architecture (3.4).</strong>
They prove the existence of a neural network that fits \(n\) samples and has Lipschitz constant \(O(\sqrt{n / k})\) with high probability. To do so, however, they need several major caveats:
<ul>
<li>The dimension \(d\) is very small; for some constant even integer \(q\), \(k = C_q d^{q-1}\) and \(n \approx \frac{d^q}{100 q \log d}\), where \(C_q\) depends on \(q\). Note that the number of neurons \(k\) can be much bigger than the number of samples \(n\) when \(d\) is very small and \(q\) is large.</li>
<li>\(f\) approximately interpolates the samples. That is, \(\lvert f(x_i) - y_i\rvert \leq 0.1 C_q\) for all \(i \in [n]\). (Note that 0.1 can be replaced by \(\epsilon\) and the result can be generalized.)</li>
<li>The neural network uses the activations \(t \mapsto t^q\) and not the ReLU function.</li>
</ul>
<p>This can be thought of as a tensor interpolation problem. Specifically, for \(q = 2\), they perform regression on the space \(x^{\otimes 2} = (x_1^2, x_1x_2, \dots, x_1 x_d, x_2x_1, x_2^2, \dots, x_d^2)\) using the quadratic activation function.
This approach gives the kind of bound they’re looking for, but is a strange enough case that it’s unclear how to extend this to networks with (1) high input dimensions, (2) perfect interpolation, and (3) ReLU activations.</p>
</li>
</ul>
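<p>The core of the high-dimensional case (3.1) is minimum-norm linear interpolation, which we can illustrate numerically (my own sketch under the uniform-sphere setup, not the paper’s construction):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 5000  # far more dimensions than samples

# Random points on the unit sphere with independent random +/-1 labels.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

# The minimum-norm linear interpolator (the hyperplane underlying case 3.1).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(X @ beta, y)  # it perfectly fits the data
# Its Lipschitz constant is ||beta||_2, which scales like sqrt(n) when d >> n.
print(f"||beta|| = {np.linalg.norm(beta):.2f} vs sqrt(n) = {np.sqrt(n):.2f}")
```

<p>Because the unit-norm rows of \(X\) are nearly orthogonal when \(d \gg n\), the interpolator’s norm lands almost exactly on \(\sqrt{n}\), matching the \(O(\sqrt{n})\) Lipschitz bound for a single neuron.</p>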
<p>The paper also gives a few constrained versions of the lower bound on the Lipschitz constant for any interpolating function. However, we omit them here because the second paper—BS21—has much better lower bounds.</p>
<h2 id="lower-bound-from-bs21">Lower bound from BS21</h2>
<p>The follow-up paper proves a mostly-tight lower bound, which effectively resolves half of the conjecture.
The results require <em>isoperimetry</em> to hold, which is true of a random variable \(x \in \mathbb{R}^d\) if \(f(x)\) has subgaussian tails for every Lipschitz function \(f\).
This holds for well-known distributions such as (1) multivariate Gaussian distributions, (2) the uniform distribution on \(\mathbb{S}^{d-1}\), and (3) the uniform distribution on the hypercube \(\{-1, 1\}^d\).</p>
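<p>To get a feel for what isoperimetry buys, note that \(f(x) = \|x\|_2\) is 1-Lipschitz, and for Gaussian inputs its fluctuations stay \(O(1)\) no matter how large \(d\) gets (a quick simulation of my own):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# f(x) = ||x||_2 is 1-Lipschitz; for Gaussian inputs, isoperimetry says f(x)
# has subgaussian fluctuations, so its standard deviation stays O(1) even as
# the mean grows like sqrt(d).
stds = {}
for d in (10, 100, 1000):
    norms = np.linalg.norm(rng.normal(size=(10_000, d)), axis=1)
    stds[d] = float(norms.std())
    print(f"d = {d:4d}: mean ||x|| = {norms.mean():6.2f}, std = {stds[d]:.3f}")
```

<p>The mean grows like \(\sqrt{d}\), but the standard deviation stays put: the dimension-free concentration that powers the lower bound argument.</p>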
<p>By combining their Lemma 3.1 and Theorem 3, the following statement is true about 2-layer neural networks:</p>
<p><em>Let \(\mathcal{F}\) be a family of 2-layer neural networks of width \(k\) with parameters in \([-W, W]\). Suppose each sample \((x_i, y_i)\) is drawn from an isoperimetric distribution for all \(i \in [n]\) with \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\) and such that \(\| x_i \|_2 \leq R\) almost surely. Then, with high probability, any neural network \(f \in \mathcal{F}\) that perfectly fits all \(n\) training samples will have a Lipschitz constant of</em></p>
\[\Omega\left(\sqrt{\frac{n}{k \log (W R nk)}}\right).\]
<p>This is close to the conjecture up to logarithmic factors! In addition, this result is more general in the paper:</p>
<ul>
<li>Instead of considering only depth-2 neural networks, they consider all parametric models that change by bounded amounts as their parameter vectors change.</li>
<li>Within their study of neural networks, their analysis also addresses networks that share parameters.</li>
<li>A parameter \(\epsilon\) allows them to conclude that all networks that <em>nearly interpolate</em> must have high Lipschitz constant, not just those that perfectly fit the data.</li>
</ul>
<p>They also show that the bound \(W\) on the parameter magnitudes is necessary. Through their Theorem 4, they show the existence of a neural network with a small Lipschitz constant that approximates nearly all of the samples with only a single parameter.
Thus, without these kinds of assumptions, the conjecture is rendered uninformative.</p>
<p>The proof works by considering some fixed \(L\)-Lipschitz function \(f\) and asking how likely it is that \(n\) random samples are almost perfectly fit by \(f\).
By isoperimetry, this can be shown to happen with very low probability.
Then, by making use of an \(\epsilon\)-net argument, one can show that no \(L\)-Lipschitz function \(f\) can perfectly fit the samples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/cover.jpeg" alt="" /></p>
<p>While I breezed over the argument here, it’s a relatively simple one that can be followed by most people with some background in concentration inequalities.</p>
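<p>The key counting step, that a single <em>fixed</em> function almost never fits random labels, can be simulated directly; here is my own toy version with a fixed linear function:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 10, 25, 20_000

# Fix ONE Lipschitz function in advance (here, a linear one) and ask how often
# n points with independent uniform random labels are all fit correctly by it.
w = rng.normal(size=d)

X = rng.normal(size=(trials, n, d))
y = rng.choice([-1.0, 1.0], size=(trials, n))
all_fit = np.all(y * (X @ w) > 0, axis=1)
count = int(all_fit.sum())

# Each point is fit with probability exactly 1/2, so all n are fit with
# probability 2^(-n), about 3e-8 here; the epsilon-net argument then union
# bounds over a finite cover of all L-Lipschitz functions.
print(f"all {n} points fit in {count} of {trials} trials")
```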
<h2 id="further-questions">Further questions</h2>
<p>While the second paper resolves half of the open question from the first paper, the other half (the existence of a smooth interpolating neural network) remains open.</p>
<p>There are also a few caveats from the second paper that remain to be resolved. For one, it may be possible to loosen the restriction that there be non-zero label noise (i.e. \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\)).
In addition, the fact that \(\|x_i\|\) must always be bounded is a weakness, since it rules out Gaussian inputs; perhaps this could be improved.</p>
<p>Thanks for tuning in to this week’s blog post! See you next time!</p>Clayton SanfordThis is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam. Check out this post to get an overview of the topic and a list of what I’m reading.[OPML#6] XH19: On the number of variables to use in principal component regression2021-09-11T00:00:00+00:002021-09-11T00:00:00+00:00http://blog.claytonsanford.com/2021/09/11/xh19<!-- [XH19](https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf){:target="_blank"} [[OPML#6]](/2021/09/11/xh19.html){:target="_blank"} -->
<p><em>This is the 6th of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>Here’s another <a href="https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf" target="_blank">paper</a> by my advisor Daniel Hsu and his former student Ji (Mark) Xu that discusses when overfitting works in linear regression.
This one differs subtly from some of the previously discussed papers (like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>) in that it considers <em>principal component regression</em> (PCR) rather than least-squares regression.</p>
<h2 id="principal-component-regression">Principal component regression</h2>
<p>Suppose we have a collection of \(n\) samples \((x_i, y_i) \in \mathbb{R}^{N} \times \mathbb{R}\), which we collect in design matrix \(X \in \mathbb{R}^{n \times N}\) and label vector \(y \in \mathbb{R}^n\).
The standard approach to least-squares regression (which has been given numerous times on this blog) is to choose the \(\hat{\beta}_\textrm{LS} \in \mathbb{R}^N\) that minimizes \(\|X \hat{\beta}_\textrm{LS} - y\|_2\), breaking ties by minimizing the \(\ell_2\) norm \(\|\hat{\beta}_{\textrm{LS}}\|_2\).
This approach considers all dimensions of the inputs \(x_i\).</p>
<p>However, there might be a situation where we know the input covariance matrix \(\Sigma\) a priori and only want to consider the directions in \(\mathbb{R}^N\) that the inputs meaningfully vary along.
This is where <a href="https://en.wikipedia.org/wiki/Principal_component_regression" target="_blank">principal component regression</a> comes in.
Instead of regressing on the training data itself, we regress on the \(p\) most significant dimensions of the data, as identified by <a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal component analysis</a> (PCA).
PCA is a linear dimensionality reduction method that obtains a lower-dimensional representation of \(X\) by approximating each sample as a linear combination of the \(p\) eigenvectors of \(X^T X\) with the largest corresponding eigenvalues.
These \(p\) eigenvectors correspond to the directions in \(\mathbb{R}^N\) where the samples in \(X\) have highest variance.
Moreover, projecting each of the \(n\) samples \(x_i\) onto the space spanned by these \(p\) eigenvectors provides the closest average \(\ell_2\)-approximation of each \(x_i\) as a linear combination of \(p\) fixed vectors in \(\mathbb{R}^N\).</p>
<p>Let \(\mathbb{E}[x_i] = 0\) and \(\Sigma = \mathbb{E}[x_i x_i^T]\) be the covariance matrix of \(x_i\).
If we know \(\Sigma\) ahead of time, then we can simplify things by using only the eigenvectors of \(\Sigma\), rather than the empirical principal components taken from eigenvectors of \(X^T X\).
If the \(p\) eigenvectors \(\Sigma\) with the largest eigenvalues are collected in \(V \in \mathbb{R}^{N \times p}\), then we can express the low-dimensional representation of the training samples as \(X V \in \mathbb{R}^{n \times p}\).
By applying linear regression to these new low-dimensional samples and transforming the resulting parameter vector back to \(\mathbb{R}^N\), we get the parameter vector \(\hat{\beta} = V(X V)^{\dagger} y\), where \(\dagger\) denotes the pseudo-inverse.
(On the other hand, the least-squares parameter vector is \(\hat{\beta}_\textrm{LS} = X^{\dagger} y\).)</p>
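<p>In code, the two estimators differ by a single projection. Here’s a minimal numpy sketch of the idealized setting (diagonal \(\Sigma\) known exactly; my own illustration, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, p = 40, 100, 10

# Known diagonal covariance with decaying eigenvalues (eigenvectors are the
# standard basis vectors, so PCA just keeps the first p coordinates).
lam = 1.0 / np.arange(1, N + 1)
X = rng.normal(size=(n, N)) * np.sqrt(lam)  # rows x_i ~ N(0, Sigma)
beta_true = rng.normal(size=N)
y = X @ beta_true  # noiseless labels

V = np.eye(N)[:, :p]  # top-p eigenvectors of Sigma

beta_pcr = V @ np.linalg.pinv(X @ V) @ y  # PCR: regress on XV, map back to R^N
beta_ls = np.linalg.pinv(X) @ y           # minimum-norm least squares

assert np.allclose(beta_pcr[p:], 0)  # PCR only uses directions spanned by V
assert np.allclose(X @ beta_ls, y)   # min-norm least squares interpolates (N > n)
print("PCR train residual norm:", np.linalg.norm(X @ beta_pcr - y))
```

<p>Note the contrast: with \(p &lt; n\), PCR leaves a nonzero training residual, while the full least-squares solution interpolates exactly whenever \(N &gt; n\).</p>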
<p>The below image visualizes the differences between the least squares and PCR regression algorithms.
It shows a toy example where samples \((x, y)\) (in purple) vary greatly in one direction and not much at all in another direction.
PCR only considers the direction of maximum variance and rules the other out, while least squares considers all directions simultaneously.
Therefore, the hypotheses represented by the green hyperplanes look subtly different for each case.</p>
<p><img src="/assets/images/2021-09-11-xh19/vis.jpeg" alt="" /></p>
<p>Note that this formulation of PCR concerns an idealized setting.
Most regression tasks do not give the learner direct access to \(\Sigma\).
However, it’s possible that \(\Sigma\) could be separately estimated by some \(\hat{\Sigma}\), which PCA could then use in its place.
The authors refer to this as “semi-supervised” because \(\Sigma\) can be estimated using only unlabeled samples, since none of the labels \(y\) are used in the approximation.
Due to the high cost of obtaining labeled data, a sufficient dataset for this kind of estimate may be significantly easier to obtain than a dataset for the general learning task.</p>
<h2 id="learning-model-and-assumptions">Learning model and assumptions</h2>
<p>They make several restrictive assumptions.
The main purpose of this paper is to construct instances where favorable over-parameterization occurs for PCR, rather than exhaustively catalogue when it must occur.</p>
<p>They assume the samples \(x_i\) have independent Gaussian components and that labels \(y_i = \langle x_i, \beta\rangle\) have no noise.
\(\Sigma\) is a diagonal matrix (which must be the case because of the independent components of each \(x_i\)) with entries \(\lambda_1 > \dots > \lambda_N > 0\).
Therefore, PCR will only use the first \(p\) diagonal entries of \(\Sigma\) and the reduced-dimension version of each sample will merely be its first \(p\) entries.</p>
<p>One weird thing about this paper relative to others is that the true parameter vector \(\beta\) is chosen randomly.
This means it’s an “average-case” bound.
They justify this on the grounds that the ability to choose an arbitrary \(\beta\) could lead to all of the weight being put on the \(N-p\) components that will not be included in the PCA’d version of \(X\).
This would make it impossible to have non-trivial error bounds.</p>
<h2 id="over-parameterization-and-pcr">Over-parameterization and PCR</h2>
<p>Now, we have three parameters to consider (\(N, p, n\)), rather than the two (\(p, n\)) typically considered in the previous works on over-parameterization.
As before, they think of over-parameterization as the ratio \(\gamma = \frac{p}{n}\), but they must also contend with the ratios \(\alpha = \frac{p}{N}\) (the fraction of dimensions preserved by PCA) and \(\rho = \frac{n}{N}\) (the ratio of samples to original dimension).</p>
<p>Like <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>, they consider what happens when \(N, p, n \to \infty\) and the ratios remain fixed.
Like BLLT19, their results study how over-parameterization is affected as the eigenvalues of \(\Sigma\) change.
In Section 2, they focus on eigenvalues \(\lambda_1, \dots, \lambda_N\) that decay predictably at a polynomial rate.
Theorems 1 and 2/3 characterize what happens to the expected error in the under-parameterized (\(\gamma \leq 1\)) and over-parameterized (\(\gamma > 1\)) regimes, respectively.</p>
<ul>
<li>Theorem 1 shows that the shape of the “classical” regime error curve is preserved in the under-parameterized regime: the error decreases as \(\alpha\) increases for fixed \(\rho\), up to a point, after which it increases until \(\alpha = \rho\) (equivalently, \(p = n\)).</li>
<li>Theorem 2 shows that the expected error in the interpolation regime \(p > n\) converges to some fixed risk quantity, which can be determined by evaluating an integral and solving for a certain quantity.</li>
<li>Theorem 3 shows that for any polynomial rate of decay of the eigenvalues, double-descent will occur and the best interpolating prediction rule will perform better than the best “classical” prediction rule.
In the noisy setting, the best interpolating prediction rule will only outperform the best classical rule in the event that the rate of decay is no faster than \(\frac{1}{i}\).</li>
</ul>
<p>To recap, the optimal performance for PCR is obtained in the over-parameterized regime (with \(p > n\)) if and only if eigenvalues \(\lambda_1, \dots, \lambda_N\) decay slowly; rapid decay leads to optimality in the classical regime.
This echoes the results of BLLT19, which shows that too rapid a decay in eigenvalues causes poor performance in the over-parameterized regime (very-much-not-benign overfitting).
However, BLLT19 also requires that the rate of decay not be too slow, which is a non-issue in this regime.</p>
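<p>The qualitative picture is easy to reproduce in a small simulation (my own sketch with one slow polynomial decay rate, not the paper’s exact setting): the risk of PCR spikes near the interpolation threshold \(p = n\) and comes back down in the over-parameterized regime:</p>

```python
import numpy as np

rng = np.random.default_rng(6)
N, n = 400, 80
lam = np.arange(1, N + 1, dtype=float) ** -0.2  # slowly decaying eigenvalues

def pcr_risk(p: int, trials: int = 10) -> float:
    """Average excess risk (beta_hat - beta)^T Sigma (beta_hat - beta) of PCR."""
    risks = []
    for _ in range(trials):
        beta = rng.normal(size=N)
        beta /= np.linalg.norm(beta)  # random unit-norm true parameter
        X = rng.normal(size=(n, N)) * np.sqrt(lam)
        y = X @ beta  # noiseless labels
        beta_hat = np.zeros(N)
        beta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y  # regress on first p coords
        risks.append(lam @ (beta_hat - beta) ** 2)
    return float(np.mean(risks))

results = {p: pcr_risk(p) for p in (n // 2, n, 2 * n, 4 * n)}
for p, r in results.items():
    print(f"p = {p:3d} (gamma = p/n = {p / n:.1f}): risk ~ {r:.3f}")
```

<p>On typical runs the risk blows up at \(\gamma = 1\) and drops back down for \(\gamma &gt; 1\), the double-descent shape described above.</p>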
<p>One of the nice things about this paper–which will be expanded on in the weeks to come–is that it separates the number of parameters \(p\) from the dimension \(N\).
Talking about over-parameterization in linear regression is often awkward because the two quantities are coupled, and we are forced to ask whether favorable behavior in the over-parameterized regime is caused by the high dimension or the high parameter count.
We’ll further examine models with separate dimensions and parameter counts when we study random feature models.</p>Clayton Sanford