This is the fourth of a sequence of blog posts that summarize papers about over-parameterized ML models.

This week’s paper is one by Hastie, Montanari, Rosset, and Tibshirani, which studies the cases in over-parameterized least-squares regression where the generalization error is small. It follows in the vein of the papers reviewed so far (BHX19 [OPML#1], BLLT19 [OPML#2], and MVSS19 [OPML#3]), which all present circumstances where such “benign overfitting” takes place in \(\ell_2\)-norm minimizing linear regression.

This summary will be a bit shorter than the previous ones, since a lot of the ideas here have already been discussed. The paper is highly mathematically involved; it covers a lot of ground and gives theorems that are very general. However, the core message about when it’s possible for favorable interpolation to occur is similar to that of BLLT19, so I’ll mainly focus on presenting the results of this paper on a high level and explaining the similarities between the two papers.

The paper is also nearly seventy pages long, and there’s a lot of interesting content about non-linear models and mis-specified models (which generalize the case of double-descent considered in BHX19) that I won’t discuss for the sake of brevity.

The paper differs from BLLT19 because it considers a broader range of data distributions (e.g. the samples \(x_i\) need not be drawn from a probability distribution with subgaussian tails) and because it works in an asymptotic regime. Concretely, the three papers considered previously give bounds in terms of the number of samples \(n\) and the number of parameters \(p\), where both are taken to be large, but not infinite. Here, we instead fix some ratio \(\gamma = \frac{p}{n} > 1\) representing how over-parameterized the model is and ask what happens as \(n, p \to \infty\). This means we’ll need to consider subtly different settings than the ones I discussed in my post about BLLT19, because some of those have \(p = \infty\) and another has \(p = \Theta(n \log n)\); here, the number of parameters \(p\) must grow linearly with the number of samples \(n\).

Data model

The data model is mostly the same as the previous papers, minus the aforementioned differences in distributional assumptions and growth of \(n\) and \(p\).

We draw \(n\) random samples \((x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}\), where \(x_i\) is drawn from a distribution with mean \(\mathbb{E}[x_i] = 0\), covariance \(\mathbb{E}[x_i x_i^T] = \Sigma\), and bounded low-order moments. (This moment assumption is weaker than subgaussianity, which makes these results more impressive.) For some parameter vector \(\beta \in \mathbb{R}^p\) and random noise \(\epsilon_i\) with variance \(\sigma^2\), the label \(y_i\) is set by taking \(y_i = \langle x_i, \beta \rangle + \epsilon_i\).

For simplicity, we’ll assume (as we have before) that \(\Sigma\) is a diagonal matrix with entries \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0\). That way, we can assume each coordinate \(x_{i,j}\) of \(x_i\) is drawn independently and we can consider how the output of least-squares regression is affected by the variances \(\lambda_1, \dots, \lambda_p\). (The paper allows \(\Sigma\) to be any symmetric positive definite matrix, and it instead considers the output of least-squares regression in terms of the eigenvalues of \(\Sigma\), rather than the variances of each independent component.)

As mentioned above, for some fixed over-parameterization ratio \(\gamma > 1\), we’ll let \(p = \gamma n\) and let \(n \to \infty\).
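To make this concrete, here is a minimal simulation sketch of the data model. I use Gaussian coordinates for simplicity (the paper’s moment assumptions are weaker than this), and the spectrum, \(\beta\), and noise level below are placeholders of my own choosing, not values from the paper:

```python
import numpy as np

def sample_data(n, p, lambdas, beta, sigma, rng):
    """Draw n samples with independent coordinates x_{i,j} ~ N(0, lambda_j)
    and labels y_i = <x_i, beta> + eps_i, where eps_i ~ N(0, sigma^2)."""
    X = rng.standard_normal((n, p)) * np.sqrt(lambdas)  # covariance Sigma = diag(lambdas)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

rng = np.random.default_rng(0)
gamma, n = 2.0, 200
p = int(gamma * n)                      # over-parameterization ratio p / n = gamma > 1
lambdas = np.linspace(1.0, 0.1, p)      # variances bounded away from 0 and infinity
beta = rng.standard_normal(p) / np.sqrt(p)
sigma = 0.5
X, y = sample_data(n, p, lambdas, beta, sigma, rng)
```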

Given a training sample collected in input matrix \(X\) and label vector \(y\), the solution to minimum-norm least-squares is the \(\hat{\beta} \in \mathbb{R}^p\) that minimizes \(\|\hat{\beta}\|_2\) and interpolates the training samples: \(X \hat{\beta} = y\). The goal—like in other papers about over-parameterized least-squares regression—is to bound the expected squared risk of the prediction rule \(\hat{\beta}\) on a new sample \(x\):

\[R_X(\hat{\beta}; \beta) = \mathbb{E}_{x}[(\langle x, \hat{\beta}\rangle - \langle x, \beta\rangle)^2].\]
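Continuing the hypothetical setup above, here is a minimal sketch of the min-norm interpolator and this risk. When \(p > n\) and \(X\) has full row rank, the Moore–Penrose pseudoinverse gives exactly the minimum-\(\ell_2\)-norm solution of \(X \hat{\beta} = y\):

```python
# Minimum-l2-norm interpolator: among all beta_hat with X beta_hat = y,
# the pseudoinverse solution X^+ y has the smallest l2 norm.
beta_hat = np.linalg.pinv(X) @ y

# With Sigma = diag(lambdas), the risk R_X(beta_hat; beta) is the
# Sigma-weighted squared error of the estimated coefficients.
risk = np.sum(lambdas * (beta_hat - beta) ** 2)

print(f"max interpolation error: {np.max(np.abs(X @ beta_hat - y)):.2e}")
print(f"risk on a new sample:    {risk:.4f}")
```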

Like in BLLT19, the analysis works by decomposing this risk into a bias term \(B_X(\hat{\beta}; \beta)\) and a variance term \(V_X(\hat{\beta}; \beta)\).

Main result

Their main result is Theorem 2, which shows that as \(n\) and \(p\) become arbitrarily large, the bias \(B_X\) and variance \(V_X\) converge to a predicted bias \(\mathscr{B}\) and predicted variance \(\mathscr{V}\). For this bound to hold, they require that for some constant \(M\) that does not depend on \(n\), the largest component variance has \(\lambda_1 \leq M\) and the smallest has \(\lambda_p \geq \frac{1}{M}\).

This means that the variances cannot decay to zero, which is exactly what BLLT19’s analysis relies on! It will still matter that some variances be significantly smaller than others, but not in the same way.

Now, I’ll define the predicted bias and variance and try to explain the intuition behind them:

\[\mathscr{B}(\gamma) = \left(1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\right) \sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]

and

\[\mathscr{V}(\gamma) = \sigma^2 \gamma \mathbf{c_0} \frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}},\]

where \(c_0\) depends on \(\gamma\) and satisfies

\[1 - \frac{1}{\gamma} = \frac{1}{p} \sum_{j=1}^p \frac{1}{1 + \gamma c_0 \lambda_j}.\]

Note: The predicted variance differs from the version presented in the paper. I additionally include the bolded \(c_0\) term, which I suspect was left out as a typo: as currently stated, the bound in Theorem 2 is inconsistent with the specialized bound in Theorem 1, so the variance statement appears to be missing a variable.
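Before unpacking these expressions, here’s a rough sketch (not from the paper) of how one could evaluate the predictions numerically for an arbitrary spectrum: solve the fixed-point equation for \(c_0\) by root-finding, then plug into the formulas above (with the bolded \(c_0\) included, per the note):

```python
import numpy as np
from scipy.optimize import brentq

def solve_c0(gamma, lambdas):
    """Solve 1 - 1/gamma = (1/p) * sum_j 1 / (1 + gamma * c0 * lambda_j) for c0 > 0.
    The right-hand side is decreasing in c0, so the root is unique."""
    f = lambda c0: np.mean(1.0 / (1.0 + gamma * c0 * lambdas)) - (1.0 - 1.0 / gamma)
    return brentq(f, 1e-12, 1e12)

def predicted_bias_variance(gamma, lambdas, beta, sigma):
    """Evaluate the predicted bias B(gamma) and variance V(gamma) defined above."""
    c0 = solve_c0(gamma, lambdas)
    g = gamma * c0
    denom = (1.0 + g * lambdas) ** 2
    ratio = np.sum(lambdas**2 / denom) / np.sum(lambdas / denom)
    B = (1.0 + g * ratio) * np.sum(beta**2 * lambdas / denom)
    V = sigma**2 * g * ratio
    return B, V
```

For example, calling `predicted_bias_variance` on the placeholder spectrum and \(\beta\) from the earlier sketch gives the limiting bias and variance that the theorem predicts for that setup.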

If you’re anything like me, you find these expressions a little terrifying and hard to understand. Let’s break them down into pieces to try to grasp how the value of \(\gamma\) affects the risk as \(n\) and \(p\) become large.

The rough intuition for the impact of over-parameterization on these two terms is that growth of \(\gamma\) hurts bias and helps variance. However, this doesn’t seem immediately obvious; indeed, the variance appears to grow as \(\gamma\) increases. It’s necessary to understand the product \(\gamma c_0\) in order to get why this is the case. We’ll first consider a simple isotropic case to understand what happens to that term, and then hand-wavily revisit the general case.

Isotropic data

For simplicity, consider the isotropic or spherical case, where \(\Sigma = I_p\) and \(\lambda_1 = \dots = \lambda_p = 1\). (“Isotropic” roughly means “the same in every direction.”) Then, taking \(c_0 := \frac{1}{\gamma(\gamma - 1)}\) satisfies the condition on \(c_0\). Now, we can plug \(c_0 \gamma = \frac{1}{\gamma - 1}\) into the expressions for predicted bias and predicted variance:

\[\mathscr{B}(\gamma) = \left(1 + \frac{1}{\gamma - 1} \right) \frac{1}{(1 + \frac{1}{\gamma-1})^2} \sum_{j=1}^p \beta_j^2 = \frac{\|\beta\|_2^2}{1 + \frac{1}{\gamma -1}} = \frac{\|\beta\|_2^2( \gamma-1)}{\gamma}.\]

\[\mathscr{V}(\gamma) = \frac{\sigma^2}{\gamma - 1}.\]

Thus, as \(\gamma\) becomes larger (and the learning model becomes more over-parameterized), the bias will approach \(\|\beta\|^2\) and the variance will approach zero. This isn’t really good news for the isotropic case… The bias rapidly approaches \(\|\beta\|^2\) as \(\gamma\) grows, which will make it impossible for the risk to be small.
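As a rough sanity check (again with made-up values, not from the paper), a quick simulation of the isotropic case lands near the limit \(\mathscr{B} + \mathscr{V} = \|\beta\|^2 \frac{\gamma - 1}{\gamma} + \frac{\sigma^2}{\gamma - 1}\), up to finite-\(n\) fluctuations:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n, sigma = 2.0, 400, 1.0
p = int(gamma * n)
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)            # normalize so that ||beta|| = 1

X = rng.standard_normal((n, p))         # isotropic: Sigma = I_p
y = X @ beta + sigma * rng.standard_normal(n)
beta_hat = np.linalg.pinv(X) @ y

empirical = np.sum((beta_hat - beta) ** 2)                  # the risk, since Sigma = I
predicted = (gamma - 1) / gamma + sigma**2 / (gamma - 1)    # B + V with ||beta|| = 1
print(f"empirical risk: {empirical:.3f}, predicted limit: {predicted:.3f}")
```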

It’s possible for the excess risk to decrease as \(\gamma\) grows in the case where the signal to noise ratio \(\frac{\|\beta\|^2}{\sigma^2}\) is large, but the excess risk will still be worse than it would be in parts of the classical regime where \(\gamma < 1\).

As with BLLT19, to see the benefits of overfitting, we need to look at how the variances decay in the anisotropic setting.

Intuition for the general case

We’ll continue to think of \(c_0 \gamma\) as something that decays to zero as \(\gamma\) becomes large. If that weren’t the case and \(c_0 \gamma\) were large, then each term \(\frac{1}{1 + \gamma c_0 \lambda_j}\) would be small for \(1 \leq j \leq p\), making it impossible for their average to equal \(1 - \frac{1}{\gamma}\), which is close to one when \(\gamma\) is large.
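For what it’s worth, a quick check with the hypothetical `solve_c0` helper from the sketch above bears this out numerically: on a fixed, bounded spectrum, \(\gamma c_0\) shrinks steadily as \(\gamma\) grows.

```python
lambdas = np.linspace(2.0, 0.5, 1000)   # an arbitrary spectrum bounded away from 0 and infinity
for gamma in (1.5, 2.0, 5.0, 20.0):
    print(f"gamma = {gamma:5.1f}   gamma * c0 = {gamma * solve_c0(gamma, lambdas):.4f}")
```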

Now, we’ll talk through each of the components of the predicted bias and variance to speculate in a hand-wavy way about how this result applies.

Let’s start with the variance term.

  • First, if we think of \(\gamma c_0\) as something like \(\frac{1}{\gamma - 1}\) (or at least something that decays as \(\gamma\) increases), then the variance goes to zero as the model becomes more over-parameterized. This checks out with our intuition from BLLT19 and BHX19.
  • Also intuitively, the variance drops if the noise \(\sigma\) drops. If there’s no noise, then all of the model’s error will come from the bias.
  • Now, the hard part.

    \[\frac{\sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{\sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]

    will be thought of roughly as corresponding to the rate of component variance decay. Since \(\gamma c_0\) is small and the variances \(\lambda_j\) are bounded above, most of the \((1 + \gamma c_0 \lambda_j)^2\) terms should be close to 1. Making that sketchy simplification, we instead have

    \[\frac{\sum_{j=1}^p \lambda_j^2}{\sum_{j=1}^p \lambda_j}.\]

    This looks sorta similar to the reciprocal of the \(R_0(\Sigma)\) term from BLLT19, except that the reciprocal of \(R_0\) would square the denominator \(\sum_j \lambda_j\). The term (and hence the variance) is small when there’s a gap between the high-variance components and the low-variance components, i.e. when a few \(\lambda_j\)’s are much larger than the many small \(\lambda_j\)’s that carry most of the total variance. This corresponds to the requirement from BLLT19 that the decay must be sufficiently fast. (See the toy sketch right after this list.)
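Here’s the toy sketch referenced in the last bullet, using the simplified ratio \(\frac{\sum_j \lambda_j^2}{\sum_j \lambda_j}\) on two spectra I made up with the same top variance: a flat one and a gapped one where most components have much smaller variance.

```python
import numpy as np

def simplified_ratio(lambdas):
    """The hand-wavy approximation sum(lambda_j^2) / sum(lambda_j) of the variance factor."""
    return np.sum(lambdas**2) / np.sum(lambdas)

p = 1000
flat   = np.full(p, 1.0)                                   # every component equally important
gapped = np.r_[np.full(10, 1.0), np.full(p - 10, 0.05)]    # a few high-variance directions, many low-variance ones

print(f"flat spectrum:   {simplified_ratio(flat):.3f}")    # 1.000
print(f"gapped spectrum: {simplified_ratio(gapped):.3f}")  # about 0.21, much closer to the small variances
```

With the top variance held fixed, piling most of the components into the low-variance tail drags the ratio down toward the small variances, which (all else equal) shrinks the predicted variance.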

Thus, you get small variance if there’s some combination of heavy over-parameterization, low noise, and rapid decay of variances. Now, we look at bias.

  • The first term

    \[1 + \gamma c_0 \frac{ \sum_{j=1}^p \frac{\lambda_j^2}{(1 + \gamma c_0 \lambda_j)^2}}{ \sum_{j=1}^p \frac{\lambda_j}{(1 + \gamma c_0 \lambda_j)^2}}\]

    will be roughly 1 when the model is heavily over-parameterized (so that \(\gamma c_0\) is small) or when the variances \(\lambda_j\) drop sufficiently fast.

  • The final term

    \[\sum_{j=1}^p \frac{\beta_j^2 \lambda_j}{(1 + \gamma c_0 \lambda_j)^2}\]

    captures the alignment between the “important” directions of the true parameter \(\beta\) and the variances \(\lambda_j\). If we again treat \((1 + \gamma c_0 \lambda_j)^2 \approx 1\), then this term is roughly \(\sum_{j=1}^p \beta_j^2 \lambda_j\). This is approximately \(\|\beta\|^2\) (and thus large) if most of the weight of \(\beta\) lies in high-variance directions, and it will instead be small if the weight of \(\beta\) is spread across many medium-importance components. This seems analogous to the BLLT19 requirement that the decay of weights not be too rapid. (See the sketch right after this list.)
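And here’s the sketch referenced in that last bullet: with the same made-up gapped spectrum as before, the simplified term \(\sum_j \beta_j^2 \lambda_j\) is large when a unit-norm \(\beta\) sits entirely in a top-variance direction and much smaller when its weight is spread out evenly.

```python
import numpy as np

p = 1000
lambdas = np.r_[np.full(10, 1.0), np.full(p - 10, 0.05)]   # the same toy gapped spectrum as above

def simplified_bias_term(beta, lambdas):
    """The hand-wavy approximation sum_j beta_j^2 * lambda_j of the final bias factor."""
    return np.sum(beta**2 * lambdas)

beta_top = np.zeros(p)
beta_top[0] = 1.0                                          # all weight on a high-variance direction
beta_spread = np.ones(p) / np.sqrt(p)                      # unit weight spread evenly over all directions

print(f"beta on the top direction: {simplified_bias_term(beta_top, lambdas):.3f}")     # = lambda_1 = 1.0
print(f"beta spread out:           {simplified_bias_term(beta_spread, lambdas):.3f}")  # about 0.06
```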

Thanks for reading this blog post! As always, let me know if you have thoughts or feedback. (As of now, there’s no way to comment on the blog. My original attempt with Disqus led to the introduction of a bunch of terrible ads to this blog. I’ll be back with something soon, which will hopefully be less toxic.)