<h1>Books of 2021</h1>
<p>Clayton Sanford, 2022-01-03</p>
<p>Happy new year!
One of my primary goals of 2021 was to create this blog, which I actually managed to achieve.
In 2022, I intend to continue writing on this blog and to also post content that can be read by anyone, not just people who do research on machine learning theory.
As a first attempt to write something that’ll appeal to non-computer scientists, here’s a quick post listing and commenting on the non-technical books I read last year.</p>
<p>Stars [*] indicate a book was chosen by a book club.
I thoroughly enjoyed nearly every book on the list, but those with dagger [\(^\dagger\)] marks are the ones I’d particularly recommend.
If you have any thoughts, questions, or book recommendations, feel free to comment below or email me.</p>
<p>I read the earlier books on the list months ago, so some of my comments on them are a bit rusty.</p>
<ol>
<li><a href="#diamond-age-neal-stephenson"><em>Diamond Age</em>, Neal Stephenson</a></li>
<li><a href="#antifragile-nassim-nicholas-taleb"><em>Antifragile</em>, Nassim Nicholas Taleb</a></li>
<li><a href="#the-death-and-life-of-great-american-cities-jane-jacobs"><em>The Death and Life of Great American Cities</em>, Jane Jacobs</a></li>
<li><a href="#the-new-me-halle-butler"><em>The New Me</em>*, Halle Butler</a></li>
<li><a href="#the-midnight-library-matt-haig"><em>The Midnight Library</em>*, Matt Haig</a></li>
<li><a href="#pachinkodagger-min-jin-lee"><em>Pachinko</em>\(^\dagger\), Min Jin Lee</a></li>
<li><a href="#a-burning-megha-majumdar"><em>A Burning</em>, Megha Majumdar</a></li>
<li><a href="#new-york-2140-kim-stanley-robinson"><em>New York 2140</em>*, Kim Stanley Robinson</a></li>
<li><a href="#being-mortaldagger-atul-gawande"><em>Being Mortal</em>\(^\dagger\), Atul Gawande</a></li>
<li><a href="#the-man-who-mistook-his-wife-for-a-hat-oliver-sacks"><em>The Man Who Mistook His Wife for a Hat</em>*, Oliver Sacks</a></li>
<li><a href="#guns-germs-and-steel-jared-diamond"><em>Guns, Germs, and Steel</em>, Jared Diamond</a></li>
<li><a href="#drive-your-plow-over-the-bones-of-the-deaddagger-olga-tokarczuk-reread"><em>Drive Your Plow over the Bones of the Dead</em>\(^\dagger\), Olga Tokarczuk (reread) </a></li>
<li><a href="#a-brief-history-of-seven-killings-marlon-james"><em>A Brief History of Seven Killings</em>*, Marlon James</a></li>
<li><a href="#red-white-and-royal-blue-casey-mcquiston"><em>Red, White, and Royal Blue</em>, Casey McQuiston</a></li>
<li><a href="#the-smallest-light-in-the-universedagger-sarah-seager"><em>The Smallest Light in the Universe</em>\(^\dagger\), Sarah Seager </a></li>
<li><a href="#exhalationdagger-ted-chiang"><em>Exhalation</em>*\(^\dagger\), Ted Chiang</a></li>
<li><a href="#fun-homedagger-alison-bechdel"><em>Fun Home</em>*\(^\dagger\), Alison Bechdel</a></li>
<li><a href="#the-vegetarian-han-kang"><em>The Vegetarian</em>*, Han Kang</a></li>
<li><a href="#the-plague-albert-camus"><em>The Plague</em>, Albert Camus</a></li>
<li><a href="#the-remains-of-the-daydagger-kazuo-ishiguro"><em>The Remains of the Day</em>\(^\dagger\), Kazuo Ishiguro</a></li>
<li><a href="#conversations-with-friends-sally-rooney"><em>Conversations with Friends</em>, Sally Rooney</a></li>
</ol>
<h2 id="diamond-age-neal-stephenson"><em>Diamond Age</em>, Neal Stephenson</h2>
<p><img src="/assets/images/2022-01-03-books/diamond.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 499 pages, published in 2000.</p>
<p>Recommended by a friend, <em>Diamond Age</em> is a sci-fi novel set in a future world with advanced nano-technology, where society within cities has fragmented into discrete warring cultures.
His world building is fascinating; he explores the implications of ubiquitous nano-technology, where every household has a “matter compiler” that creates household objects as needed from “the Feed” of molecules distributed from a central source.
Stephenson’s world is sharply divided into “phyles,” global cultures that are sovereign in certain districts of cities across the world and impose their own sets of rules.
Some of the main characters belong to a “Neo-Victorian” phyle that controls a section of Shanghai, alongside the nearby “Han” and “Nippon” phyles, and the book comments on losses in translation between cultures and on the benefits and challenges of more centralized and decentralized ways of organizing societies. (He also follows a hierarchical faction of hackers that aims to restructure the world as it is.)</p>
<p>The book follows several plotlines and a collection of loosely-connected characters.
My favorite chapters were those that followed Nell (a young girl from a very poor background) as she interacts with the <em>Young Lady’s Illustrated Primer</em>, an interactive story-book that adapts to her real-world challenges to teach her independent thinking in a world where most people have become extremely passive.
There are plenty of strange asides and tangents (and a few plotlines at the end of the book involve unnecessary sexual violence), and some of the characters are less interesting than the ideas they represent; nonetheless, I found it to be a great read.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="antifragile-nassim-nicholas-taleb"><em>Antifragile</em>, Nassim Nicholas Taleb</h2>
<p><img src="/assets/images/2022-01-03-books/antifragile.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 519 pages, published in 2012.</p>
<p>I read <em>The Black Swan</em> by Taleb in 2020, which focuses on the idea that we (academics, world leaders, everyday people) repeatedly fail to consider “black swan” events: low-probability, high-significance events (like a terrorist attack or a pandemic) that are difficult to model, yet occur often enough to catch us unprepared.
He excoriates and ridicules so-called experts caught unaware by unexpected disasters because their models assume that risks roughly follow a <a href="https://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a>, where deviation from some mean value is uncommon; such models are good for considering natural phenomena (like heights of people and the number of people who die from heart disease every year), but perform terribly for phenomena where the interesting behavior exists far from the mean (such as the winner-take-all dynamics of wealth accumulation and the death tolls of wars).</p>
<p><em>Antifragile</em> is his follow-up to <em>The Black Swan</em>, focusing less on diagnosing the problem and more on constructing strategies to best handle these rare events.
He argues that systems (e.g. investment strategies, research agendas, foreign policies, health regimens) should be designed to not only handle uncertainty but grow stronger in the face of rare events and volatility.
Systems do so via nonlinear responses; for instance, he suggests a “barbell” investment strategy that puts most of a portfolio in extremely stable investments (think: bonds) and a small amount in ventures with low probabilities of astronomical success.
Doing so creates a “convex” response that bounds the amount of harm one suffers from failure while being open to extreme amounts of success. (Notably, Taleb has employed such strategies to great success, making lots of money by betting against the housing bubble that preceded the Great Recession and all the “suckers” who believed in it.)</p>
<p>Taleb regards antifragility as a broad life philosophy that extends well beyond just investment, frequently invoking classical scholars and focusing on a wide range of applications.
He ruthlessly criticizes those whose value systems are fragile to uncertainty, especially those who lack skin in the game and who are not the ones directly harmed by the failures of their models.
He reserves intense criticism for academics, most of whose research he deems at best ineffectual (since he believes innovation to come from tinkering and dealing with uncertainty directly) and at worst evil (from being held unaccountable when their models are deployed at a large scale and fail to work).
He made points that on occasion made me feel directly under attack.</p>
<p>Personally, I think his pugilistic style worked much better in <em>The Black Swan</em>, where his aim is to criticize, than in <em>Antifragile</em>, where his goals are more constructive.
I found his antifragility framework compelling in the areas where he has expertise (such as investment), but less so when he discusses topics like nutrition and politics, where his expertise is less apparent and he relies more heavily on ad hominem attacks on those who disagree with him.
His frequent assumptions of bad faith and incompetence grated on me intensely; at the same time, I certainly grew as a thinker by reading both of his books, and I intend to read more of his work later on.</p>
<p>If you can stand Taleb’s writing style, I’d certainly recommend the book.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-death-and-life-of-great-american-cities-jane-jacobs"><em>The Death and Life of Great American Cities</em>, Jane Jacobs</h2>
<p><img src="/assets/images/2022-01-03-books/death.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 458 pages, published in 1961.</p>
<p>As a relatively new New Yorker who geeks out on trains, loves walking long distances around the city, and has opinions on housing and density in cities, Jane Jacobs was a must-read, and I really enjoyed her polemic on urban planning.
Writing in <a href="https://en.wikipedia.org/wiki/Robert_Moses">Robert Moses’s</a> NYC, where neighborhoods were routinely razed without the consent of their inhabitants to build freeways, Jacobs contends that centralized urban planning fails to create lively and safe neighborhoods and argues for mixed-use development and public input.
She argues that cities and neighborhoods are dynamic and a lack of respect for their social fabric (by, say, replacing a dense immigrant neighborhood with an active street life and a mix of apartments and shops with a sterile apartment building surrounded by lawn) harms residents and raises crime.
She criticizes central planners like Moses for failing to understand the dynamics of the neighborhoods they disrupt, and suggests that mixed-use development, where houses, storefronts, schools, and restaurants coexist, keeps neighborhoods safe by having a variety of people passing through at different times of day for different purposes.</p>
<p>It was particularly fun to listen to this as an audiobook while walking around the city. She frequently singles out particular blocks, parks, and neighborhoods in New York for praise or criticism, and often they’re similar enough 60 years later for me to see her thoughts with my own eyes.
Her arguments on mixed-use development are very prescient, and many modern developments in NYC and other cities obey the principles she lays out.
At the same time, her arguments for local control and for accounting for the voices of neighbors have arguably been too successful; today, vocal neighborhood groups weaponize hearings and public comment windows to prevent cities from building projects that would improve public transportation and increase affordability.</p>
<p>I thought there was an interesting overlap between Jacobs and Taleb’s books.
Jacobs’ arguments against projects that serve only a single function and splinter neighborhoods can be framed as a criticism of Moses-style urban planning as being too “fragile.”
Her advocacy for multi-use projects claims that having housing and schools and stores and restaurants in the same neighborhoods prevents any blind-spots in safety that occur when only one function is present; this seems to mesh nicely with Taleb’s strategies for avoiding susceptibility to uncommon events.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-new-me-halle-butler"><em>The New Me</em>*, Halle Butler</h2>
<p><img src="/assets/images/2022-01-03-books/new-me.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2020.</p>
<p>This was the first book club book I read this year, and it’s a bit of an odd one.
It follows Millie, a depressed 30-something Millennial woman living in a city and working a temp job.
She suffers from extreme burnout, has terrible hygiene, and hates her friends.
The book documents the sense of unfulfilled expectations and directionlessness faced by privileged young adults who were told they were special.
Some of the best scenes involved Millie extrapolating how her coworkers live based on a few observations.
It also does a good job documenting the cyclic process of seeking out life changes that will make one happier, succeeding temporarily in making those improvements, and then losing hold of them and sinking back into depression.
If you can accept that Millie will be exhausting and aggravating at times, it’s a nice read.</p>
<p>I haven’t read <em>My Year of Rest and Relaxation</em>, but apparently this is very similar.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-midnight-library-matt-haig"><em>The Midnight Library</em>*, Matt Haig</h2>
<p><img src="/assets/images/2022-01-03-books/midnight.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 193 pages, published in 2019.</p>
<p>A young woman finds nothing in her life to be going her way–her cat has died; her family is estranged; she is fired; her love life is non-existent; and her dreams of being a rockstar, a glaciologist, or a swimmer are unmet–and she decides to take her own life.
Instead of dying, however, she finds herself in a library where she has the opportunity to visit alternate versions of her life had she made different decisions at different points in her life.
The result is a series of anecdotes of her experiencing different versions of herself and observing what changes and what stays the same.
The book is a little predictable and can be saccharine at times, but it’s a lovely exploration of the main character, and it has a satisfying ending.</p>
<p>Personally, I was annoyed by the parts of the book that tried to invoke multiverse theory as an explanation for her ability to explore the different timelines; multiverses seem so overused at this point, and I think the author could have simply presented her ability to transfer between lifetimes as fact without getting all pseudo-scientific. But that’s just a silly thing that bugs me; overall, I really liked the book.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="pachinkodagger-min-jin-lee"><em>Pachinko</em>\(^\dagger\), Min Jin Lee</h2>
<p><img src="/assets/images/2022-01-03-books/pachinko.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 496 pages, published in 2017.</p>
<p><em>Pachinko</em> follows four generations of a Korean family who immigrate to Japan, spanning several decades around World War II.
The book was historically informative for me; I’d had no idea of the scale of the atrocities Japan committed against Koreans, or of the extent of the racism that ethnically Korean people living in Japan face.
The book primarily follows the life of Sunja, who as a teenager has an affair with a wealthy older man and becomes pregnant, but refuses to go with him when she learns he is married.
She instead marries a poor Christian missionary, who emigrates to Japan, and the novel follows her challenges to live in Japan and the divergent paths of her two sons.
Lee’s characters are very compelling and she tells a great epic about coming of age and power dynamics in a land where one does not belong.
The title refers to a Japanese gambling game whose parlors are often owned and operated by Koreans; pachinko comes to represent both how Koreans living in Japan can rise in a foreign society and the cultural barriers that prevent any real form of equality.</p>
<p>I was completely immersed by this book, and I’d strongly recommend it.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="a-burning-megha-majumdar"><em>A Burning</em>, Megha Majumdar</h2>
<p><img src="/assets/images/2022-01-03-books/burning.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2020.</p>
<p>The novel is set in India amidst rising Hindu nationalism (as there is today under Modi) and follows three characters enmeshed in different ways in a terrorist attack that destroyed a train and killed numerous people.
Jivan is a Muslim girl from a poor background struggling to get ahead who is baselessly framed as an accomplice of the attack.
PT Sir is a P.E. teacher who once taught Jivan and sees a pathway to a more notable life for himself by working in a Hindu nationalist political party that aims to capitalize politically on Jivan’s case.
Lovely is a singer who is a <a href="https://en.wikipedia.org/wiki/Hijra_(South_Asia)">hijra</a> and has proof of Jivan’s innocence.
The book is an interesting exploration of their intersecting stories and what happens when getting ahead requires moral compromises.
I found the book to be a good read; the politics of it were a bit blunt and in-your-face, but the characterizations were good.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="new-york-2140-kim-stanley-robinson"><em>New York 2140</em>*, Kim Stanley Robinson</h2>
<p><img src="/assets/images/2022-01-03-books/new-york.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 624 pages, published in 2017.</p>
<p>The title here is pretty literal; the book imagines New York City in 2140 after being ravaged by climate change and having sea levels rise fifty feet.
After neglecting climate change for years, humans in Robinson’s book are shocked by the rapidness of sea level rise and are forced to rapidly adapt when it happens.
The main characters live in the famous MetLife building in Midtown, which is partially submerged.
People navigate primarily by boat and buildings frequently collapse due to structural damage, while the parts of the city that are above water (like my own neighborhood Morningside Heights) have become enclaves for extremely rich real estate developments.
As with <em>Diamond Age</em> and many other sci-fi books, the concepts developed are often much more interesting than the characters, who often seem like mouthpieces for ideologies rather than three-dimensional people.
However, the concepts explored are fascinating (amphibious people living mostly underwater, the plight of climate refugees, reality stars trying to save polar bears from extinction), and Robinson seems to have done his homework on the science.</p>
<p>The book’s message is a broader critique of capitalism that goes beyond its failure to slow climate change.
Robinson’s world is one where everything is financialized (a character makes his living betting on the level of sea level rise in different locations), where the division of wealth is even more extreme, where private security guards who protect assets carry more power than police, and where there are very few jobs in the “real” economy but tons of lawyers and investors.
He focuses heavily on the 2008 financial crisis (which breaks the immersion a bit, since it seems unrealistic for characters to fixate on something from 132 years earlier), and the characters work toward their goals through a tenants’ union and, ultimately, electoral organizing.
All in all, an interesting (albeit, dense and occasionally dry) book that’s more a critique of finance than of fossil fuels.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="being-mortaldagger-atul-gawande"><em>Being Mortal</em>\(^\dagger\), Atul Gawande</h2>
<p><img src="/assets/images/2022-01-03-books/being.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 282 pages, published in 2014.</p>
<p>Twice, I’d told a friend in medical school that I’d read <a href="https://www.goodreads.com/en/book/show/25899336-when-breath-becomes-air"><em>When Breath Becomes Air</em></a>, and twice, they’d told me that <em>Being Mortal</em> provides a better meditation on medicine and mortality.
I finally decided to read it, and I found it fascinating.
Gawande focuses on the challenges posed by aging and criticizes how many of our life decisions focus on extending life at the cost of independence, comfort, and individuality.
He is particularly critical of how many decisions about aging and death have shifted from being nuanced cultural discussions to purely medical choices focused on a patient’s survival over their own desires; fortunately, he believes the tide to be turning and finds promise in recent changes to elder care and hospice care.
He makes his case by presenting a wide range of anecdotes about aging and dying people alongside well-researched arguments.
While most of the book is not a memoir, the book closes with a very compelling discussion on his own father’s process of dying and the delicate decisions that were made by his father, his family, and his doctors.</p>
<p>Quotes from books rarely stick with me, but one from this book did, encapsulating the challenges we often face with aging and death: “We want autonomy for ourselves, but safety for those we love.”
I appreciated the book because it grapples with the complexity of the questions posed by aging and dying without claiming to have easy answers to them.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-man-who-mistook-his-wife-for-a-hat-oliver-sacks"><em>The Man Who Mistook His Wife for a Hat</em>*, Oliver Sacks</h2>
<p><img src="/assets/images/2022-01-03-books/man.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 243 pages, published in 1985.</p>
<p>The book consists of a series of essays about patients the author has treated as a neurologist.
In one essay, a patient can no longer create new memories, and he remembers none of the previous several decades; his memory of his life during World War II remains sharp, and he does not realize that he and his family have aged.
In another, a music teacher has visual agnosia and fails to recognize objects, despite having working vision (and hence mistakes his wife for a hat); the essay is about finding a way to focus on the music that brings him joy despite the loss of visual capabilities.
Most of the stories are more about the humanity of those who have these rare conditions, rather than a rigorous medical discussion.</p>
<p>The book club meeting focused a lot on the dissonance we felt while reading the book; for the time period when Sacks published the book, the essays are a far cry from many other works that would ridicule or dehumanize these people.
However, Sacks’s language hasn’t aged well, and certain words and descriptions now come across as overly harsh.
While the book at times read as crass to our group, we recognized that it was still a strong step forward for its time.</p>
<p>I was particularly affected by one of the later essays, called “The Twins,” which focuses on two autistic twins with extremely sophisticated mathematical intuitions.
The two could memorize enormous numbers and factor those numbers in their heads.
They often communicated by sharing prime numbers with one another, which conveyed actual messages and emotions.
They appeared to have a deep intuitive understanding of prime numbers that almost everyone else lacks.
As a mathematician (sorta) myself, it made me reflect upon how rudimentary my mathematical intuition is, and how poorly trained my brain is to do the job I have.
I can keep maybe 4 or 5 variables in my head at once and visualize no more than three dimensions; these twins he profiles clearly had a richer intuition that could not be explained, and I found myself thinking of how limited my own perception of math is by the way my brain is configured.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="guns-germs-and-steel-jared-diamond"><em>Guns, Germs, and Steel</em>, Jared Diamond</h2>
<p><img src="/assets/images/2022-01-03-books/guns.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 498 pages, published in 1997.</p>
<p>Okay, so this is a “big history” book (like <em>Sapiens</em>) where an expert in one field comes up with a big unifying theory that claims to explain the main trends in history from <em>homo habilis</em> to 2022.
Despite the pitfalls of genre–the oversimplification of complex events to fit a narrative that can be presented in 500 pages–I think the book’s core argument is a good one.
Diamond argues that geography is central to the nature and extent of civilizational growth in different regions and helps explain why certain civilizations historically dominated others.
I found the most interesting parts of the book to be his discussions of agriculture: that annual grasses are the easiest plants to domesticate, that there are relatively few animals that can be domesticated at all, and that the selection of domesticable plants in the Americas and in Australia was too nutrient-poor to make it possible to abandon the hunter-gatherer way of life.
Given Diamond’s background in evolutionary biology, geography, and physiology, I’m most persuaded by his arguments about the role nearby plant and animal species played in the development of early agricultural cultures and their ability to form empires.</p>
<p>Parts of the book seemed rather obvious to me, but perhaps that speaks to its success.
My high school history teachers often assigned this book to classes, so its ideas about geographic influence on history were likely already indirectly incorporated by my teachers into lessons.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="drive-your-plow-over-the-bones-of-the-deaddagger-olga-tokarczuk-reread"><em>Drive Your Plow over the Bones of the Dead</em>\(^\dagger\), Olga Tokarczuk (reread)</h2>
<p><img src="/assets/images/2022-01-03-books/drive.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 318 pages, published in 2009.</p>
<p>This was the first book I read for my book club in 2019, and I revisited it while traveling in Colorado for a conference this summer.
<em>Drive Your Plow</em> follows Janina, an older Polish woman who maintains houses in a small village, cares for wild animals, translates William Blake, studies astrology, teaches English to children, cooks vegetarian food, and builds up an entourage of misfit adults.
She’s a bit of an unreliable narrator who periodically omits key details, but she’s easy to love nonetheless.</p>
<p>The book reflects on militant vegetarianism and the invisibility of older women.
It does a great job of highlighting the anguish one must feel when all others are seemingly blind to what is transparently unjust.
The novel defies genre and lurches between cozy moments and grisly murders (a character is eaten alive by beetles).
I spent my first read trying to figure out whether the book was magical realism or not, and the ambiguity made it a more gripping read.
The translation from Polish is also incredible; there’s a particular scene where Janina is attempting to figure out the right Polish phrase for an English passage, which means the translator must have come up with multiple plausible English translations of Polish translations of an English text.</p>
<p>I liked it well enough my first time through, but I thoroughly enjoyed it the second.
Knowing the general contours of the plot made it possible for me to catch more of her biting and witty observations of those around her and to better understand the nature and intensity of Janina’s rage.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="a-brief-history-of-seven-killings-marlon-james"><em>A Brief History of Seven Killings</em>*, Marlon James</h2>
<p><img src="/assets/images/2022-01-03-books/killings.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 688 pages, published in 2014.</p>
<p>Before performing a peace concert in Jamaica in 1976, Bob Marley was shot by Jamaican gangsters in his house.
He survived the gunshots and performed the concert two days later.
The novel is divided into five sections, each covering the events of a single day between 1976 and 1991 in Jamaica and NYC, from the perspectives of the gangsters involved, a lover of Marley’s, a CIA agent, and an American reporter.</p>
<p>The book is seriously gruesome.
People are beheaded, shot in broad daylight, and die of overdoses.
A man is buried alive, and the chapter is narrated from the perspective of the man buried.
Some of the plotlines meander away from the core plot: one of the gangsters has anonymous sex with men while denying his homosexuality; a Jamaican-born caregiver exchanges raunchy jokes with the old white man she works for.
It’s at times hard to keep track of all of the characters, but it’s a pretty neat tapestry of the far-reaching effects of a single day once you can follow it.
It’s a commitment to read, but the book has fascinating characters and touches on a wide range of themes if you can stomach it.</p>
<p>As someone who didn’t know much of anything about Jamaica prior to reading the book, I found it highly educational, especially with respect to Jamaican slang; you’ll encounter words like “bomboclat” frequently, and I think it’s easier to listen to the audiobook, where most chapters are narrated by people with Jamaican accents.
Also, this book introduced me to <a href="https://en.wikipedia.org/wiki/Griselda_Blanco">Griselda Blanco</a>, who has one of the crazier Wikipedia pages I’ve read.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="red-white-and-royal-blue-casey-mcquiston"><em>Red, White, and Royal Blue</em>, Casey McQuiston</h2>
<p><img src="/assets/images/2022-01-03-books/red.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 421 pages, published in 2019.</p>
<p>After reading 700 pages of Jamaican gangsters murdering people in cold blood, I needed something light and cute to devour and bring my mood back.
Enter <em>Red, White, and Royal Blue</em>, a cute gay romance book about the biracial bisexual son of the first female American president falling in love with a prince of England.
Sure, it was peak Trump-era liberal what-if fantasy escapism; sure, it was cheesy and overly cutesy at times; and yeah, some of the politics seemed a little over-simplified. But who cares?
It served its purpose, the characters were fun, and it was a nice escape from the brutality of <em>Seven Killings</em> and from the stress of the real world.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-smallest-light-in-the-universedagger-sarah-seager"><em>The Smallest Light in the Universe</em>\(^\dagger\), Sarah Seager</h2>
<p><img src="/assets/images/2022-01-03-books/smallest.jpg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 308 pages, published in 2020.</p>
<p>One of my high school teachers recommended this memoir about an MIT physicist’s search for exoplanets and her grief over her husband’s death from cancer.
Both components of the book and their intersections are excellently written.
As someone who knows extremely little about physics, but who’s always been vaguely interested in space, I enjoyed Seager’s descriptions of the science used to evaluate whether an exoplanet could feasibly support life and of the intensity of her drive to pursue that work.</p>
<p>Her discussion of her husband’s death and the grief of Seager and her two young sons was remarkable in its vulnerability and openness.
I found myself gratefully admiring Seager for her resilience, but also for her ability to be honest about times when she was less resilient.
She captures moments that many of us might be too uncomfortable to bare to our peers, let alone the whole world: her coping mechanisms, her over-reliance on others to help her complete simple tasks, the internal turmoil of falling in love again, the stresses of traveling as a single parent.
I appreciate that she shared her journey and has the humility, which many other academics lack, to show her shortcomings.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="exhalationdagger-ted-chiang"><em>Exhalation</em>*\(^\dagger\), Ted Chiang</h2>
<p><img src="/assets/images/2022-01-03-books/exhalation.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 352 pages, published in 2019.</p>
<p>I’d been itching to read something by Ted Chiang for a while, and I finally got the chance with a book club book.
<em>Exhalation</em> is a collection of sci-fi short stories that hits on a variety of themes on technology.</p>
<p>In “The Lifecycle of Software Objects,” a woman is employed by a software company to train and socialize “digients,” AI animals with high capabilities for intelligence.
While reading it, I kept thinking it was going to fall for the standard tropes of books on AI: When will the digients go rogue and kill their handlers? Are they going to suddenly be declared “conscious” by some fuzzy definition?
Instead, the story focuses on the relationships among handlers and between handlers and digients and their adaptation to less global issues: What happens when everyone else wants to migrate to a new digital platform which is incompatible with the digients? Is it acceptable to make copies of digients and use the copies for less savory purposes (e.g. sex work)?</p>
<p>In other stories, he considers machines that provide windows between parallel universes, juxtaposes the development of retinal recordings with the introduction of the written word to an illiterate tribe, explores time travel in a medieval Islamic context (where those who pass through the gates are more accepting of fate and avoid meddling), and makes the case that we ignore amazing forms of life at home (like parrots) while aspiring to find extraterrestrial life.
Unlike the two other works of sci-fi on this list, Chiang succeeds in marrying thought-provoking concepts with interesting characters.
I’m looking forward to reading his other anthology, <em>Stories of Your Life and Others.</em></p>
<p><a href="#">[top of page]</a></p>
<h2 id="fun-homedagger-alison-bechdel"><em>Fun Home</em>*\(^\dagger\), Alison Bechdel</h2>
<p><img src="/assets/images/2022-01-03-books/fun.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Non-fiction, 240 pages, published in 2006.</p>
<p>In this graphic novel, Alison Bechdel (as in the <a href="https://en.wikipedia.org/wiki/Bechdel_test">Bechdel test</a>) examines her childhood and the nuances of her relationship with her troubled father.
Both Bechdel and her father are gay, but Bechdel’s father remained closeted and had sexual relationships with teenage boys before taking his life not long after she came out to her parents.
She highlights the complexities of their relationship and the generational differences between gay people who came of age in the 50s and the 70s; she wonders whether her father’s fate would have been different if he had been born later and expresses sympathy without qualifying her criticism of his illicit actions.</p>
<p>I appreciated her thoughtful exploration of her fragmented family and found myself thinking more about the questions she asked about generations. I’d unconsciously grouped LGBTQ people into only two groups: those who were of age during the AIDS epidemic and those who were not; this book made me think more about the intense repression faced by those in my grandparents’ generation.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-vegetarian-han-kang"><em>The Vegetarian</em>*, Han Kang</h2>
<p><img src="/assets/images/2022-01-03-books/vegetarian.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 188 pages, published in 2016.</p>
<p>This might just be the most bizarre novel I’ve ever read.
It starts with a decision by a Korean woman, Yeong-hye, to give up meat and become a vegetarian.
Her husband and father respond to her rebellion in a shockingly harsh manner, given how widespread vegetarianism is.
The first section of the book is narrated by her husband, the second by her brother-in-law, and the third by her sister.
It’s about passivity, it’s about abuse, it’s about a lack of respect for female autonomy.
It’s also about becoming a plant and sex scenes where both participants are covered in painted flowers.
I won’t claim I fully understood this one, but it’s an interesting exploration of Yeong-hye’s strange ways of taking control of herself.
It’s really not about vegetarianism at all.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-plague-albert-camus"><em>The Plague</em>, Albert Camus</h2>
<p><img src="/assets/images/2022-01-03-books/plague.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 308 pages, published in 1947.</p>
<p>I waited long enough into the pandemic before reading <em>The Plague</em>, with the hope that I could pick up on its prescience for COVID without it hitting too close to home.
It tells the story of a North African town overrun by a plague first heralded by dying rats, and follows Rieux, a doctor whose commitment to treating the infected is so strong that adopting a fatalist attitude is unimaginable to him.
Much of the book is about individual powerlessness in the face of an epidemic that acts according to its own whims; while Rieux’s choices to treat others are obvious to him, others struggle more to find meaning amidst the suffering and the arbitrariness of the plague’s killing.
Camus is critical of handling forces outside our control (like a plague) by embracing absurdity; rather, one should fight it even against odds of failure.</p>
<p>Naturally, much of it felt very familiar from the perspective of COVID. The journalist’s desperation for an excuse to leave the city rings true for those of us who had voices in our head trying to rationalize why the lockdown ought not apply to us.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="the-remains-of-the-daydagger-kazuo-ishiguro"><em>The Remains of the Day</em>\(^\dagger\), Kazuo Ishiguro</h2>
<p><img src="/assets/images/2022-01-03-books/remains.jpg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 258 pages, published in 1989.</p>
<p>This may have been my favorite book of the year.
Ishiguro chronicles the reflections of an aging English butler in the 1950s as the social order he spent his life serving decays and his recently-deceased lord’s name is besmirched.
The events of the book take place over a drive in the countryside taken by the butler as he ponders his past and the morality of his beloved Lord Darlington.
Throughout, he grapples with the question of dignity: Does dignity as a butler require serving a virtuous master? And if so, is it a core duty of a great butler to frequently assess the morals of the lord he serves?
His ruminations on what it means to be a great butler and to have dignity cause him to revisit memories of his father’s own career as a butler, of Lord Darlington’s association with German leaders before WWII, and of his relationship with the housekeeper.
His reflections are clear, thoughtful, and at times humorous, and I often found myself jarred by the gap between their clarity and his awkwardness in actual conversation.</p>
<p>I thoroughly enjoyed the book. It was a pleasure to read, and it fueled questions for my own reflection as I think about what dignity means in the context of my own career.</p>
<p><a href="#">[top of page]</a></p>
<h2 id="conversations-with-friends-sally-rooney"><em>Conversations with Friends</em>, Sally Rooney</h2>
<p><img src="/assets/images/2022-01-03-books/conversations.jpeg" style="width: 30%; padding-right: 20px" align="left" /></p>
<p>Fiction, 304 pages, published in 2017.</p>
<p>Rooney’s novel follows four main characters: Frances, an introspective and sharp writer and college student; Bobbi, her more bombastic and radical friend and former lover; Melissa, a socialite and older journalist who is intrigued by the girls; and Nick, Melissa’s passive husband and an actor.
Frances and Nick’s love affair is the main focus of the plot, but it’s also the only book I’ve read where there’s actually a “love square” with four edges.
The characters were interesting (although irritating at times), and it functioned as a coming-of-age book of sorts for Frances and Bobbi, as they come to better understand the power dynamics of their complex relationships with one another and with the older couple.
The malaise and insecurity it detailed was similar in some ways to that of <em>The New Me</em>, without the extreme dysfunction.</p>
<p>Overall, I’d say it wasn’t a perfect fit for me, but I’m still glad to have read it.</p>
<p><a href="#">[top of page]</a></p>
<p>…</p>
<p><em>Thanks for making it all the way down!
This took a while for me to write. (For 2022, I think I’ll try to write these little summaries as I go, so I’ll remember them better, and so I won’t have to do them all in one day.)
Once again, happy new year, and I look forward to having more content soon (ML or otherwise).</em></p>Clayton SanfordHappy new year! One of my primary goals of 2021 was to create this blog, which I actually managed to achieve. In 2022, I intend to continue writing on this blog and to also post content that can be read by anyone, not just people who do research on machine learning theory. As a first attempt to write something that’ll appeal to non-computer scientists, here’s a quick post listing and commenting on the non-technical books I read last year.How do SVMs and least-squares regression behave in high-dimensional settings? (NeurIPS 2021 paper with Navid and Daniel)2021-12-07T00:00:00+00:002021-12-07T00:00:00+00:00http://blog.claytonsanford.com/2021/12/07/ash21<p>Hello, it’s been a few weeks since I finished my candidacy exam, and I’m looking forward to getting back to blogging on a regular basis.
I’m planning on focusing primarily on summarizing others’ works and discussing what I find interesting in the literature, but I periodically want to share my own papers and explain them less formally.
I did this a few months ago for my first grad-student paper on the approximation capabilities of depth-2 random-bottom-layer neural networks <a href="/2021/08/15/hssv21.html" target="_blank">HSSV21</a>.</p>
<p>This post does the same for <a href="https://proceedings.neurips.cc/paper/2021/hash/26d4b4313a7e5828856bc0791fca39a2-Abstract.html" target="_blank">my second paper</a>, which is on support vector machines (SVMs) and ordinary least-squares regression (OLS) in high-dimensional settings.
I wrote this paper in collaboration with Navid Ardeshir, another third-year PhD student at Columbia studying Statistics, and our advisor, <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>.
It appears at NeurIPS 2021 this week: a talk recorded by Navid is <a href="https://neurips.cc/virtual/2021/poster/27524" target="_blank">here</a>, our paper reviews are <a href="https://openreview.net/forum?id=9bqxRuRwBlu" target="_blank">here</a>, and our poster will be virtually presented on Thursday 12/9 between 8:30am and 10am Pacific time.</p>
<p>I’d love to talk with anyone about this paper, so if you have any questions, comments, or rants, please comment on this post or send me an email.</p>
<h2 id="what-are-ols-and-svms">What are OLS and SVMs?</h2>
<p>The key result of our paper is that two linear machine learning models coincide in the high-dimensional setting.
That is, when the dimension \(d\) is much larger than the number of samples \(n\), the solutions of the two models on the same samples have the same parameters.
This is notable because the models have different structures and appear at first glance to incentivize different kinds of solutions.
It’s also perplexing because the models do not seem to be analogous: OLS is a regression algorithm, while the SVM is a classification algorithm.
We’ll briefly explain what the two models are below and what they mean in the high-dimensional setting.</p>
<p>Both of these models were discussed extensively in <a href="/2021/07/04/candidacy-overview.html" target="_blank">my survey</a> on over-parameterized ML models, and I’ll periodically refer back to some of those paper summaries (and occasionally steal visuals from my past self).</p>
<h3 id="ols-regression-and-minimum-norm-interpolation">OLS regression and minimum-norm interpolation</h3>
<p>The task of ordinary least-squares (OLS) regression is simple: find the linear function (or hyperplane) that best fits some data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}\).
To do so, we learn the function \(x \mapsto w_{OLS}^T x\), where \(w_{OLS}\) solves the following optimization problem, minimizing the squared error between each training label \(y_i\) and its prediction \(w^T x_i\):</p>

\[w_{OLS} \in \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (y_i - w^T x_i)^2.\]
<p>For the “classical” learning regime, where \(d \ll n\), \(w_{OLS}\) can be explicitly computed as \(w_{OLS} = X^{\dagger} y = (X^T X)^{-1} X^T y\), where \(X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}\) and \(y = (y_1, \dots, y_n) \in \mathbb{R}^n\) collect all of the training inputs and labels into a single matrix and vector, and where \(X^{\dagger}\) is the pseudoinverse of \(X\).
When \(X^T X \in \mathbb{R}^{d \times d}\) is invertible (which is typically true when \(d \ll n\), although it may fail when there is a lot of redundancy in the features and the columns of \(X\) are colinear), this corresponds to the unique minimizer of the above optimization problem.
Intuitively, this corresponds to choosing the linear function that most closely approximates the labels of the samples, but one that will likely not fit the data perfectly.</p>
<p>As discussed in my blog posts on <a href="/2021/07/05/bhx19.html" target="_blank">BHX19</a> and <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, this works well in the under-parameterized regime, but it’s not directly obvious how one should choose the best parameter vector \(w_{OLS}\) in the over-parameterized regime \(d \gg n\), since there are many parameter vectors that result in zero training error.
These papers choose the vector by considering <em>minimum-norm interpolation</em> as the high-dimensional analogue to OLS. This entails solving the following optimization problem, which relies on \(XX^T \in \mathbb{R}^{n \times n}\) being invertible, which is typically the case when \(d \gg n\):</p>
\[w_{OLS} \in \arg\min_{w \in \mathbb{R}^d} \|w\|_2 \ \text{such that} \ w^T x_i = y_i \ \forall i \in [n].\]
<p>In other words, it chooses the hyperplane with the smallest weight norm (which we can think of as the “smoothest” hyperplane or the hyperplane with smallest slope) that perfectly fits the data. Conveniently, this hyperplane is also found by using the pseudo-inverse of \(X\): \(w_{OLS} = X^{\dagger} y = X^T (X X^T)^{-1} y\).
As a result, we (and numerous papers that consider over-parameterized linear models) consider this minimum-norm interpolation problem to be the high-dimensional version of OLS, which allows OLS to be defined with the same pseudo-inverse solution for all choices of \(n\) and \(d\).</p>
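<p>The following numpy sketch (the sizes and names are my own) verifies the claims above in the over-parameterized regime: the pseudo-inverse solution interpolates the data, and adding any null-space direction to it only increases its norm.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 200  # over-parameterized: d >> n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Min-norm interpolant w = X^T (X X^T)^{-1} y, valid when X X^T is invertible.
w = X.T @ np.linalg.solve(X @ X.T, y)

assert np.allclose(X @ w, y)                  # perfectly fits the data
assert np.allclose(w, np.linalg.pinv(X) @ y)  # same as the pseudoinverse solution

# Any other interpolant w + v (with v in the null space of X) has larger norm,
# since w lies in the row space of X and is orthogonal to v.
z = rng.standard_normal(d)
v = z - X.T @ np.linalg.solve(X @ X.T, X @ z)  # project z onto null(X)
assert np.allclose(X @ (w + v), y)
assert np.linalg.norm(w + v) >= np.linalg.norm(w)
```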
<p>Notably, these other papers show that high-dimensional OLS can have good generalization under certain distributional assumptions, despite the fact that classical generalization bound approaches (like VC-dimension) suggest that models with more parameters than samples are likely to fail.
These results are a big part of the inspiration for this project and motivate the study of high-dimensional linear regression.</p>
<h3 id="support-vector-machines">Support vector machines</h3>
<p>SVMs solve a classification problem, rather than a regression problem, which means that a training sample \((x_i, y_i)\) can be thought of as belonging to \(\mathbb{R}^d \times \{-1, 1\}\).
Here, the goal is to learn a linear classifier of the form \(x \mapsto \text{sign}(w_{SVM}^T x)\) that <em>decisively</em> classifies every training sample.
That is, we want \(y_i w_{SVM}^T x_i\) to be bounded away from zero for every sample.
This follows the same motivation as the generalization bounds on <a href="/2021/10/20/boosting.html" target="_blank">boosting the margin</a>; decisively categorizing each training sample makes it hard for the chosen function to be corrupted by the variance of the training data. It also requires the assumption that the training data are linearly separable.</p>
<p>This high-level goal for a classifier (called the <em>hard-margin SVM</em>) can be encoded as the following optimization problem, which asks that \(w_{SVM}\) be the lowest-magnitude classifier that separates the samples from the decision boundary by distance at least one:</p>
\[w_{SVM} \in \arg\min_{w \in \mathbb{R}^d} \|w\|_2 \ \text{such that} \ y_i w^T x_i \geq 1 \ \forall i \in [n].\]
<p>By stealing an image from my past blog post, we can visualize a classifier that attains the maximum margin.</p>
<p><img src="/assets/images/2021-10-28-cl20/margin.jpeg" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" /></p>
<p>A key feature of SVMs is that the classifier can also be defined by a subset of the training samples: the ones that lie exactly on the margin, i.e. those with \(w_{SVM}^T x_i = y_i\).
These are called the <em>support vectors</em>.
If \(x_1, \dots, x_k\) are the support vectors of \(w_{SVM}\), then \(w_{SVM} = \sum_{i=1}^k \alpha_i x_i\) for some \(\alpha \in \mathbb{R}^k\). Traditionally, bounds on the generalization powers of SVMs depend on the number of support vectors: fewer support vectors means an intrinsically “simpler” model, which indicates a higher likelihood that the model is robust and generalizes well to new data.</p>
<h3 id="support-vector-proliferation-or-ols--svm">Support vector proliferation, or OLS = SVM</h3>
<p>By looking back at the two optimization problems for high-dimensional OLS and SVMs, the two are actually extremely similar.
In the case where the OLS problem has binary labels \(\{-1, 1\}\), the two are exactly the same, except that the SVM problem has inequality constraints and OLS has equality.
Therefore, if the optimal SVM solution \(w_{SVM}\) satisfies every inequality constraint with equality, then \(w_{SVM} = w_{OLS}\).
Because a constraint is satisfied with equality if and only if the corresponding sample is a support vector, \(w_{SVM} = w_{OLS}\) if and only if every training sample is a support vector.
We call this phenomenon <em>support vector proliferation</em> (SVP) and explore it as the primary goal of our paper.
Our contributions involve studying when SVP occurs and when it does not, which has implications for SVM generalization and the high-dimensional behavior of both models.</p>
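<p>To make the equivalence concrete, here’s a small numpy check (the helper name and the toy samples are my own, not from the paper) that decides whether SVP occurs without running an SVM solver, using the KKT conditions of the hard-margin SVM.</p>

```python
import numpy as np

def svp_occurs(X, y):
    """Return True iff the min-norm interpolator of (X, y) is also the
    hard-margin SVM solution, i.e. every sample is a support vector.

    The interpolator w = X^T beta with beta = (X X^T)^{-1} y meets every SVM
    constraint with equality (y_i w^T x_i = y_i^2 = 1); by the KKT conditions
    of this convex problem, it is the SVM solution exactly when its dual
    coefficients alpha_i = y_i * beta_i are all positive."""
    beta = np.linalg.solve(X @ X.T, y)
    return bool(np.all(y * beta > 0))

# Orthogonal samples: the interpolator is a positive combination of the
# label-signed samples, so SVP occurs.
assert svp_occurs(np.eye(2), np.array([1.0, 1.0]))

# Two nearly parallel samples with the same label: one dual coefficient goes
# negative, so the SVM keeps only one of them as a support vector.
assert not svp_occurs(np.array([[1.0, 0.0], [0.9, 0.1]]), np.array([1.0, 1.0]))
```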
<h3 id="why-care-about-svp-and-what-is-known">Why care about SVP and what is known?</h3>
<p>The study of support vector proliferation has previously provided bounds on generalization behavior of high-dimensional (or over-parameterized) SVMs, and our tighter understanding of the phenomenon will make future bounds easier.
In particular, the paper <a href="/2021/11/04/mnsbhs20.html" target="_blank">MNSBHS20</a> (which includes Daniel as an author) bounds the generalization of high-dimensional SVMs by (1) using SVP to relate SVMs to OLS and (2) showing that OLS with binary outputs has favorable generalization guarantees under certain distributional assumptions, similar to those of <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>.
Specifically, they show that SVP occurs roughly when \(d = \Omega(n^{3/2} \log n)\) for the case where the variances of each feature are roughly the same.</p>
<p>This paper does not answer how tight the phenomenon is, leaving open the question of when (as a function of \(d\) and \(n\)) will SVP occur and when will it not.
This question was partially addressed in a follow-up paper, <a href="https://arxiv.org/abs/2009.10670" target="_blank">HMX21</a> by Daniel, Vidya Muthukumar, and Mark Xu.
They show, roughly, that SVP occurs (for a broad family of data distributions) when \(d = \Omega(n \log n)\) and that it does not occur (for a narrow family of distributions) when \(d = O(n)\), leaving open a logarithmic gap.
Our paper closes this gap and considers a broader family of data distributions.</p>
<p>SVM generalization has also been targeted by <a href="/2021/10/28/cl20.html" target="_blank">CL20</a> and others using approaches that rely not on SVP but on the relationship between SVMs and gradient descent. Specifically, they rely on a fact from <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a>, which shows that gradient descent applied to logistic losses converges to a maximum-margin classifier.
This heightens the relevance of support vector machines, since more sophisticated models may trend towards the solutions of hard-margin SVMs when trained with gradient methods.
Thus, our exploration of SVP and how it relates minimum-norm and maximum-margin models may have insights about the high-dimensional behavior of other learning algorithms that rely on implicit regularization.</p>
<h2 id="what-do-we-prove">What do we prove?</h2>
<p>Before jumping into our results, we introduce our data model and explain what HMX21 already established in that setting.</p>
<h3 id="data-model">Data model</h3>
<p>We consider two settings, each of which has independent random features for each \(x_i\) and fixed labels \(y_i\).</p>
<p><strong>Isotropic Gaussian sample:</strong> For fixed \(y_1, \dots, y_n \in \{-1, 1\}\), each sample \(x_1, \dots, x_n \in \mathbb{R}^d\) is drawn independently from a multivariate spherical (or isotropic or standard) Gaussian \(\mathcal{N}(0, I_d)\).</p>
<p><strong>Anisotropic subgaussian sample:</strong> For fixed \(y_1, \dots, y_n \in \{-1, 1\}\), each sample \(x_i\) is defined to be \(x_i = \Sigma^{1/2} z_i\), where each \(z_i\) is drawn independently from a 1-subgaussian distribution with mean zero and \(\Sigma\) is a diagonal covariance matrix with entries \(\lambda_1 > \dots > \lambda_d\). Hence, \(\mathbb{E}[x_i] = 0\) and \(\mathbb{E}[x_i x_i^T] = \Sigma\).</p>
<p>If the latter model has a Gaussian distribution, then \(\Sigma\) can be permitted to be any positive definite covariance matrix with eigenvalues \(\lambda_1, \dots, \lambda_d\) due to the rotational symmetry of the Gaussian.</p>
<p>We consider the regime \(d \gg n\) in order to ensure that the data are linearly separable with extremely high probability, which is acceptable because the paper is focused on the study of the over-parameterized regime.</p>
<p>The anisotropic data model requires using dimension proxies rather than \(d\) on occasion, because the rapidly decreasing variances could cause the data to have a much smaller effective dimension. (Similar notions are explored in HMX21 and over-parameterization papers like BLLT19.)
We use two notions of effective dimension: \(d_\infty = \frac{\|\lambda\|_1}{\|\lambda\|_\infty}\) and \(d_2 = \frac{\|\lambda\|_1^2}{\|\lambda\|_2^2}\). Note that \(d_\infty \leq d_2 \leq d\).</p>
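<p>The two proxies are easy to compute; this short numpy sketch (the helper name and example spectra are my own) shows that both recover \(d\) in the isotropic case and shrink far below \(d\) when a few directions dominate the spectrum.</p>

```python
import numpy as np

def effective_dims(lam):
    """d_inf = ||lam||_1 / ||lam||_inf and d_2 = ||lam||_1^2 / ||lam||_2^2
    for a vector of nonnegative covariance eigenvalues."""
    lam = np.asarray(lam, dtype=float)
    d_inf = lam.sum() / lam.max()
    d_2 = lam.sum() ** 2 / (lam ** 2).sum()
    return d_inf, d_2

# Isotropic case: both proxies recover the ambient dimension d.
d_inf, d_2 = effective_dims(np.ones(100))
assert d_inf == 100 and d_2 == 100

# One dominant direction shrinks the effective dimension far below d = 100,
# and the ordering d_inf <= d_2 <= d holds.
d_inf, d_2 = effective_dims([100.0] + [1.0] * 99)
assert d_inf < d_2 < 100
```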
<h3 id="contributions-of-hmx21">Contributions of HMX21</h3>
<p>HMX21 proves two bounds: an upper bound on the SVP threshold for an anisotropic subgaussian sample and a lower bound on the SVP threshold for an isotropic gaussian sample.</p>
<p><em><strong>Theorem 1</strong> [HMX21]: For an anisotropic subgaussian sample, if \(d_\infty = \Omega(n \log n)\), then SVP occurs with probability at least \(0.9\).</em></p>
<p><em><strong>Theorem 2</strong> [HMX21]: For an isotropic Gaussian sample, if \(d = O(n)\), then SVP occurs with probability at most \(0.1\).</em></p>
<p>This leaves open two obvious technical questions, which we resolve: closure of the \(n\) vs \(n \log n\) gap and generalization of Theorem 2 to handle the anisotropic subgaussian data model. We give these results, and a few others about more precise thresholds, in the next few sections.</p>
<h3 id="result-1-closing-the-gap-for-the-isotropic-gaussian-case">Result #1: Closing the gap for the isotropic Gaussian case</h3>
<p>We close the gap between the two HMX21 bounds by showing that the critical SVP threshold occurs at \(\Theta(n \log n)\).
The following is a simplified version of our Theorem 3, which is presented in full generality in the next section.</p>
<p><em><strong>Theorem 3</strong> [Simplified]: For an isotropic Gaussian sample, if \(d = O(n \log n)\) and \(n\) is sufficiently large, then SVP occurs with probability at most \(0.1\).</em></p>
<p>In the version given in the paper, there is also a \(\delta\) variable to represent the probability of SVP occurring; for simplicity, we leave this out of the bound in the blog post.</p>
<p>We’ll discuss key components of the proof of this theorem later on in the blog post.</p>
<h3 id="result-2-extending-the-lower-bound-to-the-anisotropic-subgaussian-case">Result #2: Extending the lower bound to the anisotropic subgaussian case</h3>
<p>Our version of Theorem 3 further extends Theorem 2 to the anisotropic subgaussian data model, at the cost of some more complexity.</p>
<p><em><strong>Theorem 3</strong> : For an anisotropic subgaussian sample, if \(d_2 = O(n \log n)\), \({d_\infty^2}/{d_2} = {\|\lambda\|_2^2}/{\|\lambda\|_\infty^2} = \Omega(n)\), and \(n\) is sufficiently large, then SVP occurs with probability at most \(0.1\).</em></p>
<p>The second condition ensures that the effective number of high-variance features is at least as large as \(n\). If it were not, then a very small number of features would have an outsize influence on the outcome of the problem, making it effectively a low-dimensional problem where the data are unlikely even to be linearly separable.</p>
<p>The first condition is slightly loose in the event that \(d_2 \gg d_\infty\), since Theorem 1 depends on \(d_\infty\) rather than \(d_2\).</p>
<h3 id="result-3-proving-a-sharp-threshold-for-the-isotropic-gaussian-case">Result #3: Proving a sharp threshold for the isotropic Gaussian case</h3>
<p>Returning to the simple isotropic Gaussian regime, we show a clear threshold in the regime where \(n\) and \(d\) become arbitrarily large. Theorem 4 shows that the phase transition occurs precisely when \(d = 2n \log n\) in the asymptotic case. Check out the paper for a rigorous asymptotic statement and a proof that depends on the maximum of weakly dependent Gaussian variables.</p>
<p><em>Note: One nice thing about working with a statistician is that we have different flavors of bounds that we like to prove. As a computer scientist, I’m accustomed to proving Big-\(O\) and Big-\(\Omega\) bounds for finite \(n\) and \(d\) in Theorem 3, while hiding foul constants behind the asymptotic notation. On the other hand, Navid is more interested in the kinds of sharp trade-offs that occur in infinite limits, like those in Theorem 4.
Our collaboration meant we featured both!</em></p>
<h3 id="result-4-suggesting-the-threshold-extends-beyond-that-case">Result #4: Suggesting the threshold extends beyond that case</h3>
<p>While we only prove the location of the precise threshold and the convergence to that threshold for the isotropic Gaussian regime, we believe that it persists for a broad class of data distributions, including some that are not subgaussian. Our Figure 1 illustrates this universality by plotting the fraction of trials on synthetic data in which support vector proliferation occurs when the samples are drawn from each type of distribution.</p>
<p><img src="/assets/images/2021-12-07-ash21/univ.png" alt="" /></p>
<h3 id="conjecture-generalization-to-l_1-and-l_p-models">Conjecture: Generalization to \(L_1\) (and \(L_p\)) models</h3>
<p>We conclude by generalizing the SVM vs OLS problem to different norms and making a conjecture that the SVP threshold occurs when \(d\) is much larger for the \(L_1\) case. For the sake of time, that’s all I’ll say about it here, but check out the paper to see our formulation of the question and some supporting empirical results.</p>
<h2 id="proof-of-result-1">Proof of Result #1</h2>
<p>I’ll conclude the post by briefly summarizing the foundations of our proof of the simplified version of Theorem 3. This was an adaptation of the techniques employed by HMX21 to prove Theorem 2, but it required a more careful approach to handle the lack of independence among a collection of random variables.</p>
<h3 id="equivalence-lemma">Equivalence lemma</h3>
<p>Both papers rely on the same “leave-one-out” equivalence lemma for their upper and lower bounds. We prove a more general version in our paper based on geometric intuition, but I give only the simpler one here.</p>
<p>Let \(y_{\setminus i} = (y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_n) \in \mathbb{R}^{n-1}\) and \(X_{\setminus i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) \in \mathbb{R}^{(n-1) \times d}\).</p>
<p><em><strong>Lemma 1:</strong> Every training sample is a support vector (i.e. SVP occurs and OLS=SVM) if and only if \(u_i := y_i y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i} x_i < 1\) for all \(i \in [n]\).</em></p>
<!-- As a result, SVP occurs (and OLS = SVM) if and only if $$\max_i u_i < 1$$.
-->
<p>This lemma looks a bit ugly as is, so let’s break it down and hazily explain how these \(u_i\) quantities connect to whether a sample is a support vector.</p>
<ul>
<li>Noting that \(X_{\setminus i}^\dagger = X_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1}\) when \(d \gg n\) and referring back to the optimization problems from before, we can write \(y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i} = w_{OLS, i}^T\), where \(w_{OLS, i} = X_{\setminus i}^\dagger y_{\setminus i}\) is the parameter vector obtained by running OLS on all of the training data except \((x_i, y_i)\).</li>
<li>Then, \(u_i = y_i x_i^T w_{OLS, i}\), which is the margin of the “leave-one-out” regression on the left out sample.</li>
<li>If \(u_i \geq 1\), then the OLS classifier on the other \(n-1\) samples already classifies \(x_i\) correctly by at least a unit margin. If \(w_{OLS, i} = w_{SVM, i}\), then it suffices to take \(w_{SVM} = w_{OLS, i}\) without adding a new support vector for \(x_i\) and without increasing the cost of the objective. Hence, the condition means that the partial solution offers proof that not everything needs to be a support vector.</li>
<li>If \(u_i < 1\), then \(x_i\) is not classified to a unit margin by \(w_{OLS, i}\). Therefore, adding \((x_i, y_i)\) back into the training set requires modifying the parameter vector; since the vector would then depend on \(x_i\), \(x_i\) becomes a support vector of \(w_{SVM}\).</li>
</ul>
<p>The remainder of the proof involves considering \(\max_i u_i\) and asking how large it must be.</p>
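<p>Lemma 1 can be checked numerically; in this numpy sketch (my own code, not the paper’s), the leave-one-out margin condition agrees with an independent KKT-style test of whether the min-norm interpolator solves the hard-margin SVM problem.</p>

```python
import numpy as np

def loo_margins(X, y):
    """u_i = y_i <w_OLS^(-i), x_i>: the margin of the leave-one-out min-norm
    interpolator on the held-out sample, as in Lemma 1."""
    n = len(y)
    u = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        u[i] = y[i] * (X[i] @ Xi.T @ np.linalg.solve(Xi @ Xi.T, yi))
    return u

def svp_direct(X, y):
    # Independent SVP check: the min-norm interpolator is the SVM solution
    # iff its dual coefficients y_i * beta_i are all positive (KKT).
    beta = np.linalg.solve(X @ X.T, y)
    return bool(np.all(y * beta > 0))

rng = np.random.default_rng(0)
n, d = 20, 2000
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Lemma 1: SVP occurs exactly when every leave-one-out margin is below one.
assert (loo_margins(X, y).max() < 1) == svp_direct(X, y)
```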
<h3 id="assuming-independence">Assuming independence</h3>
<p>In the Gaussian setting where \(X_{\setminus i}\) is fixed, \(u_i\) is a univariate Gaussian random variable of mean 0 and variance \(y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} X_{\setminus i} X_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1}y_{\setminus i} = y_{\setminus i}^T (X_{\setminus i} X_{\setminus i}^T)^{-1} y_{\setminus i}\).</p>
<p>Because \(\mathbb{E}[x_j^T x_j] = d\), it follows that \(\mathbb{E}[X_{\setminus i} X_{\setminus i}^T] = d I_{n-1}\) and that the eigenvalues of \(X_{\setminus i} X_{\setminus i}^T\) are concentrated around \(d\) with high probability.
As a result, the eigenvalues of \((X_{\setminus i} X_{\setminus i}^T)^{-1}\) are concentrated around \(1/d\), and the variance of \(u_i\) is roughly \(\frac{1}{d} \|y_{\setminus i}\|_2^2 = \frac{n-1}{d}\).</p>
<p>If we assume for the sake of simplicity that \(u_i\) are all independent of one another, then the problem becomes easy to characterize.
It’s well-known the maximum of \(n\) Gaussians of variance \(\sigma^2\) concentrates around \(\sigma \sqrt{2 \log n}\).
Hence, \(\max_i u_i\) will be roughly \(\sqrt{2(n-1)\log(n) / d}\) with high probability.
If \(d = \Omega(n \log n)\), then \(\max_i u_i < 1\) with high probability and SVP occurs; if \(d = O(n \log n)\), then SVP occurs with vanishingly small probability.</p>
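As a sanity check on that "well-known" fact, here's a tiny Monte Carlo experiment (entirely my own, not from the paper):

```python
import numpy as np

# The maximum of n i.i.d. N(0, sigma^2) draws concentrates around
# sigma * sqrt(2 log n); compare an empirical max against the prediction.
rng = np.random.default_rng(1)
n, sigma = 100_000, 1.0
empirical = rng.normal(0.0, sigma, size=n).max()
predicted = sigma * np.sqrt(2 * np.log(n))
print(empirical, predicted)   # the two should be close
```

Plugging in \(\sigma^2 = (n-1)/d\) turns the \(\sigma \sqrt{2 \log n}\) prediction into the \(\sqrt{2(n-1)\log(n)/d}\) expression above.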
<h3 id="overcoming-dependence">Overcoming dependence</h3>
<p>The key problem with the above paragraphs is that the random variables \(u_1, \dots, u_n\) are <em>not</em> independent of one another. They all depend on all of the data \(x_1, \dots, x_n\), and the core technical challenge of this result is to tease apart this dependence.
To do so, we rely on the fact that \(X_{\setminus i} X_{\setminus i}^T \approx d I_{n-1}\) and define a subsample of \(m \ll n\) points to force an independence relationship.
Specifically, we rely on the decomposition \(u_i = u^{(1)}_i + u^{(2)}_i + u^{(3)}_i\) for \(i \in [m]\) where:</p>
<ol>
<li>\(u^{(1)}_i = y_i y_{\setminus i}^T((X_{\setminus i} X_{\setminus i}^T)^{-1} - \frac1d I_{n-1}) X_{\setminus i} x_i\) represents the gap between the gram matrix \(X_{\setminus i} X_{\setminus i}^T\) and the identity.</li>
<li>\(u^{(2)}_i = \frac1d y_i y_{[m] \setminus i}^T X_{[m] \setminus i} x_i\) is the component of the remaining term (\(\frac1d y_i y_{\setminus i}^T X_{\setminus i} x_i\)) that depends exclusively on the subsample \([m]\).</li>
<li>\(u^{(3)}_i = \frac1d y_i y_{\setminus [m]}^T X_{\setminus [m]} x_i\) is the component that depends only on \(x_i\) and on samples <em>outside</em> the subsample. Critically, \(u^{(3)}_1, \dots, u^{(3)}_m\) are independent, conditioned on the data outside the sample, \(X_{\setminus [m]}\).</li>
</ol>
<p>To show that SVP occurs with very small probability, we must show that \(\max_i u_i \geq 1\) with high probability.
To do so, it’s sufficient to show that (1) for all \(i\), \(|u^{(1)}_i| \leq 1\); (2) for all \(i\), \(|u^{(2)}_i| \leq 1\); and (3) \(\max_i u^{(3)}_i \geq 3\). The main technical lemmas of the paper apply Gaussian concentration inequalities to prove (1) and (2), and leverage the independence of the \(u^{(3)}_i\)’s to prove that their maximum is sufficiently large.</p>
<p>This requires somewhat more advanced techniques, such as the <a href="https://en.wikipedia.org/wiki/Berry%E2%80%93Esseen_theorem" target="_blank">Berry–Esseen theorem</a>, for the subgaussian case.</p>
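Before moving on, note that the three-term decomposition is an exact algebraic identity: the remaining term \(\frac1d y_i y_{\setminus i}^T X_{\setminus i} x_i\) splits into its \([m] \setminus \{i\}\) and outside-\([m]\) parts, since those index sets partition \([n] \setminus \{i\}\). Here's a numerical check of that identity (the sizes are my own illustrative choices):

```python
import numpy as np

# Verify u_i = u1_i + u2_i + u3_i exactly, for i in the subsample [m].
rng = np.random.default_rng(2)
n, d, m = 15, 300, 5
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

gaps = []
for i in range(m):
    mask = np.arange(n) != i                  # indices "\ i"
    outside = np.arange(n) >= m               # indices "\ [m]"
    mid = mask & (np.arange(n) < m)           # indices "[m] \ i"
    G_inv = np.linalg.inv(X[mask] @ X[mask].T)
    u_full = y[i] * y[mask] @ G_inv @ (X[mask] @ X[i])
    u1 = y[i] * y[mask] @ (G_inv - np.eye(n - 1) / d) @ (X[mask] @ X[i])
    u2 = y[i] * y[mid] @ X[mid] @ X[i] / d
    u3 = y[i] * y[outside] @ X[outside] @ X[i] / d
    gaps.append(abs(u_full - (u1 + u2 + u3)))

print(max(gaps))   # numerically zero
```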
<h2 id="whats-next">What’s next?</h2>
<p>We think the significance of this result is the tying together of seemingly dissimilar ML models by their behavior in over-parameterized settings. Some immediate follow-ups include investigations into the generalized \(L_p\) SVM and OLS models, but further work could also proceed along the lines of <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a>, by connecting “classical” ML models (like maximum-margin models) to the implicit regularization behavior of more complex models.</p>
<p>Thanks for reading this post! If you have any questions or thoughts (or ideas about what I should write about), please share them with me.</p>
<h1 id="my-candidacy-exam-is-done">My candidacy exam is done! <em>(Clayton Sanford, 2021-11-17)</em></h1>
<p>After working my way through all thirty papers from my <a href="/2021/07/04/candidacy-overview.html" target="_blank">list</a>, I finally took (and passed) my candidacy exam yesterday! I suppose this means that I can finally call myself a PhD candidate, rather than a PhD student.</p>
<p>Anyways, <a href="/assets/files/candidacy-slides.pdf" target="_blank">here</a> are the slides I made for the presentation. At the end, there are thirty appendix slides, each of which gives a one-slide summary of a paper from the list.</p>
<p>I’m planning on continuing to blog from here, but not in such a structured fashion as I did with my OPML series.
I’m considering having some kind of weekly newsletter that gives quick recaps of papers I’ve read and things I find interesting, along with periodic longer posts about particularly neat papers, my own work, or personal stuff.
Thanks for reading, and stay tuned!</p>
<h1 id="opml10-mnsbhs20">[OPML#10] MNSBHS20: Classification vs regression in overparameterized regimes: Does the loss function matter? <em>(Clayton Sanford, 2021-11-04)</em></h1>
<p><em>This is the tenth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Once again, we discuss a paper that shows how hard-margin support vector machines (SVMs) (or maximum-margin linear classifiers) can experience benign overfitting when the learning problem is over-parameterized.
The paper, <a href="https://arxiv.org/abs/2005.08054" target="_blank">“Classification vs regression in overparameterized regimes: Does the loss function matter?”</a>, was written by Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu (my advisor!), and Anant Sahai.</p>
<p>While the kinds of results are similar to the ones discussed in <a href="/2021/10/28/cl20.html" target="_blank">last week’s post</a>, the methodology is quite different. Rather than studying the properties of the iterates of gradient descent, this paper shows that the minimum-norm linear regression and SVM solutions coincide in the over-parameterized regime, so the two models behave similarly in those cases; this phenomenon is known as <em>support vector proliferation</em> and is discussed in depth by <a href="https://arxiv.org/abs/2009.10670" target="_blank">a follow-up paper by Daniel, Vidya, and Ji (Mark) Xu</a> and by <a href="https://arxiv.org/abs/2105.14084" target="_blank">my NeurIPS paper with Navid Ardeshir and Daniel</a>.</p>
<p>To make the point, the paper considers a narrow regime of data distributions and categorizes those distributions to determine (1) when the outputs of OLS regression and SVM classification coincide and (2) when each of those have favorable generalization error as the number of samples \(n\) and dimension \(d\) trend towards infinity.
We introduce their <em>bilevel ensemble</em> input distribution and their <em>1-sparse linear model</em> for determining labels.
Their results show that under similar conditions to those explored in BLLT19, benign overfitting is possible for classification algorithms like SVMs.
Indeed, for their distributional assumptions, benign overfitting is more common for classification than regression.</p>
<h2 id="ols-and-svm">OLS and SVM</h2>
<p>A key part of this paper’s story relies on the coincidence of support vector machines for classification and ordinary least squares for regression.
We introduce the two models and clarify why one might expect them to have similar solutions for the high-dimensional setting.</p>
<p>From last week, we define the hard-margin SVM classifier to be \(x \mapsto \text{sign}(\langle w_{SVM}, x\rangle)\) where</p>
\[w_{SVM} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}\).
This classifier maximizes the margins of linearly separable training data.
Notably, a training sample \((x_i, y_i)\) is a <em>support vector</em> if \(\langle w_{SVM}, x_i\rangle = y_i\), which means that \(x_i\) lies exactly on the margin and is as close as possible to the linear separator.
The hypothesis \(w_{SVM}\) can be alternatively represented as a linear combination of support vectors, which means that all samples not on the margin are irrelevant to the SVM classifier vector.
Traditionally, favorable generalization properties for SVMs are shown for the cases where the number of support vectors is small, which implies some degree of “simplicity” in the model.</p>
<p>If the model is over-parameterized (i.e. \(d > n\)), we define the <em>minimum-norm ordinary least squares (OLS) regression</em> predictor to be \(x \mapsto \langle w_{OLS}, x\rangle\) where</p>
\[w_{OLS} = \mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^d} \|w\|, \text{ such that } \langle w, x_i\rangle = y_i, \ \forall i \in [n],\]
<p>for training data \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}\).
The two programs are the same, except that the labels are \(\{-1, 1\}\) for the SVM and real-valued for OLS, and that the SVM’s inequality constraints become equalities in OLS (for \(\{-1, 1\}\) labels, \(\langle w, x_i\rangle = y_i\) is equivalent to \(y_i \langle w, x_i\rangle = 1\)).</p>
<p>Sufficient conditions for benign overfitting in OLS have been explored in past blog posts, like the ones on <a href="/2021/07/05/bhx19.html" target="_blank">BHX19</a>, <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a>, <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, and <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a>.
Conditions for SVMs were explored in <a href="/2021/10/28/cl20.html" target="_blank">CL20</a>.
This paper unifies the two by showing cases where \(w_{OLS} = w_{SVM}\) and transfers the benign overfitting results from OLS to SVMs.</p>
<p>If we assume that both problems (regression and classification) have \(\{-1, 1\}\) labels, then \(w_{OLS} = w_{SVM}\) is implied by having \(\langle w_{SVM}, x_i\rangle = y_i\) for all \(i\), which means that every sample is a support vector.
This is the support vector proliferation phenomenon briefly discussed before.</p>
<h2 id="data-model">Data model</h2>
<p>They prove their results over a simple data distribution, which is a special case of the distributions explored by BLLT19.
Specifically, they consider <em>bilevel Gaussian ensembles</em>: the features \(x_i\) are drawn independently from a Gaussian distribution with diagonal covariance matrix \(\Sigma\) whose diagonal entries \(\lambda_1, \dots, \lambda_d\), for \(d = n^p\), satisfy</p>
\[\lambda_j = \begin{cases}
n^{p - r - q} & j \leq n^r \\
\frac{1 - n^{-q}}{1 - n^{r - p}} & j > n^r
\end{cases}\]
<p>for \(p > 1\), \(r \in (0, 1)\), and \(q \in (0, p-r)\).
It’s called a bilevel ensemble because the first \(n^r\) coordinates are drawn from higher variance normal distributions than the remaining \(n^p - n^r\) coordinates. A few notes on this model:</p>
<ul>
<li>Because \(p > 1\), \(d = \omega(n)\) and the model is always over-parameterized.</li>
<li>\(r\) governs the number of high-importance features. Because \(r < 1\), there must always be a sublinear number of high-importance features.</li>
<li>If \(q\) were permitted to be \(p - r\), then the model would be spherical or isotropic and have \(\lambda_j = 1\) for all \(j\). On the other hand, if \(q = 0\), \(\lambda_j = 0\) for \(j \geq n^r\) and all of the variance would be on the first \(n^r\) features. Thus, \(q\) modulates how much more variance the high-importance features have than the low-importance features.</li>
<li>
<p>The variances are normalized to have their \(L_1\) norms always be \(d = n^p\):</p>
\[\|\lambda\|_1 = \sum_{j=1}^{d} \lambda_j = n^{r} \cdot n^{p-r-q} + \frac{(n^p - n^r)(1 - n^{-q})}{1 - n^{r-p}} = n^{p-q} + n^p (1 - n^{-q}) = n^p.\]
</li>
<li>
<p>We can compute the effective dimension terms used in the BLLT19 paper:</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} = \begin{cases}
\frac{(n^r - k)n^{p-r-q} + n^p (1 - n^{-q})}{n^{p-r-q}} = \Theta(n^{r + q}) & k < n^r \\
n^p - k & k \geq n^r.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2}= \begin{cases}
\Theta(\min(n^p, n^{r + 2q})) & k < n^r \\
n^p - k & k \geq n^r.
\end{cases}\]
</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/lambdas.jpeg" alt="" /></p>
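The bilevel spectrum is easy to construct and inspect numerically. Here's a small sketch for one concrete parameter setting (the values of \(n\), \(p\), \(r\), and \(q\) are my own illustrative choices, not values from the paper):

```python
import numpy as np

# Build the bilevel spectrum and check the normalization and effective rank.
n, p, r, q = 100, 1.5, 0.5, 0.5
d, k_high = round(n**p), round(n**r)      # d = 1000 features, 10 high-variance
lam = np.full(d, (1 - n**(-q)) / (1 - n**(r - p)))
lam[:k_high] = n**(p - r - q)

print(lam.sum())           # normalization: ||lambda||_1 = n^p = d
print(lam.sum() / lam[0])  # r_0(Sigma) = Theta(n^{r+q}); exactly n^{r+q} here
```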
<p>The labels \(y\) are chosen with the <em>1-sparse linear model</em>, which only considers one of the coordinates. That is, for some \(t \leq n^r\), we let \(w^* = \lambda_t^{-1/2} e_t\), where \(e_t \in \mathbb{R}^d\) is the vector that is all zeroes, except for a one at index \(t\).
Note that \(\|w^*\|^2 = \lambda_t^{-1} = n^{r+q - p}\).
That is, the labels are \(\text{sign}(\langle w^*, x\rangle) = \text{sign}(x_t)\).
<!-- We add noise by flipping the label with probability $$\sigma$$. -->
(For regression, we instead think of the labels as \(\langle w^*, x\rangle = \lambda_t^{-1/2} x_t\).)</p>
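Continuing in the same vein, here's a sketch of sampling the 1-sparse labels over a bilevel spectrum (again, the parameter values are mine, purely for illustration):

```python
import numpy as np

# Sample X ~ N(0, Sigma) row-wise and label with the 1-sparse linear model.
rng = np.random.default_rng(3)
n, p, r, q = 100, 1.5, 0.5, 0.5
d, k_high = round(n**p), round(n**r)
lam = np.full(d, (1 - n**(-q)) / (1 - n**(r - p)))
lam[:k_high] = n**(p - r - q)

t = 0                                    # any index t <= n^r works
w_star = np.zeros(d)
w_star[t] = lam[t] ** -0.5               # w* = lambda_t^{-1/2} e_t
X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows x ~ N(0, Sigma)
y = np.sign(X @ w_star)                  # labels: sign(<w*, x>) = sign(x_t)

print(w_star @ w_star)   # ||w*||^2 = 1/lambda_t = n^{r+q-p}
```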
<p>From this data model alone, we can plug in the bounds of BLLT19 to see what they tell us. <em>Note: There actually isn’t a perfect analogue here, because BLLT includes additive label noise with variance \(\sigma^2\), while this blog post only considers the noiseless case of MNSBHS20. The purpose of these bounds is to illustrate what is known about a similar model.</em></p>
<ul>
<li>If \(r + q > 1\), then \(k^* = \min\{k \geq 0: r_k(\Sigma) = \Omega(n)\}\) is roughly \(n^p - O(n)\), which means that the \(\frac{k^*}{n}\) term of the bound makes the bound vacuous.</li>
<li>
<p>If \(r + q < 1\), then \(k^* = 0\). Then, the BLLT19 bounds yield an excess risk of at most</p>
\[O\left( \|w^*\|^2 \lambda_1 \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{\sigma^2 n}{R_{0}(\Sigma)} \right) = O\left( \sqrt{n^{r + q - 1}}+ \sigma^2 \max(n^{1-p}, n^{1-r-2q}) \right).\]
<p>For this bound to trend towards zero, it must be true that \(r + 2q > 1\) and that \(r+ q < 1\), which is already guaranteed.</p>
</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/bllt.jpeg" alt="" /></p>
<p>The bound given in the paper at hand will look slightly different (e.g., it won’t have the first requirement, because noise is handled differently).
In addition, it will distinguish between benign overfitting in the classification and regression regimes and show that it’s easier to obtain favorable generalization error bounds for regression.</p>
<h2 id="main-results">Main results</h2>
<p>They have two main results: Theorem 1 gives sufficient conditions for the coincidence of the SVM and OLS weights \(w_{SVM}\) and \(w_{OLS}\), and Theorem 2 analyzes the excess errors of both classification and regression.</p>
<h3 id="when-does-svm--ols">When does SVM = OLS?</h3>
<p><em><strong>Theorem 1:</strong> For sufficiently large \(n\), \(w_{SVM} = w_{OLS}\) with high probability if</em></p>
<p>\(\|\lambda\|_1 = \Omega(\|\lambda\|_2 n \sqrt{\log n} + \|\lambda\|_\infty n^{3/2} \log n)\).</p>
<p>Equivalently, it must hold that \(R_0(\Sigma) = \Omega(\sqrt{n}(\log n)^{1/4})\) and \(r_0(\Sigma) = \Omega(n^{3/2} \log n)\).
This holds for the bilevel model when \(r + q > \frac{3}{2}\).</p>
<p>In the <a href="https://arxiv.org/abs/2009.10670" target="_blank">two</a> <a href="https://arxiv.org/abs/2105.14084" target="_blank">follow-ups</a>, this bound is improved to \(r_0(\Sigma) = \Omega(n \log n)\), and the phenomenon is shown <em>not</em> to occur when \(R_0(\Sigma) = O(n \log n)\).
Thus, support vector proliferation can actually be shown to occur for the bilevel model when \(r + q > 1\).</p>
<p>The proof of the theorem in this paper relies on applying bounds on Gaussian concentration and properties of the <a href="https://en.wikipedia.org/wiki/Inverse-Wishart_distribution" target="_blank">inverse-Wishart distribution</a>.
The future results rely on tighter concentration bounds, a leave-one-out equivalence that is true when a sample is a support vector, and a trick that relates the relevant quantities to a collection of independent random variables.</p>
<h3 id="generalization-bounds">Generalization bounds</h3>
<p>Their generalization bounds apply to the OLS solutions for two cases, (1) where the labels are real-valued and (2) where the labels are Boolean \(\{-1,1\}\).
We call the minimum norm solutions of these \(w_{OLS, real}\) and \(w_{OLS, bool}\).
Thus, when \(r\) and \(q\) are large enough for Theorem 1 to guarantee that OLS = SVM, the bounds for Boolean labels apply to the SVM as well.</p>
<p><em><strong>Theorem 2:</strong> For a bilevel data model that is 1-sparse without label noise, the classification error \(\lim_{n \to \infty} \mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0]\) and regression excess MSE error \(\lim_{n \to \infty} \mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2]\) satisfy the following for the given settings of \(p\), \(q\), and \(r\):</em></p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{OLS, bool}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (0, 1)\)</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>1</td>
<td>\(\frac12\)</td>
</tr>
</tbody>
</table>
<p>This table tells us several things about the differences in generalization between classification and regression.</p>
<ul>
<li>When \(\Sigma\) spreads its variance relatively evenly between the high-importance and low-importance coordinates, and when there are relatively few high-importance coordinates, both classification and regression tend to generalize well.
The reverse is true when there is a sharp cut-off between variances and when there are many high-importance features.
This fits a similar intuition to BLLT19, which forbids too sharp a decay of variances.</li>
<li>One might observe that this doesn’t have the other requirement from BLLT: that the variances do not decay too gradually, which is enforced by \(r + 2q > 1\). This is absent here because this data model does not include label noise, so the risk of a model being corrupted by overfitting noisy labels is minimized.</li>
<li>There is also a regime in between where classification generalizes well, but regression does not.</li>
</ul>
<p><img src="/assets/images/2021-11-04-mnsbhs20/ols.jpeg" alt="" /></p>
<p>By combining the improved results on support vector proliferation with Theorem 2, we can obtain the following table of results for SVM vs OLS.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Classification error \(w_{SVM}\)</td>
<td>Regression error \(w_{OLS, real}\)</td>
</tr>
<tr>
<td>\(r + q \in (1, \frac{p+1}{2})\)</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>\(r + q \in (\frac{p+1}{2}, p)\)</td>
<td>1</td>
<td>\(\frac12\)</td>
</tr>
</tbody>
</table>
<p><img src="/assets/images/2021-11-04-mnsbhs20/svm.jpeg" alt="" /></p>
<p>How do these generalization bounds work? They’re similar to the flavor of argument given in <a href="/2021/07/16/mvss19.html" target="_blank">MVSS19</a>, which considers signal bleed and signal contamination.
Put roughly, an interpolating model can perform poorly if either the true signal gets split up among a bunch of orthogonal aliases that each interpolate the training data (signal bleed), or too many spurious correlations are incorporated into the chosen alias (signal contamination).
They assess and bound these notions by introducing the <em>survival</em> and <em>contamination</em> terms as</p>
\[\mathsf{SU}(w, t) = \frac{w_t}{w^*_t} = \sqrt{\lambda_t} w_t \ \text{and} \ \mathsf{CN}(w, t) = \sqrt{\mathbb{E}\left[\Big(\sum_{j\neq t} w_j x_j\Big)^2\right]} = \sqrt{\sum_{j \neq t} \lambda_j w_j^2}.\]
<p>This formulation is clean thanks to the 1-sparse assumption on the labels.
It seems possible to write something similar without this data model, but it would probably require much uglier expressions and more complex distributional assumptions to make the proof work.</p>
<p>The proof then uses Proposition 1 to relate the classification and regression errors to the survival and contamination terms and concludes by using Lemmas 11, 12, 13, 14, and 15 to place upper and lower bounds on those terms. Prop 1 shows the following relationships:</p>
\[\mathrm{Pr}[\langle x, w_{OLS, bool}\rangle\langle x, w^*\rangle < 0] = \frac12 - \frac1{\pi} \tan^{-1} \left(\frac{\mathsf{SU}(w, t)}{\mathsf{CN}(w, t)} \right)\]
\[\mathbb{E}[\langle x, w^* - w_{OLS, real} \rangle^2] = (1 - \mathsf{SU}(w, t))^2 + \mathsf{CN}(w, t)^2\]
<p>From looking at these terms, it should be intuitive why classification error is more likely to go to zero than regression error: It is sufficient for \(\mathsf{CN}(w, t)\) to become arbitrarily small for the classification error to approach zero, even if \(\mathsf{SU}(w, t)\) is a constant smaller than 1. On the other hand, it must be the case that \(\mathsf{CN}(w, t)\to 0\) <em>and</em> \(\mathsf{SU}(w, t)\to 1\) for the regression error to go to zero.</p>
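To make this asymmetry concrete, here's a tiny evaluation of the two formulas from Prop 1 at an illustrative \((\mathsf{SU}, \mathsf{CN})\) pair of my own choosing:

```python
import numpy as np

# Prop 1's error formulas: classification only needs CN/SU to vanish,
# while regression needs SU -> 1 as well.
def classification_error(su, cn):
    return 0.5 - np.arctan(su / cn) / np.pi

def regression_error(su, cn):
    return (1.0 - su) ** 2 + cn ** 2

su, cn = 0.5, 0.01    # survival stuck at 1/2, contamination tiny
print(classification_error(su, cn))   # under 1%: the sign is almost always right
print(regression_error(su, cn))       # about (1 - SU)^2 = 0.25
```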
<p>The concentration bounds in the lemmas are gory and I don’t plan to go into them here. They rely on a slew of concentration bounds that are made possible by the Gaussianity of the inputs and the tight control of their variances.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>This was another really interesting paper for me, although I wasn’t quite brave enough to venture through all of the proofs of this one.
It’s primarily interesting as a proof of concept; the assumptions are prohibitively restrictive (only one relevant coordinate, Gaussian inputs), but the proofs would have been sickening to the point of being unreadable if many of these assumptions were dropped. This paper was an inspiration for my collaborators and me to investigate support vector proliferation in more depth, and these are a nice complement to CL20, which proves bounds for more restricted values of \(d\) and without relying on limits.</p>
<p>Thanks for joining me once again! The next entry (and possibly the last entry of this series) will be posted next week. When the actual exam occurs in two weeks, I might have one last recap post of what’s been discussed so far.</p>
<h1 id="opml9-cl20">[OPML#9] CL20: Finite-sample analysis of interpolating linear classifiers in the overparameterized regime <em>(Clayton Sanford, 2021-10-28)</em></h1>
<p><em>This is the ninth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>Like <a href="/2021/10/20/boosting.html" target="_blank">last week’s post</a>, we’ll step away from linear regression and discuss how over-parameterized <em>classification</em> models can achieve good generalization performance.
Unlike last week’s post, we focus on <em>maximum-margin classifiers</em> (or <em>support vector machines</em>) that interpolate the data in high-dimensional settings.
The paper is called <a href="https://arxiv.org/abs/2004.12019" target="_blank">“Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime”</a> and was written by Niladri Chatterji and Philip Long.</p>
<h2 id="maximum-margin-classifier">Maximum-margin classifier</h2>
<p>Suppose we have some linearly separable training data.
There are many different strategies for choosing a linear separator for those data, and it’s unclear off the bat which ones will generalize best to novel samples.
To sketch the issue, the below visualization shows how two linearly separable classes have many valid hypotheses that interpolate the training data and have zero training error.</p>
<p><img src="/assets/images/2021-10-28-cl20/separators.jpeg" alt="" /></p>
<p>The <em>maximum-margin classifier</em> chooses the separating hyperplane that, well, maximizes the margins between the separator and the two classes.
In the below visualization, the yellow separator is the hyperplane orthogonal to the vector \(w\) that most decisively classifies every positive and negative sample correctly.
That is, none of the samples are close to the separator, and \(w\) is chosen to have the largest <em>margin</em>, or gap between the data and the separator.
The space between the solid separator and the two dashed lines is the margin, a sort of demilitarized zone between the two classes of samples.</p>
<p><img src="/assets/images/2021-10-28-cl20/margin.jpeg" alt="" /></p>
<p>In order to quantify the margin, we require that \(w\) satisfy \(y_i \langle w, x_i\rangle \geq 1\) for labels \(y_i \in \{-1,1\}\).
Under this requirement, the width of the margin can be computed to be at least \(\frac1{\|w\|}\).
Therefore, the maximum-margin classifier is</p>
\[\mathop{\mathrm{arg\ min}}_{w \in \mathbb{R}^p} \|w\|, \text{ such that } y_i \langle w, x_i\rangle \geq 1, \ \forall i \in [n],\]
<p>where \((x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}\) are the training samples.</p>
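For intuition, here's a minimal sketch that solves this program with an off-the-shelf constrained optimizer on a tiny made-up dataset (the data points are mine, purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Solve min ||w|| s.t. y_i <w, x_i> >= 1, then read off the support vectors.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

res = minimize(
    lambda w: 0.5 * w @ w,                      # minimizing ||w||^2 / 2
    x0=np.zeros(2),
    jac=lambda w: w,
    constraints=[{"type": "ineq", "fun": lambda w: y * (X @ w) - 1.0}],
    method="SLSQP",
)
w = res.x
margins = y * (X @ w)
print(w)        # roughly [0.5, 0.5] for this dataset
print(margins)  # samples whose margin is exactly 1 are the support vectors
```

The constraints that hold with equality (margin exactly 1; here the first and third points) identify the support vectors.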
<p>Last week’s blog post discussed in detail why maximum-margin classifiers can lead to good generalization. Primarily, having large margins means that the classifier is robust and will correctly classify samples drawn near any of the training samples.
This is great, and provided a ton of insight into why overfitting is not always a bad thing. However, those results were limited in their applicability:</p>
<ul>
<li>They only apply to voting-based classifiers with margins, and this maximum-margin classifier does not <em>directly</em> aggregate together multiple weak classifiers. (One could think of a linear combination of features as being a combination of other classifiers, but those are not explicitly spelled out in the maximum-margin classifier.)</li>
<li>Their bounds only apply to perfectly clean training data; if an \(\eta\)-fraction of the samples have incorrect labels, then their bounds fall apart.</li>
</ul>
<p>This paper suggests that these kinds of bounds are possible for the maximum margin classifier when the dimension is much larger than the number of samples.</p>
<p><em>Aside: Their formulation of the maximum-margin classifier is identical to that of the</em> support vector machine (SVM)<em>. The samples that lie on the margin (in our case, two red samples and two blue samples on the dotted lines) are</em> support vectors<em>, which the separator can be written in terms of. Classical capacity-based generalization approaches for SVMs rely on having few support vectors, but <a href="https://arxiv.org/abs/2005.08054" target="_blank">some</a> <a href="https://arxiv.org/abs/2011.09148" target="_blank">recent</a> <a href="https://arxiv.org/abs/2104.13628" target="_blank">works</a> have shown that generalization bounds can be proved in a setting with many support vectors. <a href="https://arxiv.org/abs/2105.14084" target="_blank">One of my papers</a>, which will appear at NeurIPS 2021 (and which I’ll discuss in a forthcoming blog post), proves when</em> support vector proliferation<em>, a phenomenon where every sample is a support vector, occurs.</em></p>
<h2 id="data-model">Data model</h2>
<p>Like the linear regression papers we’ve discussed, this paper exhibits the phenomenon of benign overfitting under strict distributional assumptions. We present a simplified version of their data model below.</p>
<ul>
<li>A label \(\tilde{y} \in \{-1,1\}\) is chosen by a coin flip. With probability \(\eta\) (which can be no larger than some constant less than 1), the label is <em>corrupted</em> and \(y = - \tilde{y}\). Otherwise, \(y = \tilde{y}\).</li>
<li>For some <em>mean vector</em> \(\mu \in \mathbb{R}^p\) and some \(q\) drawn from a \(p\)-dimensional subgaussian distribution with a lower-bound on expected norm, the input \(x\) is chosen to be \(q + \tilde{y} \mu\).</li>
</ul>
<p>That is, the inputs fall into one of two clusters: around \(\mu\) if \(\tilde{y} = 1\) and around \(-\mu\) if \(\tilde{y} = -1\).
Intuitively, this means the learning problem is much easier if \(\mu\) is large, because the clusters will be more sharply separated.</p>
<p><img src="/assets/images/2021-10-28-cl20/data.jpeg" alt="" /></p>
<p>The data model is limited by the fact that they assume this kind of two-cluster structure. However, it’s intended as a proof of concept of sorts, and the setup allows one to explore how changing the number of samples \(n\), the dimension \(p\), and the distinctiveness of the classes \(\|\mu\|^2\) shapes which bounds are possible.</p>
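A sampling sketch of this data model, with illustrative values of \(\eta\), \(\mu\), and the dimension that I picked myself:

```python
import numpy as np

# Two-cluster model: x = q + y_tilde * mu, with the observed label flipped
# with probability eta.
rng = np.random.default_rng(4)
n, dim, eta = 5000, 10, 0.1
mu = np.zeros(dim)
mu[0] = 3.0                                      # cluster separation ||mu|| = 3

y_clean = rng.choice([-1.0, 1.0], size=n)        # y_tilde
flip = rng.random(n) < eta
y = np.where(flip, -y_clean, y_clean)            # observed, possibly corrupted
Q = rng.standard_normal((n, dim))                # the subgaussian part q
X = Q + y_clean[:, None] * mu                    # x = q + y_tilde * mu

# the *clean* label places each point in its cluster:
print(np.mean(X[:, 0] * y_clean))   # close to ||mu|| = 3
print(flip.mean())                  # close to eta
```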
<p>They give several examples of this data model, and I’ll recount their Example 3, which they call the <em>Boolean noisy rare-weak model.</em>
They sample \(y\) and \(\tilde{y}\) as above.
\(x\) is drawn from a distribution over \(\{-1,1\}^p\), where \(x_1, \dots, x_s\) independently equal \(\tilde{y}\) with probability \(\frac12 + \gamma\) and \(-\tilde{y}\) otherwise, for some \(s \leq p\) and \(\gamma \in (0, \frac12)\). \(x_{s+1}, \dots, x_p\) are the results of independent fair coin tosses.</p>
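Here's a quick sampler for this example (the particular values of \(n\), \(p\), \(s\), and \(\gamma\) are my own choices):

```python
import numpy as np

# Boolean noisy rare-weak model: s signal features, p - s fair coins.
rng = np.random.default_rng(5)
n, p_dim, s, gamma = 20_000, 50, 5, 0.25

y_clean = rng.choice([-1.0, 1.0], size=n)             # y_tilde
X = rng.choice([-1.0, 1.0], size=(n, p_dim))          # fair coins everywhere...
agree = rng.random((n, s)) < 0.5 + gamma              # ...except that each of the
X[:, :s] = np.where(agree, y_clean[:, None], -y_clean[:, None])
# first s features matches y_tilde with probability 1/2 + gamma

signal = X[:, :s].T @ y_clean / n
noise = X[:, s:].T @ y_clean / n
print(signal)   # each entry near 2 * gamma = 0.5
print(noise)    # each entry near 0
```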
<h2 id="main-result">Main result</h2>
<p>Their main result is a generalization bound for this two-cluster data model.
The result relies on several assumptions about \(n\), \(d\), and \(\mu\).</p>
<p><em><strong>Theorem 4:</strong> Suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\|\mu\|^2n, n^2 \log n))\), (3) \(\|\mu\|^2 = \Omega(\log n)\), and (4) \(p = O(\|\mu\|^4 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>The main inequality is a bound on the generalization error of the classifier \(w\) because it deals with new samples, rather than the ones used to train the classifier.
The \(\eta\) term in the error is unavoidable, because any sample will be corrupted with probability \(\eta\).
The \(\epsilon\) term is the more interesting one, which governs the excess error.</p>
<p>The requirement that \(p = \Omega(n^2 \log n)\) means the model must be in a <em>very</em> high-dimensional regime. Recall that papers like <a href="/2021/07/23/hmrt19.html" target="_blank">HMRT19</a> consider a regime where \(p = \Theta(n)\); here, this paper only says anything about generalization when \(p\) is much larger than \(n\). We also require pretty specific conditions about \(\mu\).</p>
<p>To make life easier, let \(\mu = (c, 0, \dots, 0) \in \mathbb{R}^p\). The excess error can then only be small if \(c \gg p^{1/4}\). Since it must also be the case that \(c \ll \sqrt{p/ n}\), this gives a relatively narrow interval that \(c\) can belong to.</p>
<p>They formulate the theorem specifically for the example we consider as well.</p>
<p><em><strong>Corollary 6:</strong> For the Boolean noisy rare-weak model, suppose (1) \(n\) is at least some constant, (2) \(p = \Omega(\max(\gamma^2 s n, n^2 \log n))\), (3) \(\gamma^2 s = \Omega(\log n)\), and (4) \(p = O(\gamma^4 s^2 / \log(1/\epsilon))\) for some \(\epsilon >0\). Then,</em></p>
\[\mathrm{Pr}_{x,y}[\mathrm{sign}(\langle w, x\rangle) \neq y] \leq \eta + \epsilon,\]
<p><em>where \(w\) solves the max-margin optimization problem.</em></p>
<p>This means that if \(\gamma\) is some constant like \(0.25\), it must be true that \(s \gg \sqrt{p}\) and \(s \ll p/n\).
Therefore, only a small fraction of the dimensions of \(x\) can be indicative of the label \(y\), and most of the input is just noise.
Or, if \(s = p\) and every feature is significant, then \(\gamma\) must satisfy \(\gamma \ll 1/\sqrt{n}\) and \(\gamma \gg 1 / p^{1/4}\), which means that each feature will only have a minute amount of signal.
This closely resembles the kinds of settings that we showed have good generalization for linear regression in <a href="/2021/07/11/bllt19.html" target="_blank">BLLT19</a> long ago.</p>
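<p>To get a feel for these constraints, here’s a quick numeric sanity check (plain Python; the values of \(\gamma\), \(n\), and \(p\) are made up, and constants and log factors are ignored) that the admissible window for \(s\) is nonempty once \(p\) is much larger than \(n^2\):</p>

```python
# Hypothetical parameter values, ignoring constants and log factors.
gamma, n = 0.25, 100
p = n ** 3                      # comfortably inside the p >> n^2 log n regime

s_lo = p ** 0.5 / gamma ** 2    # condition (4), p = O(gamma^4 s^2), forces s >~ sqrt(p)/gamma^2
s_hi = p / (gamma ** 2 * n)     # condition (2), p = Omega(gamma^2 s n), forces s <~ p/(gamma^2 n)

print(s_lo, s_hi)               # the admissible window for s
```

<p>With these numbers the window is \([16{,}000, 160{,}000]\): a real but narrow band of sparsity levels, exactly as the corollary suggests.</p>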
<h2 id="proof-overview">Proof overview</h2>
<p>The proof relies on a result by <a href="https://arxiv.org/abs/1710.10345" target="_blank">SHNGS18</a> showing that optimizing logistic regression with gradient descent on separable data yields a separating hyperplane that maximizes the margin.
That is, gradient descent with a logistic loss function has an implicit bias that leads to the same solution as an SVM.</p>
<p>In Lemma 9, they use a simple concentration bound to show that the generalization error is small if \(\langle w, \mu\rangle\) is large, where \(\mu\) is the mean vector and \(w\) is the learned classifier.
They relate this to the classifiers obtained in each step of gradient descent \(v^{(t)}\) and bound \(\langle v^{(t)}, \mu\rangle\) by expanding the gradient step to write \(v^{(t)}\) in terms of all previous risks.
Taking a limit of \(t \to \infty\) relates this to the maximum-margin classifier.</p>
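<p>As a sanity check on this implicit-bias claim, here’s a toy experiment (not from the paper; pure NumPy on a hand-picked separable dataset) that runs gradient descent on the logistic loss and verifies that the resulting direction separates the data with a healthy margin:</p>

```python
import numpy as np

# A tiny linearly separable dataset (labels in {-1, 1}).
X = np.array([[3., 1.], [2., -1.], [4., 0.], [-3., -1.], [-2., 1.], [-4., 0.]])
y = np.array([1., 1., 1., -1., -1., -1.])

w = np.zeros(2)
for _ in range(20000):
    z = y * (X @ w)
    # Gradient of the average logistic loss log(1 + exp(-y * w.x)).
    grad = -(X * (y / (1 + np.exp(z)))[:, None]).mean(axis=0)
    w -= 0.1 * grad

direction = w / np.linalg.norm(w)
margins = y * (X @ direction)
print(margins.min())   # minimum margin of the normalized GD solution
```

<p>The norm of \(w\) grows without bound on separable data, but the <em>direction</em> stabilizes near the max-margin separator, which is the content of the SHNGS18 result.</p>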
<p>Lemma 10 lower-bounds the target inner product. A key component of the proof of that is Lemma 14, which shows that the loss caused by any one sample cannot be much more than that of any other sample with high probability.
This is important because it means that the noisy samples (with flipped \(\mu\)) cannot have outsize impact on the result, and that the analysis is robust to those errors.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>This paper was neat, since it showed something similar to what was uncovered about minimum-norm linear regression by a variety of papers previously surveyed.
It’s neat to also see this as a strengthening of the margin work discussed last week under boosting, since these results work for samples with noisy labels and for non-voting margin-based classifiers.</p>
<p>However, they’re limited by degree of over-parameterization/the size of the dimension needed; \(p = \Omega(n^2 \log n)\) is a pretty steep requirement, especially since results like my <a href="https://arxiv.org/abs/2105.14084" target="_blank">OLS=SVM paper</a> suggest that minimum-norm regression (with samples drawn with labels in \(\{-1,1\}\)) and maximum-margin classifiers coincide when \(p = \Omega(n \log n)\).
They specifically identify the improvement on the dependence of \(p\) as motivation for future work, and I hope to see that tackled at some point.</p>
<p><em>Thanks for reading this week’s entry! The actual exam is coming up on November 16th, and you should expect at least two more posts about papers before then!</em></p>Clayton Sanford[OPML#8] FS97 & BFLS98: Benign overfitting in boosting2021-10-20T00:00:00+00:002021-10-20T00:00:00+00:00http://blog.claytonsanford.com/2021/10/20/boosting<!-- [[OPML#8]](/2021/10/20/boosting.html){:target="_blank"} -->
<p><em>This is the eighth of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p><em>In other news, there’s <a href="https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011/" target="_blank">a cool Quanta article</a> that touches on over-parameterization and the analogy between neural networks & kernel machines that just came out. Give it a read!</em></p>
<p>When conducting research on the theoretical study of neural networks, it’s common to joke that one’s work was “scooped” by a paper in the 1990s.
There’s a lot of classic ML theory work that was published well before the deep learning boom of the last decade.
As a result, it’s common for researchers to ignore it and unknowingly repackage old ideas as novel.</p>
<p>This week, I finally escape my pattern of discussing papers from the ’10s and ’20s by presenting a pair of seminal papers from the late ’90s: <a href="https://www.sciencedirect.com/science/article/pii/S002200009791504X" target="_blank">FS97</a> and <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-26/issue-5/Boosting-the-margin--a-new-explanation-for-the-effectiveness/10.1214/aos/1024691352.full" target="_blank">BFLS98</a>.
Both of these papers cover <em>boosting</em>, a learning algorithm that aggregates many <em>weak learners</em> (heuristics that perform just better than chance) into a much better prediction rule.</p>
<ul>
<li>FS97 introduces the <em>AdaBoost</em> algorithm, proves that it can combine weak learners to perfectly fit a training dataset, and gives generalization bounds based on VC-dimension.
The authors note that empirically, the algorithm performs much better than these capacity-based bounds and exhibits some form of <em>benign overfitting</em> (which has been extensively discussed in posts like <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="/2021/09/11/xh19.html" target="_blank">[OPML#6]</a>).</li>
<li>BFLS98 addresses that mystery and resolves it by giving a different type of generalization bound, a <em>margin-based bound</em>, which explains why the generalization performance of AdaBoost continues to improve after it correctly classifies the training data.</li>
</ul>
<p>These papers fit into the series because they exhibit a very similar phenomenon to the one we frequently encounter with over-parameterized linear regression and in deep neural networks:
A learning algorithm is trained to zero training error and has small generalization error, despite capacity-based generalization bounds suggesting that this should not occur.
Moreover, the generalization error continues to decrease as the model becomes “more over-parameterized” and continues to train beyond zero training error.
These papers highlight the significance of <em>margin bounds</em>, which have been studied in papers <a href="https://arxiv.org/abs/1909.12292" target="_blank">like</a> <a href="https://arxiv.org/abs/1706.08498" target="_blank">these</a> in the context of neural network generalization.</p>
<p>We’ll jump in by explaining boosting, before discussing capacity-based and margin-based generalization bounds and the connection to benign overfitting.</p>
<h2 id="boosting">Boosting</h2>
<p>We motivate and discuss the boosting algorithm presented in FS97.</p>
<h3 id="population-training-and-generalization-errors">Population, training, and generalization errors</h3>
<p>To motivate the problem, consider a setting where the goal is to learn a classifier from training data.
That is, you (the learner) have \(m\) samples \(S = \{(x_1, y_1), \dots, (x_m, y_m)\} \subset X \times \{-1,1\}\) drawn independently from some distribution \(\mathcal{D}\).
The goal is to learn some <em>hypothesis</em> \(h: X \to \{-1,1\}\) with low population error, that is</p>
\[\text{err}_{\mathcal{D}}(h) = \text{Pr}_{(x, y) \sim \mathcal{D}}[h(x) \neq y].\]
<p>To do so, we follow the strategy of <em>empirical risk minimization</em>, that is choosing the \(h\) that minimizes <em>training error</em>:</p>
\[\text{err}_S(h) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{h(x_i) \neq y_i\}.\]
<p>Often, the goal is to obtain a <em>PAC learning</em> (Probably Approximately Correct learning) guarantee, which entails showing that there exists some learning algorithm that, with probability \(1 - \delta\), returns a hypothesis \(h\) such that \(\text{err}_{\mathcal{D}}(h) \leq \epsilon\), in time polynomial in \(\frac{1}{\epsilon}\) and \(\frac{1}{\delta}\), for any small \(\epsilon, \delta > 0\).</p>
<p>We can decompose the population error into two terms and analyze when algorithms succeed and fail based on the two:</p>
\[\text{err}_{\mathcal{D}}(h) = \underbrace{\text{err}_{\mathcal{D}}(h)-\text{err}_S(h)}_{\text{generalization error}} + \underbrace{\text{err}_S(h)}_{\text{training error}}.\]
<p>This framing implies two very different types of failure modes.</p>
<ol>
<li>If the training error is large when \(h\) is an empirical risk minimizing hypothesis, then there is a problem with expressivity. In other words, there is no hypothesis that closely fits the training data, which means that there is very likely no hypothesis that will succeed on random samples drawn from \(\mathcal{D}\).</li>
<li>If the generalization error is large, then the sample \(S\) is not representative of the distribution \(\mathcal{D}\). <em>Overfitting</em> refers to the issue where the training error is small and the generalization error is large; the hypothesis does a good job memorizing the training data, but it learns little of the actual underlying learning rule because there aren’t enough samples. This typically occurs when \(h\) comes from a family of hypotheses that are <em>too complex.</em></li>
</ol>
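<p>In code, both error quantities are just empirical averages of the 0-1 loss. Here’s a minimal illustration (my own toy example, not from the papers) with a fixed hypothesis and a three-point sample:</p>

```python
import numpy as np

def h(X):
    # A fixed hypothesis: predict the sign of the first coordinate.
    return np.where(X[:, 0] > 0, 1, -1)

def err(h, X, y):
    # Training error err_S(h): the fraction of misclassified samples.
    return np.mean(h(X) != y)

X = np.array([[1., 0.], [-1., 0.], [2., 0.]])
y = np.array([1, 1, -1])
print(err(h, X, y))   # h is wrong on the second and third points
```

<p>The population error \(\text{err}_{\mathcal{D}}(h)\) is the same average taken over fresh samples from \(\mathcal{D}\) rather than the training set.</p>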
<p>We can visualize these trade-offs with respect to the model complexity below, as they’re understood by traditional capacity-based ML theory. (There’s a very similar image in the introductory post of this blog series.)</p>
<p><img src="/assets/images/2021-10-20-boosting/descent.jpeg" alt="" /></p>
<p>While these blog posts focus on problematizing this picture by exhibiting cases where there is <em>both</em> overfitting and low generalization error, we introduce boosting in the context of solving the opposite problem: What do you do when the model complexity is too low, and no hypotheses do a good job of even fitting the training data?</p>
<h3 id="limitations-of-linear-classifiers">Limitations of linear classifiers</h3>
<p>Consider the following picture:</p>
<p><img src="/assets/images/2021-10-20-boosting/redblue.jpeg" alt="" /></p>
<p>Suppose our goal is to find the best linear classifier that separates the red data (+1) from the blue data (-1) and (ideally) will also separate new red data from new blue data.
However, there’s an immediate problem: no linear classifier achieves training error better than \(\frac13\) on this data. For instance, the following separator (which labels everything with \(\langle w, x\rangle > 0\) red and everything else blue) for some vector \(w \in \mathbb{R}^2\) performs poorly on the upper “slice” of red points and the lower slice of blue points.
<p><img src="/assets/images/2021-10-20-boosting/line1.jpeg" alt="" /></p>
<p>Neither of these are any good either.</p>
<p><img src="/assets/images/2021-10-20-boosting/line23.jpeg" alt="" /></p>
<p>All three of the above linear separators have roughly a \(\frac23\) probability of classifying a sample correctly, but they each miss a different slice of the data.
A natural question to ask is: Can these three separators be combined in some way to improve the training error of the classifier?</p>
<p>The answer is yes. By taking a <em>majority vote</em> of the three, one can correctly classify all of the data. That is, if at least two of the three linear classifiers think the point is red, then the final classifier predicts that the point is red.
The following is a visualization of how this voting scheme works. (Maroon regions have 2 separators saying “red” and are classified as red. Purple regions have 2 separators saying “blue” and are classified as blue.)</p>
<p><img src="/assets/images/2021-10-20-boosting/vote.jpeg" alt="" /></p>
<p>We increase the complexity of the model (by aggregating together three different classifiers), which gets us down to zero training error in this case.
This helps solve the issue about approximation–but it presents a new one on generalization. Can we expect this new “voting” classifier to perform well, since it’s more complex than just the linear classifier?</p>
<p><em>Boosting</em> is an algorithm that formalizes this voting logic in order to string together a bunch of weak classifiers into one that performs well on all of the training data. In the last two sections of the blog post, we give two takes on generalization of boosting approaches, to answer the aforementioned question about whether we expect this kind of overfitting to hurt or not.</p>
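<p>To make the voting idea concrete, here’s a small synthetic version of the picture above. The \(\pm 1\) outputs below are made-up stand-ins for the three linear separators: each classifier misses a different pair of the six points, yet the majority vote fits everything.</p>

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1, -1])    # true labels of six points

# Each classifier gets 2/3 of the points right, but misses a different pair.
h1 = np.array([1, 1, -1, -1, -1, 1])   # wrong on points 4 and 5
h2 = np.array([-1, 1, 1, -1, 1, -1])   # wrong on points 0 and 2
h3 = np.array([1, -1, -1, 1, 1, -1])   # wrong on points 1 and 3

vote = np.sign(h1 + h2 + h3)           # majority vote of the three classifiers
print((vote == y).mean())              # fraction of training data fit by the vote
```

<p>Every point is misclassified by at most one of the three voters, so the majority is always correct, even though each individual classifier has training error \(\frac13\).</p>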
<h3 id="weak-learners">Weak Learners</h3>
<p>The linear classifiers above are examples of <em>weak learners</em>, which perform slightly better than chance on the training data and which we combine together to make a stronger learner.</p>
<p>To formalize that concept, we say that a learning algorithm is a <em>weak learning algorithm</em> or a <em>weak learner</em> if it can PAC-learn a family of functions \(\mathcal{C}\) with error \(\epsilon = \frac12 - \eta\) with probability \(1- \delta\) where samples are drawn from some distribution \(\mathcal{D}\).</p>
<p>The idea with weak learning in the context of boosting is that you use the weak learning algorithm to obtain a classifier \(h\) that weak-learns the family over some weighted distribution of the samples.
Then, the distribution can be modified accordingly, in order to ensure that the next weak learner performs well on the samples that the original hypothesis performed poorly on.
In doing so, we gradually find a cohort of weak classifiers, such that each sample is correctly classified by a large number of weak learners in the cohort.</p>
<p><img src="/assets/images/2021-10-20-boosting/wl.jpeg" alt="" /></p>
<p>The graphic visualizes this flow.
The top-right image represents the first weak classifier found on the distribution that samples evenly from the training data. It performs well on at least \(\frac23\) of the samples.
Then, we want the weak learning algorithm to give another weak classifier, but we want it to be different and ensure that other samples are correctly classified, particularly the ones misclassified by the first one.
Therefore, we amplify those misclassified samples in the distribution (bottom-left) and learn a new learning rule on that reweighted distribution.
For that learning rule to qualify as a weak learner, it must classify \(\frac23\) of the <em>weighted</em> samples correctly. To do so, it’s essential that it correctly classifies the previously-misclassified samples.
Hence, it chooses a different rule.
Continuing to iterate this will give a wide variety of weak learners.</p>
<p>This intuition is formalized in the AdaBoost algorithm.</p>
<h3 id="adaboost">AdaBoost</h3>
<p>Here’s how the algorithm works, as stolen from FS97.</p>
<ul>
<li>Input: some input set of samples \((x_1, y_1), \dots, (x_m, y_m)\), a number of rounds \(T\), and a procedure <strong>WeakLearn</strong> that outputs a weak learner given a distribution over samples.</li>
<li>Initialize \(w^1 = \frac{1}{m} \vec{1} \in [0,1]^m\) to be a uniform starting distribution over training samples. (Note: the algorithm in the paper works for a general starting distribution, but we stick to the uniform distribution for simplicity.)</li>
<li>For round \(t \in [T]\), do the following:
<ol>
<li>Update the probability distribution by normalizing the current weight vector: \(p^t = \frac{1}{\|w^t\|_1} w^t.\)</li>
<li>Use <strong>WeakLearn</strong> to obtain a weak learner \(h_t: X \to [-1,1]\).</li>
<li>Calculate the error of \(h_t\) on the <em>weighted</em> training samples: \(\epsilon_t = \frac12 \sum_{i=1}^m p_i^t \lvert h_t(x_i) - y_i\rvert\). (Note: this differs by a factor of \(\frac12\) from the version presented in the paper because we assume the output of the functions to be \([-1,1]\) rather than \([0,1]\).)</li>
<li>Let \(\beta_t = \frac{\epsilon_t}{1 - \epsilon_t} \in (0,1)\) inversely represent roughly how much weight should be assigned to \(h_t\) in the final classifier. (If \(h_t\) has small error, then it’s a “helpful” classifier that should be given more priority.)</li>
<li>Adjust the weight vector by de-emphasizing samples that were accurately classified by \(h_t\). For all \(i \in [m]\), let</li>
</ol>
\[w_i^{t+1} = w_i^t \beta_t^{1 - |h_t(x_i) - y_i|}.\]
</li>
<li>
<p>Output the final classifier, a weighted majority vote of the weak learners:</p>
\[h_f(x) = \text{sign}\left(\sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t} \right).\]
<p>(This also differs from the final hypothesis in the paper because of the difference in output.)</p>
</li>
</ul>
<p>This formalizes the process illustrated above, where we rely on <strong>WeakLearn</strong> to produce learning rules that perform well on samples that have been misclassified frequently in the past.</p>
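<p>Here’s a compact NumPy sketch of the algorithm above, with labels and outputs in \(\{-1,1\}\). The exhaustive decision-stump <strong>WeakLearn</strong> is my own stand-in — the papers leave the weak learning procedure abstract:</p>

```python
import numpy as np

def weak_learn(X, y, p):
    """Stand-in WeakLearn: the best axis-aligned threshold stump under weights p."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for sign in (1, -1):
                h = lambda Z, j=j, thr=thr, sign=sign: sign * np.where(Z[:, j] > thr, 1, -1)
                eps = 0.5 * np.sum(p * np.abs(h(X) - y))    # weighted error, step 3
                if eps < best[0]:
                    best = (eps, h)
    return best[1], best[0]

def adaboost(X, y, T):
    m = len(y)
    w = np.full(m, 1.0 / m)                     # uniform starting weights
    hs, betas = [], []
    for _ in range(T):
        p = w / w.sum()                         # step 1: normalize to a distribution
        h, eps = weak_learn(X, y, p)            # step 2: get a weak learner
        eps = np.clip(eps, 1e-10, None)         # avoid beta = 0 for perfect stumps
        beta = eps / (1 - eps)                  # step 4
        w = w * beta ** (1 - np.abs(h(X) - y))  # step 5: shrink correct, grow mistakes
        hs.append(h); betas.append(beta)
    # Final classifier: weighted majority vote of the weak learners.
    return lambda Z: np.sign(sum(np.log(1 / b) * h(Z) for h, b in zip(hs, betas)))
```

<p>Note that with \(\{-1,1\}\) outputs, the exponent \(1 - \lvert h_t(x_i) - y_i\rvert\) is \(1\) for correctly classified samples (weight multiplied by \(\beta_t &lt; 1\)) and \(-1\) for mistakes (weight multiplied by \(1/\beta_t\)), which is exactly the de-emphasis/amplification described above.</p>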
<p>Why is it called <strong>Ada</strong>Boost?
Unlike previous (less famous) boosting algorithms, it doesn’t require that all of the weak learners have minimum accuracy that is known to the algorithm.
Rather, it can work with all errors \(\epsilon_t\) and hence <em>adapt</em> to the samples given.</p>
<p>It’s natural to ask about the theoretical properties of the algorithm.
Specifically, can AdaBoost successfully aggregate a bunch of weak learners into a “strong learner” that classifies all but an \(\epsilon\) fraction of the training samples for any \(\epsilon\)?
And if so, how many rounds \(T\) are needed?
And how small must we expect \(\epsilon_t\) (the accuracy of each weak learner) to be?
This leads us to the main AdaBoost theorem.</p>
<p><em><strong>Theorem 1</strong> [Performance of AdaBoost on training data, Theorem 6 of FS97]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the error of the final hypothesis \(h_f\) is bounded by</em></p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)}.\]
<p>From this, one can naturally ask: How long will it take to classify all of the training data? For that to be the case, it suffices to show that \(\epsilon < \frac1m\), because there are only \(m\) samples and they cannot be “fractionally” correct.</p>
<p>For the sake of simplicity, we calculate the \(T\) necessary for \(\epsilon_t \leq 0.4\). (That is, each weak learner has advantage at least 0.1.)</p>
\[\epsilon \leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t(1 - \epsilon_t)} \leq 2^T (0.24)^{T/2} = (2 \sqrt{0.24})^T < \frac{1}{m},\]
<p>which occurs when</p>
\[T > \frac{\log m}{\log (1 / (2 \sqrt{0.24}))} \approx 113 \log m.\]
<p>This is a really nice bound to have! It tells us that the training error can be rapidly bounded, despite only having the ability to aggregate classifiers that perform slightly better than chance.</p>
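<p>A quick numeric check of this calculation (plain arithmetic, not from the paper; \(m = 1000\) is an arbitrary choice, and the \(113\) uses base-10 logs):</p>

```python
import math

m = 1000
rate = 2 * math.sqrt(0.24)          # per-round factor when every eps_t <= 0.4
T = math.ceil(113 * math.log10(m))  # number of rounds suggested by the bound
print(T, rate ** T)                 # the bound on epsilon has dropped below 1/m
```

<p>So a few hundred rounds suffice to drive the bound below \(\frac1{m}\) for a thousand samples.</p>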
<p>The proof is simple and elegant, and I’m not going into it much.
It’s well-explained by the paper, but much of it boils down to the intuition that if a training sample is neglected by many weak learners, then its emphasis continues to increase until it can no longer be ignored without meeting the weak learnability error guarantees.</p>
<p>Despite all of these nice things, this theorem is limited. It only covers the performance of the weighted majority classifier on the training data and says nothing about generalization.
Indeed, it’s reasonable to fret about the generalization performance of this aggregate classifier.
If we substantially increased the expressibility of the weak learning classifiers by combining them, then wouldn’t capacity-based generalization theory tell us that this will trade-off generalization?
And isn’t it further compromised by the fact that training for a relatively small number of rounds leads to an aggregate hypothesis that perfectly fits the training data?</p>
<p>We focus for the remainder of the post on generalization, first examining it through the lens of classical capacity-based generalization theory, as done by FS97.</p>
<h2 id="capacity-based-generalization">Capacity-based generalization</h2>
<p>Looking back on the first visual of this post, classical learning theory has a simple narrative for what boosting does:</p>
<ul>
<li>The individual weak classifiers provided by <strong>WeakLearn</strong> lie on the left side of the curve (low generalization error, high training error) because they have a poor training error. Thus, they cannot fit complex patterns and are likely intuitively “simple,” which could translate to a low VC-dimension and hence a low generalization error.</li>
<li>As each stage of the boosting algorithm runs, the aggregate classifier moves further to the right, improving training error at the cost of generalization error. After sufficiently many rounds \(T\) have occurred to drive the training error to zero, the generalization error will be so large as to make any bound on population error vacuous.</li>
</ul>
<p>This intuition is made explicit by the generalization bound presented by FS97, which bounds the VC-dimension of a majority vote of classifiers with individual VC-dimension at most \(d\) and applies the standard VC-dimension bound on generalization.</p>
<p>They get the following bound, which combines their Theorem 7 and Theorem 8.</p>
<p><em><strong>Theorem 2</strong> [Capacity-based generalization bound] Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Suppose <strong>WeakLearn</strong> outputs hypotheses from a class \(\mathcal{H}\) having \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), the following inequality holds for all final hypotheses \(h_f\) that can be returned by AdaBoost:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) \leq \underbrace{\text{err}_{S}(h_f)}_{\text{training error}} + \underbrace{O\left(\sqrt{\frac{dT\log(T)\log(m/dT) + \ln\frac1{\delta}}{m}}\right).}_{\text{generalization error}}\]
<p>This bound fits cleanly into the intuition described above.
To keep the generalization small, \(T\) and \(d\) must be kept small relative to the number of samples. Doing so forces the training error to be large, because Theorem 1 suggests that \(h_f\) will have small training error when (1) AdaBoost runs for many iterates (large \(T\)) or (2) <strong>WeakLearn</strong> produces accurate classifiers, which requires an expressive family of weak learners (large \(d\)).
Hence, we’re necessarily trading off the two types of error.</p>
<p>However, this isn’t the full story.
When running experiments, they confirmed that after many rounds, the training error approached zero (as expected by Theorem 1).
But they also found that the test error dropped along with the training error <em>and</em> that the test error continued to drop even after the training error went to zero.
To explain this phenomenon, we turn to BFLS98, where the authors explain this low generalization error using <em>margin-based</em> bounds rather than capacity-based bounds.</p>
<p><img src="/assets/images/2021-10-20-boosting/general.jpeg" alt="" /></p>
<h2 id="margin-based-generalization">Margin-based generalization</h2>
<p>A key idea in the story about margin-based generalization is that a classifier that correctly and <em>decisively</em> categorizes all the training data is more robust (and more likely to generalize) than one that nearly categorizes samples incorrectly.
Roughly, slightly perturbing the samples in the first case will lead to samples that have the same labels, while that may not be the case in the second case.</p>
<p>Analyzing this requires considering some notion of <em>margin</em>, which quantifies the decisiveness of the classification.
For now, consider a modified version of the weighted majority classifier derived from AdaBoost:</p>
\[h_f(x) = \sum_{t=1}^T h_t(x) \log\frac{1}{\beta_t}.\]
<p>The only difference here is that we dropped the \(\text{sign}\) function, which means the output may be anywhere in \([-1,1]\).
\(h_f\) categorizes the sample \((x,y)\) correctly if \(yh_f(x) > 0\), because the sign of \(h_f\) will then match \(y\).
We say that \(h_f\) categorizes a sample correctly <em>with margin \(\theta > 0\)</em> if \(yh_f(x) \geq \theta\).
This means that–if \(h_f\) is an aggregation of a large number of weak classifiers–then a small number of those classifiers changing their outcomes will not change the overall outcome of \(h_f\).</p>
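<p>Concretely, with the weak-classifier outputs stacked in an array, the margins are one line of NumPy. (This is a made-up example; the vote weights are normalized here so that \(h_f(x)\) genuinely lies in \([-1,1]\).)</p>

```python
import numpy as np

def margins(votes, alphas, y):
    """votes: (T, m) weak-classifier outputs in {-1,1}; alphas: (T,) nonnegative weights."""
    h_f = (alphas @ votes) / alphas.sum()   # normalized weighted vote, in [-1, 1]
    return y * h_f                          # margin y * h_f(x) of each sample

votes = np.array([[1, 1], [1, -1], [1, 1]])   # three classifiers, two samples
out = margins(votes, np.ones(3), np.array([1, -1]))
print(out)
```

<p>The first sample is classified unanimously (margin \(1\)), while the second is classified incorrectly by the vote (margin \(-\frac13\)).</p>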
<p>There are two key steps that lead to new generalization bounds by BFLS98 for AdaBoost.</p>
<ol>
<li>AdaBoost (after sufficiently many rounds \(T\) and with sufficiently small weak learner errors \(\epsilon_t\)) will classify the sample \(S\) correctly with some margin \(\theta\).</li>
<li>Any linear combination of \(N\) classifiers (each of which has bounded VC dimension) with margin \(\theta\) on the training data has a generalization bound that depends on \(\theta\) and <em>not</em> on \(N\).</li>
</ol>
<p>They accomplish (1) by proving a theorem that is very similar in flavor and proof to the Theorem 1 we gave earlier.</p>
<p><em><strong>Theorem 3</strong> [Margins of AdaBoost on training data, Theorem 5 of BFLS98]: Suppose <strong>WeakLearn</strong> generates hypotheses with errors at most \(\epsilon_1,\dots, \epsilon_T\). Then, the final hypothesis \(h_f: X \to [-1,1]\) satisfies the following margin bound on the training set \((x_1, y_1), \dots, (x_m, y_m)\) for any \(\theta \in [0,1)\):</em></p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i) \leq \theta \}\leq 2^T \prod_{t=1}^T \sqrt{\epsilon_t^{1-\theta}(1 - \epsilon_t)^{1 + \theta}}.\]
<p>To make matters more concrete once again, consider the case where \(\epsilon_t \leq 0.4\) as before.
Then, the bound gives</p>
\[\frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} \leq 2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2}.\]
<p>If we want all training samples to obey the condition, we enforce that the margin term is less than \(\frac1{m}\).
Consider two cases:</p>
<ul>
<li>By some calculations (with the help of WolframAlpha), if \(\theta = 0.1\), then \(y_i h_f(x_i) \geq \theta\) for all \(i \in [m]\) if \(T > 7260 \log m\). This is very similar to our application of Theorem 1, albeit with bigger constants.</li>
<li>
<p>If \(\theta = 0.2\), then</p>
\[2^T (0.4)^{T(1- \theta)/2} (0.6)^{T(1 + \theta)/2} = 2^T (0.4)^{0.4T}(0.6)^{0.6T} \approx 1.02^T,\]
<p>which means that the bounds can never guarantee that the margins will be that large with time.</p>
</li>
</ul>
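<p>Both cases boil down to whether the per-round factor \(2\epsilon_t^{(1-\theta)/2}(1-\epsilon_t)^{(1+\theta)/2}\) sits below or above \(1\); a two-line Python check reproduces the dichotomy:</p>

```python
def margin_rate(theta, eps=0.4):
    # Per-round factor in the Theorem 3 bound when every eps_t equals eps.
    return 2 * eps ** ((1 - theta) / 2) * (1 - eps) ** ((1 + theta) / 2)

print(margin_rate(0.1), margin_rate(0.2))  # just below 1, then above 1
```

<p>At \(\theta = 0.1\) the factor is barely below \(1\) (hence the huge \(7260 \log m\) constant), while at \(\theta = 0.2\) it exceeds \(1\) and the bound is vacuous.</p>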
<p>These bounds provide a way of finding a margin \(\theta\) dependent on \(T\) and errors \(\epsilon_1, \dots, \epsilon_T\), which will be useful in the second part.</p>
<p>To get (2), they prove a bound on the combination of weak learners with margin bounds.</p>
<p><em><strong>Theorem 4</strong> [Margin-based generalization; Theorem 2 of BFLS98]: Consider some distribution \(\mathcal{D}\) over labeled data \(X \times \{-1,1\}\) with some sample \(S\) of size \(m\) drawn from \(\mathcal{D}\). Let \(\mathcal{H}\) be a family of “base classifiers” (weak learners) with \(VC(\mathcal{H}) = d\). Then, with probability \(1 - \delta\), any weighted average \(h_f(x) = \sum_{j=1}^T p_j h^j(x)\) for \(p_j \in [0,1]\), \(\sum_j p_j = 1\), and \(h^j \in \mathcal{H}\) satisfies the following inequality:</em></p>
\[\text{err}_{\mathcal{D}}(h_f) = \text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \frac1{m} \sum_{i=1}^m \mathbb1\{y_ih_f(x_i)\leq \theta\} + O\left(\sqrt{\frac{d \log^2(m/d)}{m\theta^2} + \frac{\log(1/\delta)}{m}}\right).\]
<p>This is fantastic compared to Theorem 2 because the generalization bound does not worsen as \(T\) increases.
The opposite effect actually occurs: as AdaBoost continues to run, Theorem 3 shows that the margin increases (up to a point), which strengthens the bound without trade-off!</p>
<p>We can instantiate the bound in the setting described above to show what a nice generalization bound can look like for boosting. If, once again, \(\epsilon_t \leq 0.4\), then taking \(\theta = 0.1\) and \(T = 7260\log m\) gives</p>
\[\text{err}_{\mathcal{D}}(h_f) = O\left(\sqrt{\frac{d \log^2(m/d) + \log(1/\delta)}{m}} \right).\]
<p>In this case, we can have our cake and eat it too; we increase the model complexity and expressivity by increasing \(T\), but we don’t sustain the basic trade-offs between training and generalization error discussed at the beginning of the post.</p>
<p>To illustrate why, we give a high-level overview of the proof and show how the rough intuition that “decisive classification leads to robustness, leads to generalization” holds up.</p>
<ul>
<li>The proof uses an approximation of \(h_f = \sum_{j=1}^T p_j h^j\) by sampling \(N\) classifiers \(\hat{h}_1, \dots, \hat{h}_N\) independently from \(h^1, \dots, h^T\) weighted by \(p_1, \dots, p_T\). It averages them together to obtain \(g = \frac1{N} \sum_{k=1}^N \hat{h}_k.\)</li>
<li>
<p>The proof decomposes the population error term into other quantities by using properties of conditional probability:</p>
\[\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0] \leq \text{Pr}_{\mathcal{D}}\left[y g(x) \leq \frac{\theta}{2}\right] + \text{Pr}_{\mathcal{D}}\left[y g(x) > \frac{\theta}{2}, y h_f(x) \leq 0\right].\]
</li>
<li>The second term can be shown to be small with high probability over \(g\) when \(N\) and \(\theta\) are large, via a Chernoff bound. Since \(h_f = \mathbb{E}[g] = \mathbb{E}[\hat{h}_k]\), it’s unlikely that \(yg(x)\) and \(yh_f(x)\) will differ greatly from one another.</li>
<li>By principles of VC dimension, the <a href="https://en.wikipedia.org/wiki/Sauer%E2%80%93Shelah_lemma" target="_blank">Sauer-Shelah lemma</a>, and concentration bounds (this time over the <em>sample</em>) for large \(m\), the first term will be roughly the same as \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \}.\)</li>
<li>
<p>Using the same conditional probability argument as before, that same term can be decomposed into</p>
\[\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 \} \leq \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \} + \frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i g(x_i) \leq \theta / 2 , y_i h_f(x_i) > \theta\}.\]
</li>
<li>Using Chernoff bounds shows the second term of the expression is small with high probability over \(g\). Thus, \(\text{Pr}_{\mathcal{D}}[y h_f(x) \leq 0]\) is at most approximately \(\frac1{m} \sum_{i=1}^m \mathbb{1}\{ y_i h_f(x_i) \leq \theta \}\), plus an error term that accumulates as a result of the concentration bounds.</li>
<li>Having a large \(\theta\) means that we have plenty of room for the Chernoff bounds over \(g\) to be strong, which corresponds to the <em>robustness</em> discussed before. If \(\theta\) were small, then it would be very easy to have \(yh_f(x) \leq 0\) and \(yg(x) \geq \theta/2\) simultaneously, which would make the argument impossible.</li>
</ul>
<h2 id="last-thoughts">Last thoughts</h2>
<p>I read these boosting papers in 2017 while taking my first graduate seminar, which surveyed a variety of papers in ML theory.
I enjoyed the papers then, but the remarkability of this generalization result was lost on me at the time.
Now, I find this much more exciting because it gives a setting where a model can obtain provably great generalization error despite overfitting the data and being “over-parameterized.” (If we count the number of parameters used in all of the classifiers that vote, there can be many more parameters than samples \(m\).)
The proof is elegant and does not require strange and adversarial distributions over training data.
Granted, the assumption that there exists a weak learner that always returns a classifier with error at most (say) 0.4 is a strong one, but the result is remarkable nonetheless.</p>
<p>Thanks for reading! Leave a comment if you have any thoughts or questions. (As long as the comments system isn’t buggy on your end–I’m still sorting out some issues.) See you next time!</p>Clayton Sanford[OPML#7] BLN20 & BS21: Smoothness and robustness of neural net interpolators2021-09-22T00:00:00+00:002021-09-22T00:00:00+00:00http://blog.claytonsanford.com/2021/09/22/bubeck<p><em>This is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam.
Check out <a href="/2021/07/04/candidacy-overview.html" target="_blank">this post</a> to get an overview of the topic and a list of what I’m reading.</em></p>
<p>This post discusses two papers by Sébastien Bubeck and his collaborators that are of interest to the study of over-parameterized neural networks. The first, <a href="https://arxiv.org/abs/2009.14444" target="_blank">“A law of robustness for two-layers neural networks” (BLN20)</a> with Li and Nagaraj, gives a conjecture about the “robustness” of a two-layer neural network that interpolates all of the training data. The second, <a href="https://arxiv.org/abs/2105.12806" target="_blank">“A universal law of robustness via isoperimetry” (BS21)</a> with Sellke, proves part of the conjecture and extends that part of the conjecture to deeper neural networks.
The other part of the conjecture remains open for future work to tackle.</p>
<p>Both papers consider a setting where there are \(n\) training samples \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}\) drawn from some distribution that are fit by a neural network with \(k\) neurons.
For the two-layer case (which we’ll focus on in this writeup), they consider neural networks of the form</p>
\[f(x) = \sum_{j=1}^k u_j \sigma(w_j^T x + b_j),\]
<p>where \(\sigma(t) = \max(0, t)\) is the ReLU activation function and \(w_j \in \mathbb{R}^d\) and \(b_j, u_j \in \mathbb{R}\) are the parameters.
Roughly, they ask whether there exists a “smooth” neural network \(f\) such that \(f(x_i) \approx y_i\) for all \(i \in [n]\); this makes \(f\) an approximate interpolator.</p>
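<p>To make the setup concrete, here is a minimal numpy sketch of this two-layer architecture (the dimensions and random parameters below are illustrative, not from the papers):</p>

```python
import numpy as np

def two_layer_relu(x, W, b, u):
    """f(x) = sum_j u_j * relu(w_j^T x + b_j), a width-k two-layer network."""
    pre = W @ x + b                # shape (k,): pre-activations w_j^T x + b_j
    return u @ np.maximum(pre, 0.0)

rng = np.random.default_rng(0)
d, k = 5, 8                        # toy input dimension and width
W = rng.standard_normal((k, d))    # the vectors w_j, stacked as rows
b = rng.standard_normal(k)         # biases b_j
u = rng.standard_normal(k)         # outer weights u_j

x = rng.standard_normal(d)
print(two_layer_relu(x, W, b, u))  # a single real-valued output
```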
<p><em>How does this relate to the rest of this blog series?</em>
All of the other posts so far have been about cases where over-parameterized linear regression leads to favorable generalization performance.
These generalization results occur due to the smoothness of the linear prediction rule.
That is, if we have some prediction rule \(x \mapsto \beta^T x\) for \(x, \beta \in \mathbb{R}^d\) with \(d \gg n\), we might have good generalization if \(\|\beta\|_2\) is small, which is enabled when \(d\) is very large.
The same observation holds up with neural networks (over-parameterized models lead to benign overfitting), but it’s harder to prove why it leads to a small generalization error.
Understanding the smoothness of interpolating neural networks might make it easier to prove generalization bounds for the neural networks that perfectly fit the training data.</p>
<p><em>How do they measure smoothness?</em>
For linear regression, it’s natural to think of the smoothness of the prediction rule \(f_{\text{lin}}(x) = \beta^T x\) as \(\|\beta\|_2\), since that is the magnitude of the gradient \(\|\nabla f_{\text{lin}}(x)\|_2\) at every sample \(x\).
For two-layer neural networks—which are non-linear functions—it’s natural instead to consider the maximum norm of the gradient of \(f\), which is represented by the Lipschitz constant of \(f\): the minimum \(L\) such that \(|f(x) - f(x')| \leq L \|x - x'\|_2\) for all \(x, x'\). (Lipschitzness also comes up frequently in my <a href="/2021/08/15/hssv21.html" target="_blank">COLT paper about the approximation capabilities of shallow neural networks</a>.)</p>
<p><em>What does it have to do with robustness?</em>
Typically, robustness is discussed in the context of adversarial examples.
If you’ve hung around the ML community, you’ve probably seen this issue featured in images like this:</p>
<p><img src="/assets/images/2021-09-22-bubeck/panda.png" alt="" /></p>
<p>Here, an image of a panda is provided that a trained image classification neural network clearly identifies as such.
However, a small amount of noise can be added to the image that leads to the network being tricked into thinking that it’s a gibbon instead.
Put roughly, it means that the network outputs \(f(x) = \text{"panda"}\) and \(f(x + \epsilon \tilde{x}) = \text{"gibbon"}\) for some \(x\) and \(\tilde{x}\), which means that the output of \(f\) changes greatly near \(x\).
If \(f\) has a small Lipschitz constant, then these kinds of fluctuations are impossible.
This makes the network \(f\) <em>robust</em>.
Thus, enforcing smoothness conditions is a way to ensure that a predictor is robust to these kinds of adversarial examples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/smooth.jpeg" alt="" /></p>
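<p>A linear predictor (my own illustration, not from the papers) makes the robustness mechanism concrete: if \(f\) is \(L\)-Lipschitz and \(|f(x)|\) is at least some margin, then no perturbation of norm below \(\text{margin}/L\) can flip the sign of \(f\):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
beta = rng.standard_normal(d)
L = np.linalg.norm(beta)      # Lipschitz constant of the linear map x -> beta^T x

x = rng.standard_normal(d)
margin = abs(beta @ x)        # distance of f(x) from the decision threshold 0
eps = 0.9 * margin / L        # perturbations smaller than margin / L ...

for _ in range(1000):         # ... provably cannot change the predicted sign
    delta = rng.standard_normal(d)
    delta *= eps / np.linalg.norm(delta)
    assert np.sign(beta @ (x + delta)) == np.sign(beta @ x)
print("no sign flips")
```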
<p>As a result, Bubeck and his collaborators want to characterize the availability of interpolating networks \(f\) that are also robust, with the hopes of understanding how over-parameterization can be used to avoid having adversarial examples.</p>
<p>One important caveat: Unlike the previous papers discussed in this series, this one focuses only on approximation and not optimization.
It asks whether <em>there exists</em> an interpolating prediction rule that is smooth, but it does not ask whether this rule can be easily obtained from stochastic gradient descent.</p>
<p>For the rest of the post, I’ll discuss the conjecture made by BLN20, share the support for the conjecture that was provided by BLN20 and BS21, and discuss what remains to be studied in this space.</p>
<h2 id="the-conjecture">The conjecture</h2>
<p>For simplicity, BLN20 considers only samples drawn uniformly from the unit sphere: \(x \in \mathbb{S}^{d-1}= \{x \in \mathbb{R}^d: \|x\|_2=1\}\) with iid labels \(y_i \sim \text{Unif}(\{-1,1\})\).
The conjecture of BLN20, which combines their Conjectures 1 and 2, is as follows:</p>
<p><em>Consider some \(k \in [\frac{cn}{d}, Cn]\) for constants \(c\) and \(C\). With high probability over \(n\) random samples from some distribution, there exists a 2-layer neural network \(f\) of width \(k\) that perfectly fits the data such that \(f\) is \(O(\sqrt{n/k})\)-Lipschitz.
Furthermore, any neural network that fits the data must be \(\Omega(\sqrt{n/k})\)-Lipschitz with high probability.</em></p>
<p>If true, the conjecture suggests there can only be an \(O(1)\)-Lipschitz interpolating neural network \(f\) if the model is highly over-parameterized, or \(k = \Omega(n)\).
Note that \(k\) is the number of neurons, and not the number of parameters.
In the case of a 2-layer neural network, the number of parameters is \(p = kd\), so there must be at least \(p = \Omega(nd)\) parameters for the interpolating network to be smooth.</p>
<p>The conditions with constants \(c\) and \(C\) are necessary for the question to be well-posed.</p>
<ul>
<li>Without the \(k \leq Cn\) constraint, the conjecture would imply the existence of neural networks that fit the data and are \(o(1)\)-Lipschitz. However, this is not possible unless all training samples have the same label \(y_i\); otherwise, there are at least two different samples \(x_i\) and \(x_j\) that are at most distance 2 apart (since both lie on \(\mathbb{S}^{d-1}\)) and have opposite labels. This implies that any function fitting both samples must be at least 1-Lipschitz.</li>
<li>Without the \(k \geq \frac{cn}{d}\) constraint, there is unlikely to exist any neural network with \(k\) neurons that can fit the \(n\) samples. Since the number of parameters \(p\) is roughly \(kd\), letting \(k \ll \frac{n}{d}\) would ensure that \(p \ll n\) and there are fewer parameters than samples. Intuitively, it’s difficult to fit a large number of points with random labels when there are fewer parameters than samples. This suggests that the model must be over-parameterized for interpolation to even occur in the first place, let alone be smooth.</li>
</ul>
<p>BLN20 shows that the conjecture holds up empirically on toy data.
For many values of \(n\) and \(k\), they train several neural networks to fit the \(n\) samples with 2-layer neural networks of width \(k\) and randomly sample gradients to find the one with the largest magnitude.
When plotted, they note a nice linear relationship between the norms of the largest random gradient and \(\sqrt{n/k}\).
Of course, the maximum random gradient is not the same as the Lipschitz constant, since it’s impossible to check the gradient for all values of \(x\) simultaneously, but this suggests that it’s likely that the conjecture is correct.</p>
<p><img src="/assets/images/2021-09-22-bubeck/plot.png" alt="" /></p>
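<p>Their estimate can be mimicked in a few lines. The sketch below (random parameters rather than trained networks, so only the sampling methodology matches theirs) lower-bounds the Lipschitz constant of a two-layer ReLU network by the largest gradient norm over random inputs:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 20, 50
W = rng.standard_normal((k, d)) / np.sqrt(d)   # illustrative random parameters
b = rng.standard_normal(k)
u = rng.standard_normal(k) / np.sqrt(k)

def grad_norm(x):
    # gradient of f(x) = sum_j u_j relu(w_j^T x + b_j), wherever it exists:
    # sum_j u_j 1{w_j^T x + b_j > 0} w_j
    active = (W @ x + b > 0).astype(float)
    return np.linalg.norm((u * active) @ W)

# lower estimate of the Lipschitz constant: max gradient over random points
est = max(grad_norm(rng.standard_normal(d)) for _ in range(2000))
print(est)
```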
<h2 id="partial-upper-bounds-from-bln20">Partial upper bounds from BLN20</h2>
<p>The BLN20 paper focuses on presenting the conjecture and giving a series of partial results that suggest it may be true. In this section, we give a brief summary of each of the partial solutions.</p>
<p>The following are all partial solutions to the upper bound. That is, they show weaker versions of the claim that there exists a neural network \(f\) with Lipschitz constant \(O(\sqrt{n/ k})\) by showing either larger bounds on the Lipschitz constant or more restrictive parameter regimes.</p>
<ul>
<li><strong>The high-dimensional case (3.1).</strong> If \(d \gg n\), then a ReLU network with a single neuron \(k = 1\) can be used to perfectly fit the data.
This is because a single \(d\)-dimensional hyperplane will be able to fit the \(n\) samples, so one can just choose the hyperplane with the lowest magnitude that fits the data and use a ReLU that corresponds to that hyperplane. By similar analysis to that of linear regression, the Lipschitz constant of this network will be \(O(\sqrt{n})\) with high probability, which is the same as \(O(\sqrt{n/ k})\). This can’t be improved without using more neurons.
<img src="/assets/images/2021-09-22-bubeck/single.jpeg" alt="" /></li>
<li><strong>The wide (“optimal size”) regime: \(k = n\) (3.2).</strong> With high probability, a \(10\)-Lipschitz network \(f\) can be obtained by using a ReLU for every sample. Each ReLU is treated as a “cap” that gives a sample the correct label. With high probability, the points will be sufficiently spread apart in \(\mathbb{S}^{d-1}\) to ensure that none of the caps overlap. This makes the norm of the gradient never more than \(10\), if each cap is offset by \(\frac{1}{10}\).
<img src="/assets/images/2021-09-22-bubeck/cap.jpeg" alt="" /></li>
<li><strong>The compromise case (3.3).</strong> The two previous approaches can be combined for a broader choice of \(k\) and \(n\) by instead having each ReLU perfectly fit \(m := n/k \leq d\) samples in a cap. However, since these are bigger and more complex caps than before, we need to be more concerned about the caps overlapping. They show that \(O(m \log d)\) caps will overlap at any given point, which means that the Lipschitz constant will be \(O(n\log (d) / k)\). Even disregarding the logarithmic factor, this is still much weaker than the \(O(\sqrt{n/k})\) bound that the conjecture desires.
<img src="/assets/images/2021-09-22-bubeck/combo.jpeg" alt="" /></li>
<li><strong>The very low-dimensional case with a weird architecture (3.4).</strong>
They prove the existence of a neural network that fits \(n\) samples and has Lipschitz constant \(O(\sqrt{n / k})\) with high probability. To do so, however, they need several major caveats:
<ul>
<li>The dimension \(d\) is very small; for some constant even integer \(q\), \(k = C_q d^{q-1}\) and \(n \approx \frac{d^q}{100 q \log d}\), where \(C_q\) depends on \(q\). Note that the number of neurons \(k\) can be much bigger than the number of samples \(n\) when \(d\) is very small and \(q\) is large.</li>
<li>\(f\) approximately interpolates the samples. That is, \(\lvert f(x_i) - y_i\rvert \leq 0.1 C_q\) for all \(i \in [n]\). (Note that 0.1 can be replaced by \(\epsilon\) and the result can be generalized.)</li>
<li>The neural network uses the activations \(t \mapsto t^q\) and not the ReLU function.</li>
</ul>
<p>This can be thought of as a tensor interpolation problem. Specifically, for \(q = 2\), they perform regression on the space \(x^{\otimes 2} = (x_1^2, x_1 x_2, \dots, x_1 x_d, x_2 x_1, x_2^2, \dots, x_d^2)\) using the quadratic activation function.
This approach gives the kind of bound they’re looking for, but is a strange enough case that it’s unclear how to extend this to networks with (1) high input dimensions, (2) perfect interpolation, and (3) ReLU activations.</p>
</li>
</ul>
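<p>The high-dimensional case (3.1) is easy to see numerically. In the sketch below (toy sizes, my own illustration), with \(d \gg n\) the minimum-norm linear interpolator fits random labels exactly, and its gradient norm lands on the \(O(\sqrt{n})\) scale:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 500                                  # d >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # samples on the unit sphere
y = rng.choice([-1.0, 1.0], size=n)             # random labels

beta = np.linalg.pinv(X) @ y                    # minimum-norm interpolating hyperplane
assert np.allclose(X @ beta, y)                 # perfect fit
print(np.linalg.norm(beta), np.sqrt(n))         # gradient norm vs the sqrt(n) scale
```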
<p>The paper also gives a few constrained versions of the lower bound on the Lipschitz constant for any interpolating function. However, we omit them here because the second paper—BS21—has much better lower bounds.</p>
<h2 id="lower-bound-from-bs21">Lower bound from BS21</h2>
<p>The follow-up paper proves a mostly-tight lower bound, which effectively resolves half of the conjecture.
The results require <em>isoperimetry</em> to hold, which is true of a random variable \(x \in \mathbb{R}^d\) if \(f(x)\) has subgaussian tails for every Lipschitz function \(f\).
This holds for well-known distributions such as (1) multivariate Gaussian distributions, (2) the uniform distribution on \(\mathbb{S}^{d-1}\), and (3) the uniform distribution on the hypercube \(\{-1, 1\}^d\).</p>
<p>By combining their Lemma 3.1 and Theorem 3, the following statement is true about 2-layer neural networks:</p>
<p><em>Let \(\mathcal{F}\) be a family of 2-layer neural networks of width \(k\) with parameters in \([-W, W]\). Suppose each sample \((x_i, y_i)\), \(i \in [n]\), is drawn from an isoperimetric distribution with \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\) and such that \(\| x_i \|_2 \leq R\) almost surely. Then, with high probability, any neural network \(f \in \mathcal{F}\) that perfectly fits all \(n\) training samples will have a Lipschitz constant of</em></p>
\[\Omega\left(\sqrt{\frac{n}{k \log (W R nk)}}\right).\]
<p>This is close to the conjecture up to logarithmic factors! In addition, this result is more general in the paper:</p>
<ul>
<li>Instead of considering only depth-2 neural networks, they consider all parametric models that change by bounded amounts as their parameter vectors change.</li>
<li>Within their study of neural networks, their analysis also addresses networks that share parameters.</li>
<li>A parameter \(\epsilon\) allows them to conclude that all networks that <em>nearly interpolate</em> must have high Lipschitz constant, not just those that perfectly fit the data.</li>
</ul>
<p>They also show that the bound \(W\) on the parameter magnitudes is necessary: their Theorem 4 gives a neural network with a small Lipschitz constant that approximately fits nearly all of the samples using only a single parameter.
Thus, without these kinds of assumptions, the conjecture is rendered uninformative.</p>
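<p>As a rough numerical check (toy values of my choosing, not from the paper), the logarithmic factor in the lower bound is mild: growing \(W\) and \(R\) by many orders of magnitude changes the bound by only a small constant factor relative to the conjectured \(\sqrt{n/k}\) scale.</p>

```python
import math

n, k = 10_000, 1_000
plain = math.sqrt(n / k)                  # the conjectured sqrt(n/k) scale
for W, R in [(1.0, 1.0), (1e6, 1e3)]:     # modest vs enormous weight/radius bounds
    lower = math.sqrt(n / (k * math.log(W * R * n * k)))
    print(f"W={W:g}, R={R:g}: sqrt(n/k)={plain:.2f}, lower-bound scale={lower:.2f}")
```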
<p>The proof works by considering some fixed \(L\)-Lipschitz function \(f\) and asking how likely it is that \(n\) random samples are almost perfectly fit by \(f\).
By isoperimetry, this can be shown to happen with very low probability.
Then, by making use of an \(\epsilon\)-net argument, one can show that no \(L\)-Lipschitz function \(f\) can perfectly fit the samples.</p>
<p><img src="/assets/images/2021-09-22-bubeck/cover.jpeg" alt="" /></p>
<p>While I breezed over the argument here, it’s a relatively simple one that can be followed by most people with some background in concentration inequalities.</p>
<h2 id="further-questions">Further questions</h2>
<p>While the second paper resolves half of the open question from the first paper, the other half (the existence of a smooth interpolating neural network) remains open.</p>
<p>There are also a few caveats from the second paper that remain to be resolved. For one, it may be possible to loosen the restriction that there be non-zero label noise (i.e. \(\mathbb{E}[\mathrm{Var}[y \mid x]] > 0.1\)).
In addition, the fact that \(\|x_i\|\) must always be bounded is a weakness, since it rules out Gaussian inputs; perhaps this could be improved.</p>
<p>Thanks for tuning in to this week’s blog post! See you next time!</p>Clayton SanfordThis is the seventh of a sequence of blog posts that summarize papers about over-parameterized ML models, which I’m writing to prepare for my candidacy exam. Check out this post to get an overview of the topic and a list of what I’m reading.[OPML#6] XH19: On the number of variables to use in principal component regression2021-09-11T00:00:00+00:002021-09-11T00:00:00+00:00http://blog.claytonsanford.com/2021/09/11/xh19<!-- [XH19](https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf){:target="_blank"} [[OPML#6]](/2021/09/11/xh19.html){:target="_blank"} -->
<p><em>This is the 6th of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<p>Here’s another <a href="https://proceedings.neurips.cc/paper/2019/file/e465ae46b07058f4ab5e96b98f101756-Paper.pdf" target="_blank">paper</a> by my advisor Daniel Hsu and his former student Ji (Mark) Xu that discusses when overfitting works in linear regression.
This one differs subtly from some of the previously discussed papers (like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a> and <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>) in that it considers <em>principal component regression</em> (PCR) rather than least-squares regression.</p>
<h2 id="principal-component-regression">Principal component regression</h2>
<p>Suppose we have a collection of \(n\) samples \((x_i, y_i) \in \mathbb{R}^{N} \times \mathbb{R}\), which we collect in design matrix \(X \in \mathbb{R}^{n \times N}\) and label vector \(y \in \mathbb{R}^n\).
The standard approach to least-squares regression (which has been given numerous times on this blog) is to choose the \(\hat{\beta}_\textrm{LS} \in \mathbb{R}^N\) that minimizes \(\|X \hat{\beta}_\textrm{LS} - y\|_2\), breaking ties by minimizing the \(\ell_2\) norm \(\|\hat{\beta}_{\textrm{LS}}\|_2\).
This approach considers all dimensions of the inputs \(x_i\).</p>
<p>However, there might be a situation where we know the input covariance matrix \(\Sigma\) a priori and only want to consider the directions in \(\mathbb{R}^N\) along which the inputs meaningfully vary.
This is where <a href="https://en.wikipedia.org/wiki/Principal_component_regression" target="_blank">principal component regression</a> comes in.
Instead of regressing on the training data itself, we regress on the \(p\) most significant dimensions of the data, as identified by <a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal component analysis</a> (PCA).
PCA is a linear dimensionality reduction method that obtains a lower-dimensional representation of \(X\) by approximating each sample as a linear combination of the \(p\) eigenvectors of \(X^T X\) with the largest corresponding eigenvalues.
These \(p\) eigenvectors correspond to the directions in \(\mathbb{R}^N\) where the samples in \(X\) have highest variance.
Moreover, projecting each of the \(n\) samples \(x_i\) onto the space spanned by these \(p\) eigenvectors provides the closest average \(\ell_2\)-approximation of each \(x_i\) as a linear combination of \(p\) fixed vectors in \(\mathbb{R}^N\).</p>
<p>Let \(\mathbb{E}[x_i] = 0\) and \(\Sigma = \mathbb{E}[x_i x_i^T]\) be the covariance matrix of \(x_i\).
If we know \(\Sigma\) ahead of time, then we can simplify things by using only the eigenvectors of \(\Sigma\), rather than the empirical principal components taken from eigenvectors of \(X^T X\).
If the \(p\) eigenvectors of \(\Sigma\) with the largest eigenvalues are collected in \(V \in \mathbb{R}^{N \times p}\), then we can express the low-dimensional representation of the training samples as \(X V \in \mathbb{R}^{n \times p}\).
By applying linear regression to these new low-dimensional samples and transforming the resulting parameter vector back to \(\mathbb{R}^N\), we get the parameter vector \(\hat{\beta} = V(X V)^{\dagger} y\), where \(\dagger\) denotes the pseudo-inverse.
(On the other hand, the least-squares parameter vector is \(\hat{\beta}_\textrm{LS} = X^{\dagger} y\).)</p>
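<p>The two estimators differ in only a few lines of numpy. The sketch below (toy sizes and a synthetic covariance of my choosing) contrasts them in the classical regime \(n > N\): least squares fits noiseless linear labels exactly, while PCR, which only sees the top \(p\) directions, generally does not, and its parameter vector lies in the span of \(V\):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, p = 40, 20, 5
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))        # eigenvectors of Sigma
lam = 1.0 / (1.0 + np.arange(N))                        # decaying eigenvalues
X = (rng.standard_normal((n, N)) * np.sqrt(lam)) @ Q.T  # rows x_i ~ N(0, Q diag(lam) Q^T)
y = X @ rng.standard_normal(N)                          # noiseless linear labels

V = Q[:, :p]                              # top-p eigenvectors of Sigma
beta_pcr = V @ np.linalg.pinv(X @ V) @ y  # PCR: regress on XV, map back to R^N
beta_ls = np.linalg.pinv(X) @ y           # ordinary least squares

print(np.linalg.norm(X @ beta_ls - y))    # ~ 0: least squares fits exactly here
print(np.linalg.norm(X @ beta_pcr - y))   # > 0: PCR discards N - p directions
```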
<p>The below image visualizes the differences between the least squares and PCR regression algorithms.
It shows a toy example where samples \((x, y)\) (in purple) vary greatly in one direction and not much at all in another direction.
PCR only considers the direction of maximum variance and rules the other out, while least squares considers all directions simultaneously.
Therefore, the hypotheses represented by the green hyperplanes look subtly different for each case.</p>
<p><img src="/assets/images/2021-09-11-xh19/vis.jpeg" alt="" /></p>
<p>Note that this formulation of PCR concerns an idealized setting.
Most regression tasks do not give the learner direct access to \(\Sigma\).
However, it’s possible that \(\Sigma\) could be separately estimated with \(\hat{\Sigma}\) and then applied by PCA.
The authors refer to this setting as “semi-supervised” because \(\Sigma\) can be estimated using only unlabeled samples, since none of the labels \(y\) are used in the approximation.
Due to the high cost of obtaining labeled data, a sufficient dataset for this kind of estimate may be significantly easier to obtain than a dataset for the general learning task.</p>
<h2 id="learning-model-and-assumptions">Learning model and assumptions</h2>
<p>They make several restrictive assumptions.
The main purpose of this paper is to construct instances where favorable over-parameterization occurs for PCR, rather than exhaustively catalogue when it must occur.</p>
<p>They assume the samples \(x_i\) have independent Gaussian components and that labels \(y_i = \langle x_i, \beta\rangle\) have no noise.
\(\Sigma\) is a diagonal matrix (which must be the case because of the independent components of each \(x_i\)) with entries \(\lambda_1 > \dots > \lambda_N > 0\).
Therefore, PCR will only use the first \(p\) diagonal entries of \(\Sigma\) and the reduced-dimension version of each sample will merely be its first \(p\) entries.</p>
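<p>A quick check of that last claim (toy sizes, my own illustration): when \(\Sigma\) is diagonal with decreasing entries, the PCR estimator coincides with ordinary regression on the first \(p\) coordinates, padded with zeros.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, p = 30, 10, 4
lam = np.linspace(1.0, 0.1, N)                  # lambda_1 > ... > lambda_N > 0
X = rng.standard_normal((n, N)) * np.sqrt(lam)  # independent Gaussian components
y = X @ rng.standard_normal(N)                  # noiseless labels

V = np.eye(N)[:, :p]                            # top-p eigenvectors of diag(lam)
beta_pcr = V @ np.linalg.pinv(X @ V) @ y        # PCR estimator
beta_trunc = np.linalg.pinv(X[:, :p]) @ y       # regress on the first p columns only

assert np.allclose(beta_pcr[:p], beta_trunc)    # same in the first p coordinates
assert np.allclose(beta_pcr[p:], 0.0)           # zero in the remaining ones
print("PCR with diagonal Sigma = truncated regression")
```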
<p>One weird thing about this paper relative to others is that the true parameter vector \(\beta\) is chosen randomly.
This means it’s an “average-case” bound.
They justify this on the grounds that the ability to choose an arbitrary \(\beta\) could lead to all of the weight being put on the \(N-p\) components that will not be included in the PCA’d version of \(X\).
This would make it impossible to have non-trivial error bounds.</p>
<h2 id="over-parameterization-and-pcr">Over-parameterization and PCR</h2>
<p>Now, we have three parameters to consider (\(N, p, n\)), rather than the two (\(p, n\)) typically considered in the previous works on over-parameterization.
As before, they think of over-parameterization as the ratio \(\gamma = \frac{p}{n}\), but they must also contend with the ratios \(\alpha = \frac{p}{N}\) (the fraction of dimensions preserved by PCA) and \(\rho = \frac{n}{N}\) (the ratio of samples to original dimension).</p>
<p>Like <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>, they consider what happens when \(N, p, n \to \infty\) and the ratios remain fixed.
Like BLLT19, their results study how over-parameterization is affected as the eigenvalues of \(\Sigma\) change.
In Section 2, they focus on eigenvalues \(\lambda_1, \dots, \lambda_N\) that decay predictably at a polynomial rate.
Theorems 1 and 2/3 characterize what happens to the expected error in the under-parameterized (\(\gamma \leq 1\)) and over-parameterized (\(\gamma > 1\)) regimes, respectively.</p>
<ul>
<li>Theorem 1 shows that the shape of the “classical” regime error curve is preserved in the under-parameterized regime: for fixed \(\rho\), the error decreases as \(\alpha\) increases up to a point, after which it increases until \(\alpha = \rho\) (equivalently, \(p = n\)).</li>
<li>Theorem 2 shows that the expected error in the interpolation regime \(p > n\) converges to some fixed risk quantity, which can be determined by evaluating an integral and solving for some quantity.</li>
<li>Theorem 3 shows that for any polynomial rate of decay of the eigenvalues, double-descent will occur and the best interpolating prediction rule will perform better than the best “classical” prediction rule.
In the noisy setting, the best interpolating prediction rule will only outperform the best classical rule in the event that the rate of decay is no faster than \(\frac{1}{i}\).</li>
</ul>
<p>To recap, the optimal performance for PCR is obtained in the over-parameterized regime (with \(p > n\)) if and only if eigenvalues \(\lambda_1, \dots, \lambda_N\) decay slowly; rapid decay leads to optimality in the classical regime.
This echoes the results of BLLT19, which shows that too rapid a decay in eigenvalues causes poor performance in the over-parameterized regime (very-much-not-benign overfitting).
However, BLLT19 also requires that the rate of decay not be too slow, which is a non-issue in this regime.</p>
<p>One of the nice things about this paper–which will be expanded on in the weeks to come–is that it separates the number of parameters \(p\) from the dimension \(N\).
Talking about over-parameterization in linear regression is often awkward because the two quantities are coupled, and we are forced to ask whether favorable behavior in the over-parameterized regime is caused by the high dimension or the high parameter count.
We’ll further examine models with separate dimensions and parameter counts when we study random feature models.</p>Clayton SanfordHow many neurons are needed to approximate smooth functions? A summary of our COLT 2021 paper2021-08-15T00:00:00+00:002021-08-15T00:00:00+00:00http://blog.claytonsanford.com/2021/08/15/hssv21<p>In the past few weeks, I’ve written several summaries of others’ work on machine learning theory.
For the first time on this blog, I’ll discuss a paper I wrote, which was a collaboration with my advisors, <a href="http://www.cs.columbia.edu/~rocco/" target="_blank">Rocco Servedio</a> and <a href="https://www.cs.columbia.edu/~djhsu/" target="_blank">Daniel Hsu</a>, and another Columbia PhD student, <a href="http://www.cs.columbia.edu/~emvlatakis/" target="_blank">Manolis Vlatakis-Gkaragkounis</a>.
It will be presented this week at <a href="http://learningtheory.org/colt2021/" target="_blank">COLT (Conference on Learning Theory) 2021</a>, which is happening in-person in Boulder, Colorado.
I’ll be there to discuss the paper and learn more about other work in ML theory.
(Hopefully, I’ll put up another blog post after about what I learned from my first conference.)</p>
<p>The paper centers on a question about neural network approximability; namely, how wide does a shallow neural network need to be to closely approximate certain kinds of “nice” functions?
This post discusses what we prove in the paper, how it compares to previous work, why anyone might care about this result, and why our claims are true.
The post is not mathematically rigorous, and it gives only a high-level idea about why our proofs work, focusing more on pretty pictures and intuition than the nuts and bolts of the argument.</p>
<p>If this interests you, you can check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> to learn more about the ins and outs of our work.
There are also two talks—a 90-second teaser and a 15-minute full talk—and a comment thread available on the <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">COLT website</a>.
This blog post somewhat mirrors the longer talk, but the post is a little more informal and a little more in-depth.</p>
<p>On a personal level, this is my first published computer science paper, and the first paper where I consider myself the primary contributor to all parts of the results.
I’d love to hear what you think about this—questions, feedback, possible next steps, rants, anything.</p>
<h2 id="i-whats-this-paper-about">I. What’s this paper about?</h2>
<h3 id="a-broad-background-on-neural-nets-and-deep-learning">A. Broad background on neural nets and deep learning</h3>
<p>As I discuss in the <a href="/2021/07/04/candidacy-overview.html" target="_blank">overview post for my series on over-parameterized ML models</a>, the practical success of deep learning is poorly understood from a mathematical perspective.
Trained neural networks exhibit incredible performance on tasks like image recognition, text generation, and protein folding analysis, but there is no comprehensive theory of why their performance is so good.
I often think about three different kinds of questions about neural network performance that need to be answered.
I’ll discuss them briefly below, even if only the first question (approximation) is relevant to the paper at hand.</p>
<ol>
<li>
<p><strong>Approximation:</strong> A neural network is a type of mathematical function that can be represented as a hierarchical arrangement of artificial neurons, each of which takes as input the output of previous neurons, combines them together, and returns a new signal. These neurons are typically arranged in <em>layers</em>, where the number of neurons per layer is referred to as the <em>width</em> and the number of layers is the <em>depth</em>.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn.jpeg" alt="" /></p>
<p>Mathematically, each neuron is a function of the outputs of neurons in a previous layer. If we let \(x_1,x_2, \dots, x_r \in \mathbb{R}\) be the outputs of the \(L\)th layer, then we can define a neuron in the \((L+1)\)th layer as \(\sigma(b + \sum_{i=1}^r w_i x_i)\) where \(b \in \mathbb{R}\) is a <em>bias</em>, \(w \in \mathbb{R}^r\) is a weight vector, and \(\sigma: \mathbb{R} \to \mathbb{R}\) is a nonlinear <em>activation function</em>.
If the parameters \(w\) and \(b\) are carefully selected for every neuron, then many layers of these neurons allow for the representation of complex prediction rules.</p>
<p>For instance, if I wanted a neural network to distinguish photos of cats from dogs, the neural network would represent a function mapping the pixels from the input image (which can be viewed as a vector) to a number that is 1 if the image contains a dog and -1 if the image has a cat. Typically, each neuron will correspond to some kind of visual signal, arranged hierarchically based on the complexity of the signal. For instance, a low-level neuron might detect whether a region of the image contains parallel lines. A mid-level neuron may correspond to a certain kind of fur texture, and a high-level neuron could identify whether the ears are a certain shape.</p>
<p><img src="/assets/images/2021-08-15-hssv21/nn-cat.jpeg" alt="" /></p>
<p>This opens up questions about the expressive properties of neural networks: What kinds of functions can they represent and what kinds can’t they? Does there have to be some kind of “niceness” property of the “pixels to cat” map in order for it to be expressed by a neural network? And how large does the neural network need to be in order to express some kind of function? How does increasing the width increase the expressive powers of the network? How about the depth?</p>
<p><em>This paper asks questions like these about a certain family of shallow neural networks. We focus on abstract mathematical functions—there will be no cats or dogs here—but we believe that this kind of work will better help us understand why neural networks work as well as they do.</em></p>
</li>
<li>
<p><strong>Optimization:</strong> Just because there exists a neural network that can represent the prediction rule you want doesn’t mean it’s possible to algorithmically find that function. The \(w\) and \(b\) parameters for each neuron cannot be feasibly hard-coded by a programmer due to the complexity of these kinds of functions. Therefore, we instead <em>learn</em> the parameters by making use of training data.</p>
<p>To do so, a neural network is initialized with random parameter choices. Then, given \(n\) <em>training samples</em> (in our case, labeled images of cats and dogs), the network tunes the parameters in order to come up with a function that predicts correctly on all of the samples. This procedure involves using an optimization algorithm like <em>gradient descent</em> (GD) or <em>stochastic gradient descent</em> (SGD) to tune a good collection of parameters.</p>
<p>However, there’s no guarantee that such an algorithm will be able to find the right parameter settings.
GD and SGD work great in practice, but they’re only guaranteed to work for a small subset of optimization problems, such as <em>convex</em> problems.
The training loss of neural networks is non-convex and isn’t one of the problems that can be provably solved with GD or SGD; thus, there’s no guarantee of convergence here.</p>
<p><em>There’s lots of interesting work on optimization, but I don’t really go into it in this blog.</em></p>
</li>
<li>
<p><strong>Generalization:</strong> I’ll be brief about this, since I discuss it a lot more in <a href="/2021/07/04/candidacy-overview.html" target="_blank">my series on over-parameterized ML models</a>. Essentially, it’s one thing to come up with a function that can correctly predict the labels of fixed training samples, but it’s another entirely to expect the prediction rule to <em>generalize</em> to new data that hasn’t been seen before.</p>
<p>The ML theory literature has studied the problem of generalization extensively, but most of the theory about this focuses on simple settings, where the number of parameters \(p\) is much smaller than the number of samples \(n\). Neural networks often live in the opposite regime; these complex and hierarchical functions often have \(p \gg n\), which means that classical statistical approaches to generalization don’t predict that neural networks will perform well.</p>
<p><em>Many papers have tried to explain why over-parameterized models exceed expectations in practice, and I discuss some of those in my other series. But again, this paper does not go into this.</em></p>
</li>
</ol>
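<p>A minimal NumPy sketch of the gradient-descent training loop described in the optimization item above. The toy target \(y = \lvert x \rvert\), the width, the learning rate, and the hand-derived backpropagation are all illustrative choices of mine, not anything from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d regression problem: learn y = |x| on [-1, 1].
X = np.linspace(-1.0, 1.0, 64)[:, None]   # (n, d) with d = 1
y = np.abs(X).ravel()

r = 16                                    # hidden width
W = rng.normal(size=(r, 1))               # bottom-layer weights
b = rng.normal(size=r)                    # bottom-layer biases
u = 0.1 * rng.normal(size=r)              # top-layer weights

lr, losses = 0.05, []
for _ in range(500):
    pre = X @ W.T + b                     # (n, r) pre-activations
    act = np.maximum(pre, 0.0)            # ReLU
    err = act @ u - y                     # (n,) residuals
    losses.append(float(np.mean(err ** 2)))
    # Hand-derived gradients of the mean squared loss.
    g_out = 2.0 * err / len(y)
    g_pre = (g_out[:, None] * u) * (pre > 0.0)
    u -= lr * (act.T @ g_out)
    W -= lr * (g_pre.T @ X)
    b -= lr * g_pre.sum(axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.4f}")
```

<p>The loss here decreases steadily even though the objective is non-convex, which is exactly the empirically-reliable-but-not-provably-guaranteed behavior the bullet above describes.</p>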
<h3 id="b-more-specific-context-on-approximation">B. More specific context on approximation</h3>
<p>As mentioned above, this paper (and hence this post) focuses on the first question of approximation. In particular, it discusses the representational power of a certain family of shallow neural networks. (Typically, “shallow” means depth-2—or one-hidden layer—and “deep” means any networks of depth 3 or more.)</p>
<p>There’s a well-known result about depth-2 networks that we build on: The <em>Universal Approximation Theorem</em>, which states that for any continuous function \(f\), there exists some depth-2 network \(g\) that closely approximates \(f\). (We’ll define “closely approximates” later on.)
Three variants of this result were proved in 1989 by <a href="https://www.sciencedirect.com/science/article/abs/pii/0893608089900038" target="_blank">three</a> <a href="https://www.semanticscholar.org/paper/Multilayer-feedforward-networks-are-universal-Hornik-Stinchcombe/f22f6972e66bdd2e769fa64b0df0a13063c0c101" target="_blank">different</a> <a href="https://link.springer.com/article/10.1007/BF02551274" target="_blank">papers</a>.
Here’s a <a href="http://neuralnetworksanddeeplearning.com/chap4.html" target="_blank">blog post</a> that gives a nice explanation of why these universal approximation results are true.</p>
<p>At first glance, it seems like this would close the question of approximation entirely; if a depth-2 neural network can express any kind of function, then there would be no need to question whether some networks have more approximation powers than others. However, the catch is that the Universal Approximation Theorem does not guarantee that \(g\) will be of a reasonable size; \(g\) could be an arbitrarily wide neural network, which obviously is a no-go in the real world where neural networks actually need to be computed and stored.</p>
<p>As a result, many follow-up papers have focused on the question about which kinds of functions can be <em>efficiently</em> approximated by certain neural networks and which ones cannot. By “efficient,” we mean that we want to show that a function can be approximated by a neural network with a size polynomial in the relevant parameters (the complexity of the function, the desired accuracy, the dimension of the inputs). We specifically <em>do not</em> want a function that requires size exponential in any of these quantities.</p>
<p><em>Depth-separation</em> is an area of study that has focused on studying the limitations of shallow networks compared to deep networks.</p>
<ul>
<li>A <a href="http://proceedings.mlr.press/v49/telgarsky16.html" target="_blank">2016 paper by Telgarsky</a> shows that there exist some very “bumpy” triangular functions that can be approximated by neural networks of depth \(O(k^3)\) with polynomial width, but which require exponential width in order to be approximated by networks of depth \(\Omega(k)\).</li>
<li>Papers by <a href="http://proceedings.mlr.press/v49/eldan16.html" target="_blank">Eldan and Shamir (2016)</a>, <a href="http://proceedings.mlr.press/v70/safran17a.html" target="_blank">Safran and Shamir (2017)</a>, and <a href="http://proceedings.mlr.press/v65/daniely17a.html" target="_blank">Daniely (2017)</a> exhibit functions that separate depth-2 from depth-3. That is, the functions can be approximated by polynomial-size depth-3 networks, but they require exponential width in order to be approximated by depth-2 networks.</li>
</ul>
<p>One thing that these papers have in common is that they all require one of two things.
Either (1) the function is a very “bumpy” one that is highly oscillatory, or (2) the depth-2 networks can partially approximate the function, but cannot approximate it to an extremely high degree of accuracy. A <a href="https://arxiv.org/abs/1904.06984" target="_blank">2019 paper by Safran, Eldan, and Shamir</a> noticed this and asked whether there exist “smooth” functions that have separation between depth-2 and depth-3. This question was inspirational for our work, which poses questions about the limitations of certain kinds of 2-layer neural networks.</p>
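<p>The “bumpy” functions behind these separations can be illustrated with the classic triangle (tent) map: composing it with itself doubles the number of oscillations each time, so depth buys oscillations exponentially fast, and each tent is itself a tiny two-ReLU network (\(T(x) = 2\sigma(x) - 4\sigma(x - 1/2)\) on \([0, 1]\)). This is my own toy illustration in the spirit of Telgarsky’s construction, not his exact function:</p>

```python
import numpy as np

def tent(x):
    # One triangular bump on [0, 1]; equals 2*relu(x) - 4*relu(x - 0.5) there.
    return 1.0 - np.abs(2.0 * x - 1.0)

def tent_k(x, k):
    # k-fold composition of the tent map.
    for _ in range(k):
        x = tent(x)
    return x

x = np.linspace(0.0, 1.0, 2 ** 12 + 1)    # grid containing every peak location

def count_peaks(k):
    y = tent_k(x, k)
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

peak_counts = [count_peaks(k) for k in range(1, 5)]
print(peak_counts)   # the number of oscillations doubles with each composition
```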
<h3 id="c-random-bottom-layer-relu-networks">C. Random bottom-layer ReLU networks</h3>
<p>We actually consider a slightly more restrictive model than depth-2 neural networks. We focus on <em>two-layer random bottom-layer (RBL) ReLU neural networks</em>. Let’s break that down into pieces:</p>
<ul>
<li>
<p>“two layer” means that the neural network has a single hidden layer and can be represented by the following function, for parameters \(u \in \mathbb{R}^r, b \in \mathbb{R}^{r}, w \in \mathbb{R}^{r \times d}\):</p>
\[g(x) = \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>\(r\) is the width of the network and \(d\) is the input dimension.</p>
</li>
<li>“random bottom-layer” means that \(w\) and \(b\) are randomly chosen and then fixed. That means that when trying to approximate a function, we can only tune \(u\). This is also called the <em>random feature model</em> in other papers.</li>
<li>“ReLU” refers to the <em>rectified linear unit</em> activation function, \(\sigma(z) = \max(0, z)\). This is a popular activation function in deep learning.</li>
</ul>
<p>The following graphic visually summarizes the neural network:</p>
<p><img src="/assets/images/2021-08-15-hssv21/rbl.jpeg" alt="" /></p>
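<p>In code, an RBL ReLU network is a frozen random feature map followed by a tunable linear layer, so tuning \(u\) reduces to linear least squares. A minimal sketch; the Gaussian parameter distribution and the target function below are arbitrary placeholders of mine, not the \(\mathcal{D}\) from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 100

# Bottom layer: sampled once (here from a Gaussian), then frozen.
W = rng.normal(size=(r, d))
b = rng.normal(size=r)

def features(x):
    # Random ReLU features sigma(<w_i, x> + b_i).
    return np.maximum(x @ W.T + b, 0.0)

def g(x, u):
    # The RBL network: only the top-layer weights u are tunable.
    return features(x) @ u

# Tuning u is just linear least squares over the frozen features.
X = rng.uniform(-1.0, 1.0, size=(200, d))
target = np.sin(X.sum(axis=1))            # arbitrary smooth target function
u_fit, *_ = np.linalg.lstsq(features(X), target, rcond=None)
mse_fit = float(np.mean((g(X, u_fit) - target) ** 2))
print(mse_fit)
```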
<p>Why do we focus on this family of neural networks?</p>
<ol>
<li>Any positive approximation results about this model also apply to arbitrary networks of depth 2. That is, if we want to show that a function can be efficiently approximated by a depth-2 ReLU network, it suffices to show that it can be efficiently approximated by a depth-2 <em>RBL</em> ReLU network. (This does not hold in the other direction; there exist functions that can be efficiently approximated by depth-2 ReLU networks that <em>cannot</em> be approximated by depth-2 RBL ReLU nets.)</li>
<li>According to papers like <a href="https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html" target="_blank">Rahimi and Recht (2008)</a>, kernel functions can be approximated with random feature models. This means that our result can also be used to comment on the approximation powers of kernels, which Daniel discusses <a href="https://www.cs.columbia.edu/~djhsu/papers/dimension-argument.pdf" target="_blank">here</a>.</li>
<li>Recent research on the <em>neural tangent kernel (NTK)</em> studies the optimization and generalization powers of randomly-initialized neural networks that do not stray far from their initialization during training. The question of optimizing two-layer neural networks in this regime is then similar to the question of optimizing linear combinations of random features. Thus, the approximation properties proven here carry over to that kind of analysis. Check out papers by <a href="https://arxiv.org/abs/1806.07572" target="_blank">Jacot, Gabriel, and Hongler (2018)</a> and <a href="https://arxiv.org/abs/2002.04486" target="_blank">Chizat and Bach (2020)</a> to learn more about this model.</li>
</ol>
<p>Now, we jump into the specifics of our paper’s claims. Later, we’ll give an overview of how those claims are proven and discuss some broader implications of these results.</p>
<h2 id="ii-what-are-the-specific-claims">II. What are the specific claims?</h2>
<p>The key results in our paper are corresponding upper and lower bounds:</p>
<ul>
<li>If the function \(f: \mathbb{R}^d \to \mathbb{R}\) is either “smooth” or low-dimensional, then it’s “easy” to approximate \(f\) with some RBL ReLU network \(g\). (The upper bound.)</li>
<li>If \(f\) is both “bumpy” and high-dimensional, then it’s “hard” to approximate \(f\) with some RBL ReLU net \(g\). (The lower bound.)</li>
</ul>
<p>All of this is formalized in the next few paragraphs.</p>
<h3 id="a-notation">A. Notation</h3>
<p><strong>What do we mean by a “smooth” or “bumpy” function?</strong> As discussed earlier, works on depth separation frequently exhibit functions that require exponential width to be approximated by depth-2 neural networks. However, these functions are highly oscillatory and hence very steep. We quantify this smoothness by using the Lipschitz constant of a function \(f\). \(f\) has Lipschitz constant \(L\) if for all \(x, y \in \mathbb{R}^d\), we have \(\lvert f(x) - f(y)\rvert \leq L \|x - y\|_2\). This bounds the slope of the function and prevents \(f\) from rapidly changing value. Therefore, a function can only be high-frequency (and bounce back and forth rapidly between large and small values) if it has a large Lipschitz constant.</p>
<p>We also quantify smoothness using the Sobolev class of a function in the appendix of our paper. We provide very similar bounds for this case, but we don’t focus on them in this post.</p>
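<p>As a sanity check on this definition, a Lipschitz constant can be estimated from finite differences on a grid; for \(f(x) = \cos(\pi k x)\) the true constant is \(\pi k\). A quick sketch:</p>

```python
import numpy as np

k = 3
f = lambda x: np.cos(np.pi * k * x)

x = np.linspace(-1.0, 1.0, 200_001)
# Largest slope between adjacent grid points: by the mean value theorem it
# never exceeds the true Lipschitz constant pi*k, and it approaches pi*k
# as the grid is refined.
L_hat = float(np.max(np.abs(np.diff(f(x)) / np.diff(x))))
print(L_hat, np.pi * k)
```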
<p><strong>What does it mean to be easy to approximate?</strong> We consider an \(L_2\) notion of approximation over the solid cube \([-1, 1]^d\). That is, we say that \(g\) <em>\(\epsilon\)-approximates</em> \(f\) if</p>
\[\|g - f\|_2 = \sqrt{\mathbb{E}_{x \sim \text{Unif}([-1, 1]^d)}[(g(x) - f(x))^2]} \leq \epsilon.\]
<p>Notably, this is a <em>weaker</em> notion of approximation than the \(L_\infty\) bounds that are used in other papers. If \(f\) can be \(L_\infty\)-approximated, then it can also be \(L_2\)-approximated.</p>
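<p>This \(L_2\) error is easy to estimate by Monte Carlo over the uniform cube. A sketch, using a pair of functions whose true distance I can compute by hand (\(f(x) = x_1\) and \(g = 0\), so \(\|g - f\|_2 = \sqrt{\mathbb{E}[x_1^2]} = 1/\sqrt{3}\)):</p>

```python
import numpy as np

def l2_error(f, g, d, n=200_000, seed=0):
    # Monte Carlo estimate of sqrt(E[(g(x) - f(x))^2]), x ~ Unif([-1, 1]^d).
    x = np.random.default_rng(seed).uniform(-1.0, 1.0, size=(n, d))
    return float(np.sqrt(np.mean((g(x) - f(x)) ** 2)))

f = lambda x: x[:, 0]              # f(x) = x_1
g = lambda x: np.zeros(len(x))     # the trivial approximator g = 0
err = l2_error(f, g, d=4)
print(err, 1.0 / np.sqrt(3.0))     # estimate vs. the exact value 0.577...
```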
<p><strong>What does it mean to be easy to approximate <em>with an RBL ReLU function</em>?</strong>
Since we let \(g\) be an RBL ReLU network that has random weights, we need to incorporate that randomness into our definition of approximation. To do so, we say that we can approximate \(f\) with an RBL network of width \(r\) if with probability \(0.5\), there exists some \(u \in \mathbb{R}^r\) such that the RBL neural network \(g\) with parameters \(w, b, u\) can \(\epsilon\)-approximate \(f\).
The probability is over the random parameters \(w\) and \(b\), which are drawn from some distribution \(\mathcal{D}\).
We define the <em>minimum width</em> needed to approximate \(f\) with respect to \(\epsilon\) and \(\mathcal{D}\) to be the smallest such \(r\).</p>
<p>(The paper also includes \(\delta\), which corresponds to the probability of success. For simplicity, we leave it out and take \(\delta = 0.5\).)</p>
<p>We’re now ready to give our two main theorems.</p>
<h3 id="b-the-theorems">B. The theorems</h3>
<p><em><strong>Theorem 1 [Upper Bound]:</strong> For any \(L\), \(d\), \(\epsilon\), there exists a symmetric parameter distribution \(\mathcal{D}\) such that the minimum width of any \(L\)-Lipschitz function \(f: \mathbb{R}^d \to \mathbb{R}\) is at most</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{O(1)}.\]
<p>The term in this bound can also be written as</p>
\[\exp\left(O\left(\min\left(d \log\left(\frac{L^2}{\epsilon^2 d}+ 2\right), \frac{L^2}{\epsilon^2} \log\left(\frac{d\epsilon^2}{L^2} + 2\right)\right)\right)\right).\]
<p><em><strong>Theorem 2 [Lower Bound]:</strong> For any \(L\), \(d\), \(\epsilon\) and any symmetric parameter distribution \(\mathcal{D}\), there exists an \(L\)-Lipschitz function \(f\) whose minimum width is at least</em></p>
\[{d + L^2/ \epsilon^2 \choose d}^{\Omega(1)}.\]
<p>Thus, the key take-away is that our upper and lower bounds are matching up to a polynomial factor:</p>
<ul>
<li>When the dimension \(d\) is constant, then both terms are polynomial in \(\frac{L}{\epsilon}\), which means that \(L\)-Lipschitz \(f\) can be efficiently \(\epsilon\)-approximated.</li>
<li>When the smoothness-to-accuracy ratio \(\frac{L}{\epsilon}\) is constant, then the terms are polynomial in \(d\), which is also efficiently approximable.</li>
<li>When \(d = \Theta(L / \epsilon)\), then both terms are exponential in \(d\), which makes efficient approximation impossible.</li>
</ul>
<p>These back up our high-level claim from before: efficient approximation of \(f\) with RBL ReLU networks is possible if and only if \(f\) is either smooth or low-dimensional.</p>
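<p>These regimes can also be seen numerically by plugging values into the \(\binom{d + L^2/\epsilon^2}{d}\) bound (ignoring the unspecified \(O(1)\) exponent, so this is only a rough sketch of the shape of the bound):</p>

```python
from math import comb, log

def log_bound(d, ratio):
    # log of binom(d + (L/eps)^2, d), where ratio = L / eps.
    return log(comb(d + int(ratio ** 2), d))

# Fixed d = 2: the bound grows only polynomially in L/eps.
print([round(log_bound(2, r), 1) for r in (10, 100, 1000)])
# Fixed L/eps = 2: the bound grows only polynomially in d.
print([round(log_bound(dd, 2), 1) for dd in (10, 100, 1000)])
# d comparable to L/eps: the log of the bound blows up with d.
print([round(log_bound(dd, dd), 1) for dd in (5, 10, 20)])
```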
<p>Before explaining the proofs, we’ll give an overview about why these results are significant compared to previous works.</p>
<h3 id="c-comparison-to-previous-results">C. Comparison to previous results</h3>
<p>The approximation powers of shallow neural networks have been widely studied in terms of \(d\), \(\epsilon\), and smoothness measures (including Lipschitzness).
Our results are novel because they’re the first (as far as we know) to look closely at the interplay between these values and obtain nearly tight upper and lower bounds.</p>
<p>Papers that prove upper bounds tend to focus on either the low-dimensional case or the smooth case.</p>
<ul>
<li><a href="http://proceedings.mlr.press/v32/andoni14.html" target="_blank">Andoni, Panigrahy, Valiant, and Zhang (2014)</a> show that degree-\(k\) polynomials can be approximated with RBL networks of width \(d^{O(k)}\). Because \(L\)-Lipschitz functions can be approximated by polynomials of degree \(O(L^2 / \epsilon^2)\), one can equivalently say that networks of width \(d^{O(L^2 / \epsilon^2)}\) are sufficient. This works great when \(L /\epsilon\) is constant, but the bounds are bad in the “bumpy” case where the ratio is large.</li>
<li>On the other hand, <a href="https://jmlr.org/papers/v18/14-546.html" target="_blank">Bach (2017)</a> shows \((L / \epsilon)^{O(d)}\)-width approximability results for \(L_\infty\). This is fantastic when \(d\) is small, but not in the high-dimensional case. (This \(L_\infty\) part is more impressive than our \(L_2\) bounds, which means that we don’t strictly improve upon this result in our domain.)</li>
</ul>
<p>Our results are the best of both worlds, since they trade off \(d\) versus \(L /\epsilon\). They also cannot be substantially improved upon because they are nearly tight with our lower bounds.</p>
<p>Our lower bounds are novel because they handle a broad range of choices for \(L/ \epsilon\) and \(d\).</p>
<ul>
<li>The limitations of 2-layer neural networks were studied in the 1990s by <a href="https://www.sciencedirect.com/science/article/pii/S0021904598933044" target="_blank">Maiorov (1999)</a>, and he proves bounds that look more impressive than ours at first glance, since he argues that width \(\exp(\Omega(d))\) is necessary for smooth functions. (He actually looks at Sobolev smooth functions, but the analysis could also be done for Lipschitz functions.) However, these bounds don’t necessarily hold for all choices of \(\epsilon\). Therefore, they don’t say anything about the regime where \(\frac{L}{\epsilon}\) is constant, where it’s impossible to prove a lower bound that’s exponential in \(d\).</li>
<li><a href="https://arxiv.org/abs/1904.00687" target="_blank">Yehudai and Shamir (2019)</a> show that \(\exp(d)\) width is necessary to approximate simple ReLU functions with RBL neural networks. However, their results require that the ReLU be a very steep one, with Lipschitz constant scaling polynomially with \(d\). Hence, this result also only covers the regime where \(\frac{L}{\epsilon}\) is large. Our bounds say something about functions of all levels of smoothness.</li>
</ul>
<p>Now, we’ll break down our argument on a high level, with the help of some pretty pictures.</p>
<h2 id="iii-why-are-they-true">III. Why are they true?</h2>
<p>Before giving the proofs, I’m going to restate the theorems in terms of a combinatorial quantity, \(Q_{k,d}\), which corresponds to the number of \(d\)-dimensional integer lattice points with \(L_2\) norm at most \(k\). That is,</p>
\[Q_{k,d} = \lvert\{K \in \mathbb{Z}^d: \|K\|_2 \leq k \} \rvert.\]
<p>As an example, \(Q_{4,2}\) can be visualized as the number of purple points in the below image:</p>
<p><img src="/assets/images/2021-08-15-hssv21/qkd.jpeg" alt="" width="50%" /></p>
<p>We can equivalently write the upper and lower bounds on the minimum width as \(Q_{2L/\epsilon, d}^{O(1)}\) and \(\Omega(Q_{L/18\epsilon, d})\) respectively. This combinatorial quantity turns out to be important for the proofs of both bounds.</p>
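<p>For small parameters, \(Q_{k,d}\) can be computed by brute-force enumeration, which is a handy sanity check on the combinatorics (in one dimension, for example, \(Q_{k,1} = 2\lfloor k \rfloor + 1\)):</p>

```python
from itertools import product

def Q(k, d):
    # Count integer lattice points K in Z^d with ||K||_2 <= k.
    m = int(k)
    return sum(1 for K in product(range(-m, m + 1), repeat=d)
               if sum(c * c for c in K) <= k * k)

print(Q(4, 2))                      # the count illustrated in the figure above
print([Q(k, 1) for k in range(5)])  # in one dimension, Q(k, 1) = 2k + 1
```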
<p>A key building block for both proofs is an orthonormal basis. I define orthonormal bases in <a href="/2021/07/16/orthogonality.html" target="_blank">a different blog post</a> and explain why they’re useful there. If you aren’t familiar, check that one out. We use the following family of sinusoidal functions as a basis for the \(L_2\) Hilbert space on \([-1, 1]^d\) throughout:</p>
\[\mathcal{T} \approx \{T_K: x \mapsto \sqrt{2}\cos(\pi\langle K, x\rangle): K \in \mathbb{Z}^d\}.\]
<p><em>Note: This is an over-simplification of the family of functions to be easier to write down. Actually, half of the functions need to be sines instead of cosines. However, it’s a bit of a pain to formalize and you can see how it’s written up in the paper. I’m using the \(\approx\) symbol above because this is “morally” the same as the true family of functions, but a lot easier to write down.</em></p>
<p>This family of functions has several properties that are very useful for us:</p>
<ul>
<li>
<p>The functions are orthonormal with respect to the \(L_2\) inner product over the uniform distribution on \([-1, 1]^d\). That is, for all \(K, K' \in \mathbb{Z}^d\),</p>
\[\langle T_K, T_{K'}\rangle = \mathbb{E}_{x}[T_K(x)T_{K'}(x)] = \begin{cases}1 & K = K' \\ 0 & \text{otherwise.} \\ \end{cases}\]
</li>
<li>The functions span the Hilbert space \(L_2([-1,1]^d)\). Put together with the orthonormality, \(\mathcal{T}\) is an orthonormal basis for \(L_2([-1,1]^d)\).</li>
<li>The Lipschitz constant of each of these functions is bounded. Specifically, the Lipschitz constant of \(T_K\) is at most \(\sqrt{2} \pi \|K\|_2\).</li>
<li>The derivative of each function in \(\mathcal{T}\) is also a function that’s contained in \(\mathcal{T}\) (if you include the sines too).</li>
<li>All elements of \(\mathcal{T}\) are ridge functions. That is, they can each be written as \(T_K(x) = \phi(\langle v, x \rangle)\) for some \(\phi:\mathbb{R}\to \mathbb{R}\). Each such function depends on \(x\) only through one direction in \(\mathbb{R}^d\) and is intrinsically one-dimensional. This will be important for the upper bound proof.</li>
<li>If we let \(\mathcal{T}_k = \{T_K \in \mathcal{T}: \|K\|_2 \leq k\}\), then \(\lvert\mathcal{T}_k\rvert = Q_{k,d}\).</li>
</ul>
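<p>The orthonormality claim is easy to verify numerically in one dimension, where \(T_K(x) = \sqrt{2}\cos(\pi K x)\) for \(K \geq 1\). A quick sketch using midpoint-rule integration over \(\text{Unif}([-1, 1])\):</p>

```python
import numpy as np

def T(K, x):
    # One-dimensional basis element T_K(x) = sqrt(2) * cos(pi * K * x), K >= 1.
    return np.sqrt(2.0) * np.cos(np.pi * K * x)

# Midpoint-rule grid for expectations over x ~ Unif([-1, 1]).
x = (np.arange(400_000) + 0.5) / 200_000 - 1.0

i11 = float(np.mean(T(1, x) * T(1, x)))   # <T_1, T_1>, should be ~1
i12 = float(np.mean(T(1, x) * T(2, x)))   # <T_1, T_2>, should be ~0
print(i11, i12)
```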
<p>Now, we’ll use this basis to discuss our proof of the upper bound.</p>
<h3 id="a-upper-bound-argument">A. Upper bound argument</h3>
<p>The proof of the upper bound boils down to two steps. First, we show that the function \(f\) can be \(\frac{\epsilon}{2}\)-approximated by a low-frequency trigonometric polynomial (that is, a linear combination of sines and cosines in \(\mathcal{T}_k\) for some \(k = O(L / \epsilon)\)). Then, we show that this trigonometric polynomial can be \(\frac{\epsilon}{2}\)-approximated in turn by an RBL ReLU network.</p>
<p>For the first step—which corresponds to Lemma 7 of the paper—we apply the fact that \(f\) can be written as a linear combination of sinusoidal basis elements. That is,</p>
\[f(x) = \sum_{K \in \mathbb{Z}^d} \alpha_K T_K(x),\]
<p>where \(\alpha_K = \langle f, T_K\rangle\).
This means that \(f\) is a combination of sinusoidal functions pointing in various directions of various frequencies.
We show that for some \(k = O(L / \epsilon)\),</p>
\[P(x) := \sum_{K \in \mathbb{Z}^d, \|K\|_2 \leq k} \alpha_K T_K(x)\]
<p>satisfies \(\|P - f\|_2 \leq \frac{\epsilon}{2}\).
To do so, we show that all \(\alpha_K\) terms for \(\|K\|_2 > k\) are very close to zero in the proof of Lemma 8.
The argument centers on the idea that if \(\alpha_K\) is large for large \(\|K\|_2\), then \(f\) is heavily influenced by a high-frequency sinusoidal function, which means that \(\|\nabla f(x)\|\) must be large at some \(x\).
However, \(\|\nabla f(x)\| \leq L\) by our smoothness assumption on \(f\), so too large values of \(\alpha_K\) contradict this.</p>
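<p>This truncation step can be watched numerically in one dimension: take the 1-Lipschitz function \(f(x) = \lvert x \rvert\), compute the coefficients \(\alpha_K\) by quadrature, and check that the \(L_2\) error of the truncated series falls off quickly with the cutoff \(k\). This is my own toy check, not the paper’s proof:</p>

```python
import numpy as np

# Midpoint grid for quadrature over Unif([-1, 1]).
x = (np.arange(200_000) + 0.5) / 100_000 - 1.0
f = np.abs(x)                                  # a 1-Lipschitz target

def T(K):
    return np.ones_like(x) if K == 0 else np.sqrt(2.0) * np.cos(np.pi * K * x)

def trunc_error(k):
    # L2 error of P = sum_{K <= k} <f, T_K> T_K, the frequency-k truncation.
    P = sum(np.mean(f * T(K)) * T(K) for K in range(k + 1))
    return float(np.sqrt(np.mean((P - f) ** 2)))

errs = [trunc_error(k) for k in (0, 1, 2, 5, 10)]
print(errs)   # the error drops off quickly as the cutoff k grows
```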
<p>For the second part, we show that \(P\) can be approximated by a linear combination of random ReLUs.
To do so, we express \(P\) as a <em>superposition</em> of, or expectation over, random ReLUs.
We show that there exists some parameter distribution \(\mathcal{D}\) (which depends on \(d, L, \epsilon\), but not on \(f\)) and some bounded function \(h(b, w)\) (which <em>can</em> depend on \(f\)) such that</p>
\[P(x) = \mathbb{E}_{(b, w) \sim \mathcal{D}}[h(b, w)\sigma(\langle w, x\rangle + b)].\]
<p>However, it’s not immediately clear how one could find \(h\) and why one would know that \(h\) is bounded.
To find \(h\), we take advantage of the fact that \(P\) is a linear combination of trigonometric sinusoidal ridge functions by showing that every \(T_K\) can be expressed as a superposition of ReLUs and combining those to get \(h\).
The “ridge” part is key here; because each \(T_K\) is effectively one-dimensional, it’s possible to think of it being approximated by ReLUs, as visualized below:</p>
<p><img src="/assets/images/2021-08-15-hssv21/cos.jpeg" alt="" /></p>
<p>Each function \(T_K\) can be closely approximated by a piecewise-linear ridge function, since it has bounded gradients and because it only depends on \(x\) through \(\langle K, x\rangle\).
Therefore, \(T_K\) can also be closely approximated by a linear combination of ReLUs, because those can easily approximate piecewise linear ridge functions.
This makes it possible to represent each \(T_K\) as a superposition of ReLUs, and hence \(P\) as well.</p>
<p>Now, \(f\) is closely approximated by \(P\), and \(P\) can be written as a bounded superposition of ReLUs.
We want to show that \(P\) can be approximated by a linear combination of a <em>finite and bounded</em> number of random ReLUs, not an infinite superposition of them.
This last step requires sampling \(r\) sets of parameters \((b^{(i)}, w^{(i)}) \sim \mathcal{D}\) for \(i \in \{1, \dots, r\}\) and letting</p>
\[g(x) := \frac{1}{r} \sum_{i=1}^r h(b^{(i)}, w^{(i)}) \sigma(\langle w^{(i)}, x\rangle + b^{(i)}).\]
<p>When \(r\) is large enough, \(g\) is a 2-layer RBL ReLU network that becomes a very close approximation to \(P\), which means it’s also a great approximation to \(f\).
Such a sufficiently large \(r\) can be quantified with the help of standard concentration bounds for Hilbert spaces.
This wraps up the upper bound.</p>
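<p>This sampling step can be illustrated with a toy superposition identity of my own (not the paper’s \(\mathcal{D}\)): for \(x \in [0, 1]\), \(x^2 = \mathbb{E}_{b \sim \text{Unif}[0,1]}[2\sigma(x - b)]\), so averaging \(r\) sampled ReLUs recovers \(x^2\) with error shrinking like \(1/\sqrt{r}\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 101)

def g_r(r):
    # Width-r average of sampled ReLUs; here h(b) = 2 for all b.
    b = rng.uniform(0.0, 1.0, size=r)
    return np.mean(2.0 * np.maximum(x[:, None] - b, 0.0), axis=1)

# Sup-norm error against the target x^2 for increasing widths.
errs = [float(np.max(np.abs(g_r(r) - x ** 2))) for r in (10, 100, 10_000)]
print(errs)   # the error shrinks as the width r grows
```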
<h3 id="b-lower-bound-argument">B. Lower bound argument</h3>
<p>For the lower bounds, we want to show that for any bottom-layer parameters \((b^{(i)}, w^{(i)})\) for \(1 \leq i \leq r\), there exists some \(L\)-Lipschitz function \(f\) such that for any choice of top-layer \(u^{(1)}, \dots, u^{(r)}\):</p>
\[\sqrt{\mathbb{E}_x\left[\left(f(x) - \sum_{i=1}^r u^{(i)} \sigma(\langle w^{(i)}, x\rangle + b^{(i)})\right)^2\right]} \geq \epsilon.\]
<p>This resembles a simpler linear algebra problem:
Fix any vectors \(v_1, \dots, v_r \in \mathbb{R}^N\).
\(\mathbb{R}^N\) has a standard orthonormal basis \(e_1, \dots, e_N\).
Under which circumstances is there some \(e_j\) that cannot be closely approximated by any linear combination of \(v_1, \dots, v_r\)?</p>
<p>It turns out that when \(N \gg r\) there can be no such approximation.
This follows by a simple dimensionality argument.
The span of \(v_1, \dots, v_r\) is a subspace of dimension at most \(r\).
Since \(r \ll N\), an \(r\)-dimensional subspace cannot be close to all \(N\) orthonormal vectors, which live in a much higher-dimensional space and are mutually perpendicular.</p>
<p><img src="/assets/images/2021-08-15-hssv21/span.jpeg" alt="" /></p>
<p>For instance, the above image illustrates the claim for \(N = 3\) and \(r = 2\). While the span of \(v_1\) and \(v_2\) is close to \(e_1\) and \(e_2\), the vector \(e_3\) is far from that plane, and hence is inapproximable by linear combinations of the two.</p>
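<p>This dimension count is easy to verify numerically: project each standard basis vector of \(\mathbb{R}^N\) onto the span of \(r\) random vectors. Since the squared projections sum to (at most) \(r\), some \(e_j\) must have squared residual at least \(1 - r/N\). A sketch:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 50, 5
V = rng.normal(size=(N, r))      # r arbitrary vectors in R^N (as columns)

Q, _ = np.linalg.qr(V)           # orthonormal basis of span(v_1, ..., v_r)
# Squared residual of each e_j after projection onto the span: the j-th row
# of Q holds the coordinates of e_j's projection in that basis.
resid_sq = 1.0 - np.sum(Q ** 2, axis=1)
print(float(resid_sq.max()), 1.0 - r / N)
```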
<p>In our setting, we replace \(\mathbb{R}^N\) with the \(L_2\) Hilbert space over functions on \([-1, 1]^d\); \(v_1, \dots, v_r\) with \(x \mapsto \sigma(\langle w^{(1)}, x\rangle + b^{(1)}), \dots, x \mapsto \sigma(\langle w^{(r)}, x\rangle + b^{(r)})\); and \(\{e_1, \dots, e_N\}\) with \(\mathcal{T}_k\) for \(k = \Omega(L)\).
As long as \(Q_{k,d} \gg r\), then there is some \(O(\|K \|_2)\)-Lipschitz function \(T_K\) that can’t be approximated by linear combinations of ReLU features.
By the assumption on \(k\), \(T_K\) must be \(L\)-Lipschitz as well.</p>
<p>The dependence on \(\epsilon\) can be introduced by scaling \(T_K\) appropriately.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>To reiterate, our results show the capabilities and limitations of 2-layer random bottom-layer ReLU networks.
We show a careful interplay between the Lipschitz constant \(L\) of the function being approximated, the dimension \(d\), and the accuracy parameter \(\epsilon\).
Our bounds rely heavily on orthonormal functions.</p>
<p>Our results have some key limitations.</p>
<ul>
<li>Our upper bounds would be more impressive if they used the \(L_\infty\) notion of approximation, rather than \(L_2\). (Conversely, our lower bounds would be <em>less</em> impressive if they used \(L_\infty\) instead.)</li>
<li>The distribution over training parameters \(\mathcal{D}\) that we end up using for the upper bounds is contrived and depends on \(L, \epsilon, d\) (even if not on \(f\)).</li>
<li>Our bounds only apply when samples are drawn uniformly from \([-1, 1]^d\). (We believe our general approach will also work for the Gaussian probability measure, which we discuss at a high level in the appendix of our paper.)</li>
</ul>
<p>We hope that these limitations are addressed by future work.</p>
<p>Broadly, we think our paper fits into the literature on neural network approximation because it shows that the smoothness of a function is very relevant to its ability to be approximated by shallow neural networks.</p>
<ul>
<li>Our paper contributes to the question posed by <a href="https://arxiv.org/abs/1904.06984" target="_blank">SES19</a> (Are there any 1-Lipschitz functions that cannot be approximated efficiently by depth-2 but can by depth-3?) by showing that <em>all</em> 1-Lipschitz functions are approximable with respect to the \(L_2\) measure.</li>
<li>In addition, our results build on those of a recent paper by <a href="https://arxiv.org/abs/2102.00434" target="_blank">Malach, Yehudai, Shalev-Shwartz, and Shamir (2021)</a>, which suggests that the only functions that can be efficiently <em>learned</em> via gradient descent by deep networks are those that can be efficiently <em>approximated</em> by a shallow network. They show that the inefficient approximation of a function by depth-3 neural networks implies inefficient learning by neural networks of any depth; our results strengthen this to “inefficient approximation of a function by depth-<strong>2</strong> neural networks.”</li>
</ul>
<p>Thank you so much for reading this blog post! I’d love to hear about any thoughts or questions you may have. And if you’d like to learn more, check out <a href="http://proceedings.mlr.press/v134/hsu21a.html" target="_blank">the paper</a> or <a href="http://www.learningtheory.org/colt2021/virtual/poster_1178.html" target="_blank">the talks</a>!</p>
<p><em>Clayton Sanford</em></p>
<p><em>In the past few weeks, I’ve written several summaries of others’ work on machine learning theory. For the first time on this blog, I’ll discuss a paper I wrote, which was a collaboration with my advisors, Rocco Servedio and Daniel Hsu, and another Columbia PhD student, Manolis Vlatakis-Gkaragkounis. It will be presented this week at COLT (Conference on Learning Theory) 2021, which is happening in-person in Boulder, Colorado. I’ll be there to discuss the paper and learn more about other work in ML theory. (Hopefully, I’ll put up another blog post afterward about what I learned from my first conference.)</em></p>
<h1 id="opml5-bl20-failures-of-model-dependent-generalization-bounds-for-least-norm-interpolation">[OPML#5] BL20: Failures of model-dependent generalization bounds for least-norm interpolation</h1>
<p><em>2021-07-30. <a href="http://blog.claytonsanford.com/2021/07/30/bl20" target="_blank">http://blog.claytonsanford.com/2021/07/30/bl20</a></em></p>
<p><em>This is the fifth of a <a href="/2021/07/04/candidacy-overview.html" target="_blank">sequence of blog posts</a> that summarize papers about over-parameterized ML models.</em></p>
<!-- [BL20](https://arxiv.org/abs/2010.08479){:target="_blank"} [[OPML#5]](/2021/07/30/bl20.html){:target="_blank"} -->
<p>I really enjoyed reading this paper, <a href="https://arxiv.org/abs/2010.08479" target="_blank">“Failures of model-dependent generalization bounds for least-norm interpolation,”</a> by Bartlett and Long. (The names are familiar from <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a>.)
It follows in the vein of papers like <a href="https://arxiv.org/abs/1611.03530" target="_blank">ZBHRV17</a> and <a href="https://arxiv.org/abs/1902.04742" target="_blank">NK19</a>, which demonstrate the limitations of classical generalization bounds.</p>
<p>This work differs from the double-descent papers that have been previously reviewed on this blog, like <a href="https://arxiv.org/abs/1903.07571" target="_blank">BHX19</a> <a href="/2021/07/05/bhx19.html" target="_blank">[OPML#1]</a>, <a href="https://arxiv.org/abs/1906.11300" target="_blank">BLLT19</a> <a href="/2021/07/11/bllt19.html" target="_blank">[OPML#2]</a>, <a href="https://ieeexplore.ieee.org/document/9051968" target="_blank">MVSS19</a> <a href="/2021/07/16/mvss19.html" target="_blank">[OPML#3]</a>, and <a href="https://arxiv.org/abs/1903.08560" target="_blank">HMRT19</a> <a href="/2021/07/23/hmrt19.html" target="_blank">[OPML#4]</a>.
These papers argue that there exist better bounds on generalization error for over-parameterized linear regression than the ones typically suggested by classical approaches like VC-dimension and Rademacher complexity.
However, they don’t <em>prove</em> that there cannot be better “classical” generalization bounds; they just show that the well-known bounds are inferior to their proposed bounds.
On the other hand, this paper proves that a broad family of traditional generalization bounds are unable to explain the phenomenon of the success of interpolating methods.</p>
<p>The gist of the argument is that it’s not sufficient to look at the number of samples and the complexity of the hypothesis to explain the success of interpolating models.
Successful bounds must take into account more information about the data distribution.
Notably, the bounds in BHX19, BLLT19, MVSS19, and HMRT19 all rely on properties of the data distribution, like the eigenvalues of the covariance matrix and the amount of additive noise in each label.
The current paper (BL20) posits that such tight bounds are impossible without access to this kind of information.</p>
<p>In this post, I present the main theorem and give a very hazy idea about why it works.
Let’s first make the learning problem precise.</p>
<h2 id="learning-problem">Learning problem</h2>
<ul>
<li>We have labeled data \((x, y) \in \mathbb{R}^d \times \mathbb{R}\) drawn from some distribution \(P\).
<ul>
<li>They restrict \(P\) to give it nice mathematical properties. Specifically, the inputs \(x \in \mathbb{R}^d\) must be drawn from a Gaussian distribution and \((x, y)\) must have subgaussian tails. We’ll call these “nice” distributions.</li>
</ul>
</li>
<li>Let the <em>risk</em> of some prediction rule \(h: \mathbb{R}^d \to \mathbb{R}\) be \(R_P(h) = \mathbb{E}_{x, y}[(y - h(x))^2]\).</li>
<li>Let \(R_P^*\) be the best risk over all \(h\).</li>
<li>The goal is to consider bounds on \(R_P(h) - R_P^*\), where \(h\) is a <em>least-norm interpolating</em> learning rule on \(n\) training samples.
<ul>
<li>i.e. \(h(x) = \langle x, \theta\rangle\) where \(\theta \in \mathbb{R}^d\) minimizes the least-squares error: \(\sum_{i=1}^n(\langle x_i, \theta\rangle - y_i)^2\). Ties are broken by choosing the \(\theta\) that minimizes \(\|\theta\|_2\). The interpolation regime occurs when the least-squares error is zero.</li>
</ul>
</li>
<li>We consider bounds \(\epsilon(h, n, \delta)\) such that \(R_P(h) - R_P^{*} \leq \epsilon(h, n, \delta)\) with probability \(1 - \delta\) over the \(n\) training samples from \(P\), where \(h\) is the least-norm interpolant of those samples.
<ul>
<li>Notably, these bounds cannot include any more information about the learning problem; these must hold for any distribution \(P\).</li>
<li>For the theorem to work, they restrict themselves to bounds that are <em>bounded antimonotonic</em>, which means that they cannot suddenly become much worse as the number of samples increases. (e.g. \(\epsilon(h, 2n, \delta)\) cannot be much larger than \(\epsilon(h, n, \delta)\).)</li>
</ul>
</li>
</ul>
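<p>To make the learning rule concrete, here’s a small numpy sketch (my own illustration, not code from the paper). In the over-parameterized regime, <code>np.linalg.pinv</code> returns exactly the minimum-\(\ell_2\)-norm solution among all least-squares minimizers:</p>

```python
import numpy as np

def least_norm_interpolator(X, y):
    """Among all theta minimizing sum_i (<x_i, theta> - y_i)^2,
    return the one with the smallest ||theta||_2."""
    # The Moore-Penrose pseudoinverse gives the minimum-norm
    # least-squares solution directly.
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n, d = 10, 100  # over-parameterized: d >> n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta = least_norm_interpolator(X, y)
```

<p>With \(d > n\) and \(X\) of full row rank, the least-squares error is exactly zero, so \(h(x) = \langle x, \theta\rangle\) interpolates the training data.</p>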
<h2 id="the-result">The result</h2>
<p>Now, I give a rather hand-wavy paraphrase of the theorem:</p>
<p><em><strong>Theorem 1:</strong> Suppose \(\epsilon\) is a bound that depends on \(h\), \(n\), and \(\delta\) that applies to all nice distributions \(P\).
Then, for a “very large fraction” of values of \(n\) as \(n\) grows, there exists a distribution \(P_n\) such that</em></p>
\[\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \leq O(1 / \sqrt{n})] \geq 1 - \delta\]
<p><em>but</em></p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq \Omega(1)] \geq \frac{1}{2},\]
<p><em>where \(h\) is the least-norm interpolant of a set of \(n\) points drawn from \(P_n\). The probabilities above refer to randomness from the training sample drawn from \(P_n\).</em></p>
<p>Let’s break this down and talk about what it means.</p>
<p>The generalization bound \(\epsilon\) can depend on the minimum-norm interpolating prediction rule \(h\), the number of samples \(n\), and the confidence parameter \(\delta\).
It <em>cannot</em> depend on the distribution over samples \(P\), and it must apply to all such “nice” distributions.
This opens up the possibility that a satisfactory bound \(\epsilon\) could perform much better on some distributions than others.</p>
<ul>
<li>
<p>This result particularly applies to generalization bounds that make use of some property of the prediction rule \(h\). For instance, it demonstrates the limitations of <a href="https://ieeexplore.ieee.org/document/661502" target="_blank">this 1998 Bartlett paper</a>, which gives generalization bounds that are small when the parameters of \(h\) have small norms.</p>
</li>
<li>
<p>Note that this isn’t really talking about “traditional” capacity-based generalization bounds, like those that rely on <a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension" target="_blank">VC-dimension</a>. These capacity-based bounds are applied to the <em>hypothesis class</em> \(\mathcal{H}\) that contains \(h\), rather than the prediction rule \(h\) itself.</p>
<p>These kinds of bounds are already overly pessimistic in the over-parameterized regime, however. Measurements of the capacity of \(\mathcal{H}\)—like the VC-dimension and the <a href="https://en.wikipedia.org/wiki/Rademacher_complexity" target="_blank">Rademacher complexity</a>—will always lead to vacuous generalization bounds for interpolating classifiers because those bounds rely on limiting the expressive power of hypotheses in \(\mathcal{H}\). From the lens of capacity-based generalization approaches, overfitting is <em>always</em> bad, which makes a nontrivial analysis of interpolation methods impossible with these tools.</p>
</li>
</ul>
<p>\(\epsilon\) does indeed perform much better on some distributions than others. The meat and potatoes of the proof shows the existence of some nice distribution where a bound \(\epsilon\) necessarily underperforms, even though the minimum-norm interpolating solution actually has a small generalization error.</p>
<p><img src="/assets/images/2021-07-30-bl20/bound.jpeg" alt="" /></p>
<ul>
<li>
<p>The first inequality in the theorem demonstrates how the minimum-norm interpolating classifier does well.
This is represented by the true generalization errors lying below the green dashed line, which corresponds to the bound in the first inequality.
As \(n\) grows, the true generalization error approaches zero with high probability.</p>
</li>
<li>
<p>On the other hand, the underperformance is illustrated by the second inequality, which shows that the bound \(\epsilon\) often cannot guarantee that the generalization error is smaller than some constant as \(n\) becomes large.
As visualized above, the bound \(\epsilon\) (represented by red dots with a red line corresponding to the expected value of \(\epsilon\)) will most of the time (but not always) lie above the constant curve denoted by the dashed red line.
This isn’t great, because we should expect an abundance of training samples \(n\) to translate to an error bound that approaches zero as \(n\) approaches infinity.</p>
</li>
</ul>
<p>So far, nothing has been said about the dimension of the inputs, \(d\).
The authors define \(d\) within the context of the distributions \(P_n\) as roughly \(n^2\). Thus, \(d \gg n\) and this problem deals squarely with the over-parameterized regime.</p>
<p>To reiterate, the key takeaway here is that the data distribution is very important for evaluating whether successful generalization occurs.
Without knowledge of the data distribution, it’s impossible to give accurate generalization bounds for the over-parameterized case (\(d \gg n\)).</p>
<h2 id="proof-ideas">Proof ideas</h2>
<p>The main strategy in this proof is to show the existence of a “good distribution” \(P_n\) and a “bad distribution” \(Q_n\) that are very similar, but where minimum-norm interpolation yields a much smaller generalization error on \(P_n\) than on \(Q_n\).
This gap forces any valid generalization error bound \(\epsilon\) to be large, despite the fact that the minimum-norm interpolator has small generalization error for \(P_n\).</p>
<p>To satisfy the similarity requirement, \(P_n\) and \(Q_n\) must be indistinguishable with respect to \(h\).
Consider full training samples of \(n\) \(d\)-dimensional inputs and labels \((X_P, Y_P), (X_Q, Y_Q) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n\) drawn from the two respective distributions.
Then, the probability that \(h\) is the minimum-norm interpolator of \((X_P, Y_P)\) must be identical to the probability that it is the minimum-norm interpolator of \((X_Q, Y_Q)\).
If this is the case, then \(\epsilon\) must be defined to ensure that each of</p>
\[\epsilon(h, n, \delta) \geq R_{P_n}(h) - R_{P_n}^* \quad \text{and} \quad \epsilon(h, n, \delta)\geq R_{Q_n}(h) - R_{Q_n}^*\]
<p>hold with probability \(1 - \delta\).
This then means that it must be the case that for any \(t \in \mathbb{R}\):</p>
\[\mathrm{Pr}_{P_n}[\epsilon(h, n, \delta) \geq t] \geq \max(\mathrm{Pr}_{P_n}[R_{P_n}(h) - R_{P_n}^* \geq t], \mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq t]).\]
<p>To prove the theorem, it suffices to show \(R_{P_n}(h) - R_{P_n}^*\) is very small and \(R_{Q_n}(h) - R_{Q_n}^*\) is large with high probability.
This forces \(\epsilon(h, n, \delta)\) to be large and \(R_{P_n}(h) - R_{P_n}^*\) to be small with high probability, which concludes the proof.</p>
<p>A key idea towards showing this gap between the generalization of \(P_n\) and \(Q_n\) is to define distributions that behave very differently in testing, despite being indistinguishable from the standpoint of training.
To implement this idea, \(Q_n\) will reuse samples in the testing phase, while \(P_n\) will not.</p>
<p>Now, we define the two distributions, with the help of a third “helper” distribution \(D_n\).</p>
<h3 id="d_n-the-skewed-gaussian-distribution">\(D_n\): The skewed Gaussian distribution</h3>
<p>We draw an input \(x_i\) from the \(d\)-dimensional Gaussian distribution \(\mathcal{N}(0, \Sigma)\) with mean zero and diagonal covariance matrix \(\Sigma\) with</p>
\[\Sigma_{j,j} = \lambda_j = \begin{cases}
\frac{1}{81} & j = 1 \\
\frac{1}{d^2} & j > 1.
\end{cases}\]
<p>When \(d\) is large, this corresponds to a distribution where the first coordinate of \(x_i\) will be very large relative to the remaining \(d - 1\) coordinates, which trend towards zero.
The label \(y_i\) is drawn by taking \(y_i = \langle x_i, \theta\rangle + \epsilon_i\), where \(\epsilon_i \sim \mathcal{N}(0, \frac{1}{81})\).
Thus, the noise is drawn at the scale of the dominant first coordinate.</p>
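<p>As a sketch (the variable names and the particular choice of \(\theta\) are my own, not the paper’s), sampling from \(D_n\) looks like:</p>

```python
import numpy as np

def sample_Dn(n, d, theta, rng):
    """Draw n samples (x, y): diagonal covariance with variance 1/81
    on coordinate 1 and 1/d^2 on the rest, plus N(0, 1/81) label noise."""
    stds = np.full(d, 1.0 / d)  # sqrt(1 / d^2)
    stds[0] = 1.0 / 9.0         # sqrt(1 / 81)
    X = rng.standard_normal((n, d)) * stds
    y = X @ theta + rng.normal(0.0, 1.0 / 9.0, size=n)
    return X, y

rng = np.random.default_rng(0)
n, d = 500, 10_000
theta_star = np.zeros(d)
theta_star[0] = 1.0  # hypothetical ground-truth parameter
X, y = sample_Dn(n, d, theta_star, rng)
```

<p>The first coordinate dominates: its standard deviation is \(\frac{1}{9}\), while every other coordinate’s is only \(\frac{1}{d}\).</p>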
<p><img src="/assets/images/2021-07-30-bl20/Dn.jpeg" alt="" /></p>
<p>We use this skewed distribution because it works beautifully with the bounds in the minimum-norm interpolant that are laid out in BLLT19.
Using notation from <a href="/2021/07/11/bllt19.html" target="_blank">my blog post on BLLT19</a>, we can characterize the effective dimensions \(r_k(\Sigma)\) and \(R_k(\Sigma)\), which yield clean risk bounds.</p>
\[r_k(\Sigma) = \frac{\sum_{j > k} \lambda_j}{\lambda_{k+1}} =
\begin{cases}
\frac{\frac{1}{81} + \frac{d-1}{d^2}}{\frac{1}{81}} = \Theta(1) & k = 0 \\
\frac{\frac{d-k}{d^2}}{\frac{1}{d^2}} = d-k & k > 0.
\end{cases}\]
\[R_k(\Sigma) = \frac{\left(\sum_{j > k} \lambda_j\right)^2}{\sum_{j > k} \lambda_j^2} =
\begin{cases}
\frac{\left(\frac{1}{81} + \frac{d-1}{d^2}\right)^2}{\frac{1}{81^2} + \frac{d-1}{d^4}} = \Theta(1) & k = 0 \\
\frac{\left(\frac{d-k}{d^2}\right)^2}{\frac{d-k}{d^4}} = d-k & k > 0.
\end{cases}\]
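<p>These closed forms are easy to sanity-check numerically. Here’s a hypothetical helper (my own naming, following BLLT19’s definitions) that computes \(r_k(\Sigma)\) and \(R_k(\Sigma)\) directly from the eigenvalues:</p>

```python
import numpy as np

def effective_ranks(lams, k):
    """r_k and R_k from BLLT19, for eigenvalues lams sorted descending.
    tail holds lambda_j for j > k (1-indexed), so lambda_{k+1} = tail[0]."""
    tail = lams[k:]
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)
    return r_k, R_k

d = 1000
lams = np.full(d, 1.0 / d ** 2)  # lambda_j = 1/d^2 for j > 1
lams[0] = 1.0 / 81.0             # lambda_1 = 1/81
r0, R0 = effective_ranks(lams, 0)
r1, R1 = effective_ranks(lams, 1)
```

<p>For \(k = 1\), both come out to exactly \(d - 1\), and for \(k = 0\) both are \(\Theta(1)\), matching the formulas above.</p>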
<p>By taking \(k^* = 1\) and applying the bound, then with high probability:</p>
\[R(\hat{\theta}) = O\left(\|\theta^*\|^2 \lambda_1\left( \sqrt{\frac{r_0(\Sigma)}{n}} + \frac{r_0(\Sigma)}{n}\right) + \sigma^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \right)\]
\[= O\left(\|\theta^*\|^2\left(\frac{1}{\sqrt{n}} + \frac{1}{n}\right) + \frac{1}{81}\left(\frac{1}{n} + \frac{n}{d-1}\right)\right).\]
<p>If we take \(d = n^2\), then this term trends towards zero at a rate of \(\frac{1}{\sqrt{n}}\) as \(n\) approaches infinity, which gives exactly the kind of bound we’re looking for with \(P_n\).
(Note: \(d\) does not exactly equal \(n^2\) in the paper; there are a few more technicalities here that we’re glossing over.)</p>
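<p>We can also check this behavior empirically. The sketch below (my own simulation, not from the paper) fits the minimum-norm interpolator on samples from \(D_n\) with \(d = n^2\) and computes the excess risk \((\hat{\theta} - \theta^*)^\top \Sigma (\hat{\theta} - \theta^*)\) exactly, using the fact that \(\Sigma\) is diagonal:</p>

```python
import numpy as np

def excess_risk(theta_hat, theta_star, lams):
    """Exact excess risk for a diagonal covariance with entries lams."""
    diff = theta_hat - theta_star
    return float(np.sum(lams * diff ** 2))

rng = np.random.default_rng(1)
risks = []
for n in [50, 150]:
    d = n * n  # the d = n^2 scaling from the paper
    lams = np.full(d, 1.0 / d ** 2)
    lams[0] = 1.0 / 81.0
    theta_star = np.zeros(d)
    theta_star[0] = 1.0  # hypothetical ground truth
    X = rng.standard_normal((n, d)) * np.sqrt(lams)
    y = X @ theta_star + rng.normal(0.0, 1.0 / 9.0, size=n)
    theta_hat = np.linalg.pinv(X) @ y  # minimum-norm interpolator
    risks.append(excess_risk(theta_hat, theta_star, lams))
```

<p>On runs like this, the excess risk stays small, well below the \(\lambda_1 \|\theta^*\|^2 = \frac{1}{81}\) baseline of predicting zero, consistent with the \(O(1/\sqrt{n})\) rate.</p>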
<p>This gives us an example where minimum-norm interpolation does fantastically. However, it does not show why the generalization bound \(\epsilon(h, n, \delta)\) cannot be tight.
To do so, we define the actual two distributions we care about—\(Q_n\) and \(P_n\)—in terms of \(D_n\).</p>
<h3 id="q_n-poor-interpolation-from-sample-reuse">\(Q_n\): Poor interpolation from sample reuse</h3>
<p>The first confusing thing about \(Q_n\) is that it’s a random distribution.
That is, we can think of \(Q_n\) being drawn from a distribution over distributions \(\mathcal{Q}_n\), since it depends on a random sample from \(D_n\).</p>
<p>To define \(Q_n\), draw \(m = \Theta(n)\) independent samples \((x_i, y_i)_{i \in [m]}\) from \(D_n\).
\(Q_n\) will be supported on these \(m\) samples.</p>
<p><img src="/assets/images/2021-07-30-bl20/Qn1.jpeg" alt="" /></p>
<p>After fixing these samples, we can draw \((x, y)\) from \(Q_n\) by first uniformly selecting \(x\) from \(\{x_1, \dots, x_m\}\), the set of pre-selected points.
Then, we choose \(y\) using the same approach that we did for \(D_n\): \(y = \langle x, \theta\rangle + \epsilon\) for \(\epsilon \sim \mathcal{N}(0, \frac{1}{81})\).</p>
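<p>A sketch of drawing a training set from one realization of \(Q_n\) (again, my own illustration with hypothetical names):</p>

```python
import numpy as np

def sample_Qn_training_set(n, m, d, theta, rng):
    """First fix Q_n's support: m inputs drawn from the skewed Gaussian.
    Then each training point picks one of those inputs uniformly at random
    (so repeats are likely) and gets a freshly drawn noisy label."""
    stds = np.full(d, 1.0 / d)
    stds[0] = 1.0 / 9.0
    support = rng.standard_normal((m, d)) * stds  # the fixed support of Q_n
    idx = rng.integers(0, m, size=n)              # uniform, with replacement
    X = support[idx]
    y = X @ theta + rng.normal(0.0, 1.0 / 9.0, size=n)  # fresh label noise
    return X, y, idx

rng = np.random.default_rng(0)
n, d = 100, 10_000
m = 2 * n  # m = Theta(n)
theta_star = np.zeros(d)
theta_star[0] = 1.0
X, y, idx = sample_Qn_training_set(n, m, d, theta_star, rng)
```

<p>A fresh test point from this same \(Q_n\) reuses the same <code>support</code> rows with new label noise, which is exactly what hurts interpolation: the interpolant has already committed to the old noisy labels on those inputs.</p>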
<p><img src="/assets/images/2021-07-30-bl20/Qn2.jpeg" alt="" /></p>
<p>What this means is that the training inputs \(x_i\) for \(i \in [n]\) will exactly reoccur in the expected risk, albeit with different labels \(y_i\).
This differs greatly from \(D_n\), where the continuity of the distribution over \(x_i\)’s ensures that the same exact sample would never realistically be chosen in “testing.”</p>
<p>The crux of the argument that \(Q_n\) is “bad” comes from Lemma 5, which suggests that least-norm interpolation will perform poorly on inputs \(x_i\) that show up exactly once in the training set.
When these are drawn again when computing the expected risk (with new labels), they’ll have substantially higher error than would a random input from \(D_n\).
This allows the authors to show that—for a proper choice of \(m\)—</p>
\[\mathrm{Pr}_{Q_n}[R_{Q_n}(h) - R_{Q_n}^* \geq \Omega(1)] \geq \frac{1}{2}.\]
<p>Now, it only remains to show that \(Q_n\) is indistinguishable in the training phase from a “good” distribution that has low risk for least-norm interpolation.</p>
<p>\(D_n\) is good, but \(Q_n\) unfortunately cannot be contrasted to \(D_n\) in this manner.
Because \(D_n\) never repeats training samples, the two have somewhat different distributions over interpolators \(h\).
Instead, we define \(P_n\) in a slightly different way to have the nice interpolation properties of \(D_n\), while being identical to \(Q_n\) in the training phase.</p>
<h3 id="p_n-d_n-but-with-extra-samples">\(P_n\): \(D_n\) but with extra samples</h3>
<p>The idea is that \(P_n\) draws inputs \(x_i\) from \(D_n\), but occasionally draws the same input more than once and averages the corresponding labels \(y_i\) together to produce a new label.</p>
<p>This provides indistinguishability from \(Q_n\) in the training phase.
Both draw a collection of samples—with some of them appearing multiple times in the training set—and both minimum-norm interpolators will take these properties into account.
This indistinguishability is proved in Lemma 7 and relies on careful choices of the number of original samples \(m\) for \(Q_n\) and the number of repeated samples in \(P_n\). This idea is combined with Lemma 5 (which shows that \(Q_n\) has poor minimum-norm interpolation behavior) to show that \(\epsilon(h, n, \delta)\) cannot be small.</p>
<p>However, \(P_n\) is <em>not</em> a random distribution and it will <em>not</em> carry that repetition over to the “evaluation phase.”
The distribution used to evaluate risk—like \(D_n\) and unlike \(Q_n\)—will not contain any of the same \(x_i\)’s that were used in the training phase.
This causes the interpolation guarantees to be roughly the same as \(D_n\).
This gives the gap we’re looking for, which is formalized in Lemma 10.</p>
<p>Put together with Lemma 5, this gives the bound we’re looking for and concludes the story that the success (or lack thereof) of minimum-norm interpolation can only be understood by considering the data distribution, and <em>not</em> just the number of samples \(n\) and properties of the interpolants \(h\).</p>
<p><em>Thanks for reading the post! As always, I’d love to hear any thoughts and feedback. Writing these is very instructive for me to make sure I actually understand the ideas in these papers, and I hope they provide some value to you too.</em></p>Clayton SanfordThis is the fifth of a sequence of blog posts that summarize papers about over-parameterized ML models.