# A Neural Scaling Law from the Dimension of the Data Manifold

```bibtex
@article{Sharma2020ANS,
  title   = {A Neural Scaling Law from the Dimension of the Data Manifold},
  author  = {Utkarsh Sharma and Jared Kaplan},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2004.10802}
}
```

When data is plentiful, the loss achieved by well-trained neural networks scales as a power law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts scaling exponents $\alpha \approx 4/d$ for…
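The claimed relation $\alpha \approx 4/d$ can be illustrated with a toy fit: generate losses obeying $L = C\,N^{-\alpha}$, fit the exponent on log-log axes, and invert the relation to obtain an implied manifold dimension $d \approx 4/\alpha$. The constants and parameter counts below are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative data: losses following L = C * N^(-alpha) with alpha = 4/d, d = 8.
d = 8
alpha_true = 4 / d                      # the paper's predicted exponent
N = np.logspace(5, 9, 20)               # parameter counts, 1e5 .. 1e9
L = 2.0 * N ** (-alpha_true)            # noiseless power-law losses

# Fit log L = log C - alpha * log N by least squares.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope
d_implied = 4 / alpha_fit               # invert alpha ~ 4/d

print(f"fitted alpha = {alpha_fit:.3f}, implied dimension d = {d_implied:.1f}")
```

On real measurements the fit would be run over losses from a family of trained models; here the inversion recovers $d = 8$ exactly because the data are noiseless by construction.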

#### 12 Citations

Explaining Neural Scaling Laws

- Computer Science, Physics
- ArXiv
- 2021

This work identifies variance-limited and resolution-limited scaling behavior for both dataset and model size, yielding four related scaling regimes with respect to the number of model parameters P and the dataset size D.

Scaling Laws for Autoregressive Generative Modeling

- Computer Science
- ArXiv
- 2020

Empirical scaling laws for the cross-entropy loss are identified, strengthening the case that scaling laws have important implications for neural network performance, including on downstream tasks.

Scaling Laws for Transfer

- Computer Science
- ArXiv
- 2021

This work finds that pre-training effectively multiplies the fine-tuning dataset size, and argues that the exponents in these power laws correspond to measures of the generality of a model and the proximity of distributions (in a directed rather than symmetric sense).

Learning Curve Theory

- Computer Science, Mathematics
- ArXiv
- 2021

This work develops and theoretically analyses the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta > 0$, and investigates whether power laws are universal or depend on the data distribution.

A Scaling Law for Synthetic-to-Real Transfer: A Measure of Pre-Training

- Computer Science
- ArXiv
- 2021

A simple and general scaling law is observed that consistently describes learning curves in various tasks, models, and complexities of synthesized pre-training data.

Distributional Generalization: A New Kind of Generalization

- Computer Science, Mathematics
- ArXiv
- 2020

We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to…

Topological Obstructions to Autoencoding

- Computer Science, Physics
- ArXiv
- 2021

The analysis is grounded in the discussion of a mock “bump hunt” in which the autoencoder fails to identify an anomalous “signal” for reasons tied to the intrinsic topology of n-particle phase space.

Limits to Depth Efficiencies of Self-Attention

- Computer Science, Mathematics
- NeurIPS
- 2020

By identifying network width as a limiting factor, the analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.

MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra

- Biology
- 2021

The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and to generate candidate structures that are very close or identical to those of the ‘true’ molecules, directly opens up the field of de novo small-molecule structure prediction from experimental mass spectra.

Towards Continual Reinforcement Learning: A Review and Perspectives

- Computer Science
- ArXiv
- 2020

A taxonomy of different continual RL formulations is provided and the non-stationary dynamics of each setting are mathematically characterized, along with an overview of benchmarks used in the literature and important metrics for understanding agent performance.

#### References

Showing 1–10 of 30 references

Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

- Mathematics, Physics
- ArXiv
- 2019

The results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data is defined in terms of how the distance between nearest data points depends on $n$.

Intrinsic dimension of data representations in deep neural networks

- Computer Science, Mathematics
- NeurIPS
- 2019

The intrinsic dimensionality (ID) of data representations is studied, i.e. the minimal number of parameters needed to describe a representation, and it is found that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer.

A Constructive Prediction of the Generalization Error Across Scales

- Computer Science, Mathematics
- ICLR
- 2020

This work presents a functional form which approximates well the generalization error in practice, and shows that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data.

Scaling Laws for Neural Language Models

- Computer Science, Mathematics
- ArXiv
- 2020

Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence. Expand

Measuring the Intrinsic Dimension of Objective Landscapes

- Computer Science, Mathematics
- ICLR
- 2018

Intrinsic dimension allows quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning; it is concluded that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and that playing Atari Pong from pixels is about as hard as classifying CIFAR-10.
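The measurement behind this comparison restricts optimization to a random $d$-dimensional affine subspace of parameter space, $\theta = \theta_0 + Pz$, and asks how small $d$ can be while still solving the task. A minimal sketch on a toy quadratic objective (the objective, sizes, and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 100, 10                       # full and subspace dimensions
theta0 = rng.normal(size=D)          # random initial parameters (frozen)
P = rng.normal(size=(D, d))          # fixed random projection
P /= np.linalg.norm(P, axis=0)       # normalize its columns

target = np.zeros(D)                 # toy objective: ||theta - target||^2

def loss(z):
    theta = theta0 + P @ z           # only z (d numbers) is trained
    return np.sum((theta - target) ** 2)

# Plain gradient descent in the subspace; grad_z = 2 P^T (theta - target).
z = np.zeros(d)
for _ in range(200):
    theta = theta0 + P @ z
    z -= 0.05 * (2 * P.T @ (theta - target))

print(f"loss after subspace training: {loss(z):.4f}")
```

With $d < D$ and a generic target, the loss decreases but plateaus above zero: the subspace can only remove the component of the error lying in the range of $P$, which mirrors the paper's notion that a task becomes solvable only once $d$ exceeds its intrinsic dimension.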

Learning Multiple Layers of Features from Tiny Images

- Computer Science
- 2009

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Estimating the intrinsic dimension of datasets by a minimal neighborhood information

- Computer Science, Mathematics
- Scientific Reports
- 2017

A new ID estimator is proposed that uses only the distances to the first and second nearest neighbors of each point in the sample, which reduces the effects of curvature and density variation, as well as the computational cost.
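The estimator described here (the "TwoNN" method) uses only the ratio $\mu_i = r_2/r_1$ of each point's second- to first-nearest-neighbor distance; under mild assumptions $\mu$ follows a Pareto law with exponent $d$, giving the maximum-likelihood estimate $\hat d = N / \sum_i \log \mu_i$. A minimal numpy sketch on data of known intrinsic dimension (the test data and helper name are illustrative):

```python
import numpy as np

def twonn_dimension(X):
    """TwoNN intrinsic-dimension estimate from each point's two nearest neighbors."""
    sq = (X ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(dist2, np.inf)                     # exclude self-distances
    dist2.sort(axis=1)
    mu = np.sqrt(dist2[:, 1] / dist2[:, 0])             # ratio r2 / r1 per point
    return len(X) / np.sum(np.log(mu))                  # Pareto-exponent MLE

rng = np.random.default_rng(0)
# Points on a 3-dimensional manifold embedded linearly in 10 ambient dimensions.
Z = rng.normal(size=(1000, 3))
X = Z @ rng.normal(size=(3, 10))
print(f"estimated intrinsic dimension: {twonn_dimension(X):.2f}")
```

The estimate comes out near 3 rather than the ambient 10, since only local neighbor distances enter; the published method additionally discards the largest $\mu_i$ values for robustness, which this sketch omits.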

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
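Adam maintains exponential moving averages of the gradient (first moment) and the squared gradient (second moment), with a bias correction for their zero initialization. A minimal sketch on a 1-D quadratic, using the default hyperparameters from the paper (the step size and objective are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns (new_theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t)

print(f"theta after 500 Adam steps: {theta:.3f}")
```

Note that near the optimum the ratio $\hat m/\sqrt{\hat v}$ behaves like a sign, so with a fixed step size the iterate oscillates in a small band around the minimizer rather than converging exactly.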

Efficient Representation of Low-Dimensional Manifolds using Deep Networks

- Computer Science, Mathematics
- ICLR
- 2017

It is shown that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

- Computer Science, Mathematics
- ICML
- 2019

A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and the effectiveness of this method is demonstrated by scaling up MobileNets and ResNet.