Large language models (LLMs) are an exciting area of research, but attention has largely focused on increasing model size rather than on training models more efficiently. Here, we present a striking finding: for a given compute budget, the compute-optimal models are significantly smaller than the models that currently define the state of the art. We find that Chinchilla, a 70B-parameter model trained with the same compute budget as Gopher (280B parameters) but on roughly four times more data, outperforms Gopher and other state-of-the-art LLMs across a wide range of downstream evaluation tasks. The key is to scale model size and the number of training tokens in equal proportion: for a fixed compute budget, a smaller model trained on more data performs better and gives a clearer picture of the relationship between scale, compute, and performance. We derive a simple formula for the optimal model size at a given FLOPs budget and demonstrate that it holds across five orders of magnitude of compute. Our results indicate that current large language models are significantly undertrained for their size, a consequence of scaling parameters without scaling training data, which wastes compute and skews our understanding of what LLMs of a given cost can do. By providing a more compute-efficient training recipe, we hope to accelerate progress in the field and enable more researchers to explore the possibilities of LLMs.
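
As a rough illustration of the sizing rule described above, the sketch below computes a compute-optimal parameter count and token count for a given FLOPs budget. It assumes the standard C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio implied by the Chinchilla configuration; the function name, constants, and example budget are illustrative, not the paper's fitted values.

```python
import math

# Assumed constants (illustrative, not the paper's fitted coefficients):
TOKENS_PER_PARAM = 20.0      # Chinchilla's ~1.4T tokens / 70B params ≈ 20
FLOPS_PER_PARAM_TOKEN = 6.0  # standard C ≈ 6 * N * D training-FLOPs approximation


def compute_optimal_size(flops_budget: float) -> tuple[float, float]:
    """Return a roughly compute-optimal (parameters, tokens) pair for the budget."""
    # From C = 6 * N * D and D = 20 * N it follows that N = sqrt(C / (6 * 20)).
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # A budget of ~5.8e23 FLOPs (roughly the Gopher/Chinchilla training budget)
    # recovers the Chinchilla configuration: ~70B parameters, ~1.4T tokens.
    n, d = compute_optimal_size(5.76e23)
    print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```

The equal-scaling exponents mean both parameters and tokens grow as the square root of the compute budget, which is why the solve above reduces to a single square root.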
{
"id": "22631b5b-219a-4c69-b7c0-9c80fdae9476",
"title": "Chinchilla – Training Compute-Optimal Large Language Models (2022)",
"slug": "chinchilla-training-compute-optimal-large-language-models",
"video_url": "https://www.youtube.com/watch?v=BgaDn6Bx3Yw",
"url": "https://arxiv.org/abs/2203.15556",
"resource_category": "research",
"image_url": null,
"thumbnail_url": null
}