Large Language Models: A New Moore’s Law?

This is your Brain on Deep Learning

Researchers estimate that the human brain contains an average of 86 billion neurons and 100 trillion synapses. It’s safe to assume that not all of them are dedicated to language either. Interestingly, GPT-4 is expected to have about 100 trillion parameters… As crude as this analogy is, shouldn’t we wonder whether building language models that are about the size of the human brain is the best long-term approach?

Deep Learning, Deep Pockets?

As you would expect, training a 530-billion parameter model on humongous text datasets requires a fair bit of infrastructure. In fact, Microsoft and NVIDIA used hundreds of DGX A100 multi-GPU servers. At $199,000 a piece, and factoring in networking equipment, hosting costs, etc., anyone looking to replicate this experiment would have to spend close to $100 million dollars. Want fries with that?

That Warm Feeling is your GPU Cluster

For all its engineering brilliance, training Deep Learning models on GPUs is a brute force technique. According to the spec sheet, each DGX server can consume up to 6.5 kilowatts. Of course, you’ll need at least as much cooling power in your datacenter (or your server closet). Unless you’re the Starks and need to keep Winterfell warm in winter, that’s another problem you’ll have to deal with.

So?

Am I excited by Megatron-Turing NLG 530B and whatever beast is coming next? No. Do I think that the (relatively small) benchmark improvement is worth the added cost, complexity and carbon footprint? No. Do I think that building and promoting these huge models is helping organizations understand and adopt Machine Learning ? No.

Use Pretrained Models

In the vast majority of cases, you won’t need a custom model architecture. Maybe you’ll want a custom one (which is a different thing), but there be dragons. Experts only!

Use Smaller Models

When evaluating models, you should pick the smallest one that can deliver the accuracy you need. It will predict faster and require fewer hardware resources for training and inference. Frugality goes a long way.

Fine-Tune Models

If you need to specialize a model, there should be very few reasons to train it from scratch. Instead, you should fine-tune it, that is to say train it only for a few epochs on your own data. If you’re short on data, maybe of one these datasets can get you started.

  • Less data to collect, store, clean and annotate,
  • Faster experiments and iterations,
  • Fewer resources required in production.

Use Cloud-Based Infrastructure

Like them or not, cloud companies know how to build efficient infrastructure. Sustainability studies show that cloud-based infrastructure is more energy and carbon efficient than the alternative: see AWS, Azure, and Google. Earth.org says that while cloud infrastructure is not perfect, “[it’s] more energy efficient than the alternative and facilitates environmentally beneficial services and economic growth.

Optimize Your Models

From compilers to virtual machines, software engineers have long used tools that automatically optimize their code for whatever hardware they’re running on.

  • Specialized hardware that speeds up training (Graphcore, Habana) and inference (Google TPU, AWS Inferentia).
  • Pruning: remove model parameters that have little or no impact on the predicted outcome.
  • Fusion: merge model layers (say, convolution and activation).
  • Quantization: storing model parameters in smaller values (say, 8 bits instead of 32 bits)

Conclusion

Large language model size has been increasing 10x every year for the last few years. This is starting to look like another Moore’s Law.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store