In this video, you will learn how to accelerate a PyTorch training job with a cluster of Intel Sapphire Rapids servers running on AWS. We will use the Intel oneAPI Collective Communications Library (CCL) to distribute the job, and the Intel Extension for PyTorch (IPEX) library to automatically put the new Sapphire Rapids CPU instructions, such as Advanced Matrix Extensions (AMX), to work. As both libraries are already integrated with the Hugging Face Transformers library, we will be able to run our sample scripts out of the box without changing a line of code.
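
The sample scripts need no changes, but if you are curious about what the integration roughly does on your behalf, here is a minimal sketch. It is not the code used in the video: the function names are hypothetical, it assumes the job is started with an MPI launcher that sets PMI_RANK and PMI_SIZE, and it assumes MASTER_ADDR and MASTER_PORT already point at the first node.

```python
import os


def resolve_rank_and_world_size(env):
    """Return (rank, world_size) from MPI-style or torch-style variables.

    Hypothetical helper: PMI_RANK / PMI_SIZE are assumed to come from the MPI
    launcher, with RANK / WORLD_SIZE as a fallback and (0, 1) for a single process.
    """
    rank = int(env.get("PMI_RANK", env.get("RANK", 0)))
    world_size = int(env.get("PMI_SIZE", env.get("WORLD_SIZE", 1)))
    return rank, world_size


def init_ccl_process_group():
    """Join the training cluster using oneCCL as the communication backend."""
    # Imports are kept inside the function so the sketch can be read and
    # loaded on a machine without the Intel libraries installed.
    import torch.distributed as dist
    import oneccl_bindings_for_pytorch  # noqa: F401 (importing registers the "ccl" backend)

    rank, world_size = resolve_rank_and_world_size(os.environ)
    # Assumes MASTER_ADDR and MASTER_PORT are already set in the environment.
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
    return rank, world_size


def optimize_for_training(model, optimizer):
    """Let IPEX prepare the model and optimizer for bfloat16 training on CPU."""
    import torch
    import intel_extension_for_pytorch as ipex

    model.train()
    # Returns the optimized (model, optimizer) pair.
    return ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```

With the Hugging Face integration, none of this has to be written by hand, which is why the scripts run unchanged.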

--

--

In this video, I show you how to accelerate Transformer inference with Optimum, an open-source library by Hugging Face, and Better Transformer, a PyTorch extension available since PyTorch 1.12.

Using an AWS instance equipped with an NVIDIA V100 GPU, I start from a couple of models that I previously fine-tuned: a DistilBERT model for text classification and a Vision Transformer model for image classification. I first benchmark the original models, then I use Optimum and Better Transformer to optimize them with a single line of code, and I benchmark them again. This simple process delivers a 20–30% speedup with no accuracy drop!
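
Here is a minimal sketch of that benchmark, optimize, benchmark loop. It is not the exact code from the video: the helper names, warmup count and run count are my own assumptions, and I compute the speedup as the ratio of mean latencies, which may differ from how the numbers above were measured. The only library call is BetterTransformer.transform() from Optimum.

```python
import time
from statistics import mean


def to_bettertransformer(model):
    """Convert a supported Hugging Face model with a single Optimum call."""
    # Imported lazily so the timing helpers below work without Optimum installed.
    from optimum.bettertransformer import BetterTransformer

    return BetterTransformer.transform(model)


def mean_latency_ms(predict, n_warmup=10, n_runs=100):
    """Call predict() repeatedly and return its mean latency in milliseconds.

    predict is a zero-argument callable that must block until the prediction
    is ready (on a GPU, for example by copying the output back to the CPU).
    """
    for _ in range(n_warmup):
        predict()  # warmup runs are not timed
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def speedup_percent(baseline_ms, optimized_ms):
    """Speedup of the optimized model over the baseline, in percent."""
    return (baseline_ms / optimized_ms - 1.0) * 100.0
```

You would call mean_latency_ms() once with the original model, convert the model with to_bettertransformer(), call mean_latency_ms() again, and pass both results to speedup_percent().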

--

--

In this video, I show you how to accelerate Transformer inference with Inferentia, a custom chip designed by AWS.

Starting from a Hugging Face BERT model that I fine-tuned on AWS Trainium (https://youtu.be/HweP7OYNiIA), I compile it with the Neuron SDK for Inferentia. Then, using an inf1.6xlarge instance (4 Inferentia chips, 16 Neuron Cores), I show you how to use pipeline mode to predict at scale, reaching over 4,000 predictions per second at 3-millisecond latency 🤘
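
To put these two numbers in perspective, here is a quick back-of-the-envelope calculation based on Little's law (average requests in flight = throughput x latency). It is only a sketch: the function name is hypothetical, and I assume that 4,000 predictions per second is a sustained rate and that 3 milliseconds is the mean end-to-end latency.

```python
def requests_in_flight(throughput_per_s, latency_ms):
    """Little's law: average number of predictions being processed at once."""
    return throughput_per_s * latency_ms / 1000.0


# Figures quoted above, treated as a sustained rate and a mean latency.
print(requests_in_flight(4000, 3))
```

Under these assumptions, the instance handles about a dozen predictions concurrently on average.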

--

--