Video: Accelerate Transformer inference with Optimum and ONNX

In this video, I show you how to accelerate Transformer inference with ONNX and Optimum, an open-source library by Hugging Face.

I start from a DistilBERT model fine-tuned for text classification, export it to ONNX format, optimize it, and finally quantize it. In benchmarks on an AWS c6i instance (Intel Ice Lake architecture), the resulting model runs more than 2.5x faster than the original and is 50% smaller, with just a few lines of simple Python code and without any accuracy drop!
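To give a feel for what "a few lines of Python" can look like, here is a minimal sketch of the three steps (export, optimize, quantize) using the Optimum ONNX Runtime classes. It is not the exact code from the video: the function names, the `model_id` and `out_dir` parameters, the optimization level, and the choice of dynamic quantization with the AVX-512 VNNI preset (picked here because Ice Lake supports it) are all assumptions, and the Optimum API has changed between releases, so check the calls against the version you have installed. The small timing helper is likewise a hypothetical stand-in for a real benchmark.

```python
import statistics
import time
from pathlib import Path


def export_optimize_quantize(model_id: str, out_dir: str) -> Path:
    """Export a Hub model to ONNX, then optimize and dynamically quantize it.

    Sketch only: assumes a recent Optimum release with ONNX Runtime support.
    """
    # Imported lazily so the timing helper below works without Optimum installed.
    from optimum.onnxruntime import (
        ORTModelForSequenceClassification,
        ORTOptimizer,
        ORTQuantizer,
    )
    from optimum.onnxruntime.configuration import (
        AutoQuantizationConfig,
        OptimizationConfig,
    )

    out = Path(out_dir)

    # 1. Export the fine-tuned PyTorch checkpoint to ONNX and save it.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    model.save_pretrained(out)

    # 2. Apply ONNX Runtime graph optimizations (level 99 is an assumed setting).
    optimizer = ORTOptimizer.from_pretrained(model)
    optimizer.optimize(
        save_dir=out,
        optimization_config=OptimizationConfig(optimization_level=99),
    )

    # 3. Dynamically quantize the optimized graph to 8-bit integers.
    #    "model_optimized.onnx" is the optimizer's default output name.
    quantizer = ORTQuantizer.from_pretrained(out, file_name="model_optimized.onnx")
    qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    quantizer.quantize(save_dir=out, quantization_config=qconfig)
    return out


def median_latency_ms(fn, runs: int = 100, warmup: int = 10) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

For instance, calling `export_optimize_quantize("distilbert-base-uncased-finetuned-sst-2-english", "onnx_model")` would run the pipeline on a public DistilBERT sentiment checkpoint (a stand-in, not necessarily the model used in the video). Passing the same prediction call for the original and the quantized model to `median_latency_ms` then gives a rough speedup figure for your own hardware.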



Chief Evangelist, Hugging Face
