Video: SLM inference on AWS Graviton4
CPU inference? Hell yes.

In this episode, Lorenzo Winfrey, Jeff Underhill, and I discuss why there’s hope beyond huge closed models and expensive GPU instances. AWS Graviton4 packs a punch and may well be the most cost-effective platform for SLM inference. To prove the point, I show how to quantize and run our Llama-3.1-SuperNova-Lite model on a small Graviton4 instance. You won’t believe the text generation speed 😃
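The exact commands are in the video, but a typical flow for quantizing and running an SLM on an Arm CPU instance uses llama.cpp. The sketch below is an assumption about the setup (model name, quantization level, thread count, and prompt are illustrative, not taken from the episode):

```shell
# Build llama.cpp from source; its Arm NEON/i8mm kernels are picked up
# automatically on Graviton4 (Neoverse V2) at compile time.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"

# Convert the Hugging Face checkpoint to GGUF (FP16), then quantize.
# Q4_0 is one common choice for fast CPU inference; other levels
# (Q5_K_M, Q8_0, ...) trade speed for quality.
python convert_hf_to_gguf.py /path/to/Llama-3.1-SuperNova-Lite \
  --outfile supernova-lite-f16.gguf
./build/bin/llama-quantize supernova-lite-f16.gguf \
  supernova-lite-q4_0.gguf Q4_0

# Run inference, using all available vCPUs on the instance.
./build/bin/llama-cli -m supernova-lite-q4_0.gguf \
  -t "$(nproc)" -n 128 \
  -p "Explain why CPU inference can be cost-effective for small models."
```

The same binary prints tokens-per-second stats at the end of a run, which is an easy way to compare quantization levels and instance sizes.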