Video — Deep Dive: Optimizing LLM inference

Julien Simon
Mar 11, 2024

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and often deliver latency and throughput that are incompatible with your cost-performance objectives.

In this video, we zoom in on optimizing LLM inference and study the key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.
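To give a feel for the first of these mechanisms, here is a minimal sketch of the KV-cache idea (not taken from the video) using Hugging Face transformers and GPT-2: after the prompt has been prefilled, each decoding step feeds the model only the newest token together with the cached key/value tensors, instead of re-running attention over the entire prefix. The choice of GPT-2, the prompt, and the greedy decoding loop are illustrative assumptions.

```python
# Minimal KV-cache sketch (illustrative; model and prompt are arbitrary choices).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The KV cache speeds up decoding because", return_tensors="pt").input_ids
past_key_values = None  # will hold the cached keys/values for every layer

with torch.no_grad():
    for _ in range(20):
        if past_key_values is None:
            # Prefill: run the full prompt once and populate the cache.
            outputs = model(input_ids, use_cache=True)
        else:
            # Decode: pass only the last token, reusing the cached K/V tensors.
            outputs = model(input_ids[:, -1:], use_cache=True,
                            past_key_values=past_key_values)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

The point of the sketch is that the per-step cost becomes roughly proportional to the sequence length (for the attention over cached keys) rather than to its square, which is where much of the latency reduction discussed in the video comes from.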
