Video — Deep dive — Better Attention layers for Transformer models

Julien Simon
Feb 12, 2024

The self-attention mechanism is at the core of transformer models. As amazing as it is, it requires significant compute and memory bandwidth: the attention matrix grows quadratically with sequence length, leading to scalability issues as models get more complex and context lengths increase.
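To make that cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (the toy dimensions at the bottom are illustrative, not taken from any particular model). The `scores` matrix has one entry per pair of tokens, which is exactly where the quadratic compute and memory traffic come from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (toy sketch).

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (seq_len, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len, seq_len): quadratic in seq_len
    weights = softmax(scores, axis=-1)           # one row of attention weights per query token
    return weights @ v                           # (seq_len, d_head)

# Illustrative example: 8 tokens, model and head dimension 16
rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # (8, 16)
```

Multi-head attention repeats this computation across several heads in parallel, so the quadratic term is multiplied by the number of heads.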

In this video, we’ll quickly review the computation involved in the self-attention mechanism and its multi-head variant. Then, we’ll discuss newer attention implementations focused on compute and memory optimizations, namely Multi-Query Attention, Grouped-Query Attention, Sliding Window Attention, FlashAttention v1 and v2, and PagedAttention.
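As a taste of what the video covers, here is a toy sketch of the idea behind Multi-Query and Grouped-Query Attention; the function name and tensor shapes are mine, not from any library. Several query heads share a single key/value head, so the key/value projections (and the KV cache at inference time) shrink by the grouping factor.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention: several query heads share one K/V head.

    q: (n_q_heads, seq_len, d_head)
    k, v: (n_kv_heads, seq_len, d_head) -- fewer heads, hence a smaller KV cache
    """
    n_q_heads = q.shape[0]
    group_size = n_q_heads // n_kv_heads        # query heads per shared K/V head
    outputs = []
    for h in range(n_q_heads):
        kv = h // group_size                    # which K/V head this query head reads from
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                    # (n_q_heads, seq_len, d_head)

# Illustrative example: 8 query heads sharing 2 K/V heads -> 4x smaller KV cache
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 8, 64))
k = rng.standard_normal((2, 8, 64))
v = rng.standard_normal((2, 8, 64))
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 8, 64)
```

With `n_kv_heads=1` this reduces to Multi-Query Attention, and with one K/V head per query head it is standard multi-head attention; the other techniques in the list (sliding windows, FlashAttention, PagedAttention) attack the same compute and memory bottleneck from different angles, as discussed in the video.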
