Local inference shootout: Llama.cpp vs. MLX on 10B and 32B Arcee SLMs

Julien Simon
Feb 5, 2025

In this video, we run local inference on an Apple M3 MacBook with llama.cpp and MLX, two projects that optimize and accelerate small language models (SLMs) on consumer hardware such as Apple silicon. For this purpose, we use two new Arcee open-source models distilled from DeepSeek-V3: Virtuoso Lite (10B) and Virtuoso Medium v2 (32B).

First, we download the two models from the Hugging Face Hub with the Hugging Face CLI. Then, we walk through the step-by-step installation of llama.cpp and MLX. Next, we convert and quantize the models to 4-bit precision to shrink their memory footprint and speed up inference. Finally, we run inference and compare performance numbers. So, who’s fastest? Watch and find out!
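For readers who want to try this themselves, here is a minimal sketch of the commands involved, shown for the 10B model (the 32B model follows the same steps). The Hugging Face repository IDs, local directory names, and sample prompt below are assumptions for illustration; check the video and the Arcee model cards for the exact names and settings used there.

```bash
# Download the two models from the Hugging Face Hub
# (repo IDs are assumed; verify them on the Hub)
huggingface-cli download arcee-ai/Virtuoso-Lite --local-dir Virtuoso-Lite
huggingface-cli download arcee-ai/Virtuoso-Medium-v2 --local-dir Virtuoso-Medium-v2
```

One way to install the two inference stacks on an Apple-silicon Mac (the video walks through its own procedure):

```bash
# llama.cpp: build from source; Metal support is enabled by default on Apple silicon
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
pip install -r requirements.txt   # Python dependencies for the GGUF conversion script

# MLX: install the mlx-lm package, which provides the LLM tooling on top of MLX
pip install mlx-lm
```

A possible 4-bit quantization path for each stack: with llama.cpp, the Hugging Face checkpoint is first converted to GGUF and then quantized (Q4_K_M is shown here as one common 4-bit scheme); with MLX, mlx_lm.convert converts and quantizes in a single step.

```bash
# llama.cpp (run from the llama.cpp directory; the relative model path is an assumption)
python convert_hf_to_gguf.py ../Virtuoso-Lite --outfile virtuoso-lite-f16.gguf --outtype f16
./build/bin/llama-quantize virtuoso-lite-f16.gguf virtuoso-lite-q4_k_m.gguf Q4_K_M

# MLX: convert and quantize to 4-bit in one step
mlx_lm.convert --hf-path arcee-ai/Virtuoso-Lite -q --q-bits 4 --mlx-path Virtuoso-Lite-4bit
```

Finally, a minimal generation run on each stack; both print token-throughput statistics that can be compared side by side (the prompt and token budget are arbitrary):

```bash
# llama.cpp: generate 256 tokens and print the timing / tokens-per-second summary
./build/bin/llama-cli -m virtuoso-lite-q4_k_m.gguf \
  -p "Explain model distillation in one paragraph." -n 256

# MLX: generate with the quantized model; mlx_lm.generate reports generation speed
mlx_lm.generate --model Virtuoso-Lite-4bit \
  --prompt "Explain model distillation in one paragraph." --max-tokens 256
```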
