Local inference shootout: Llama.cpp vs. MLX on 10B and 32B Arcee SLMs

Julien Simon
Feb 5, 2025

In this video, we run local inference on an Apple M3 MacBook with llama.cpp and MLX, two projects that optimize and accelerate small language models (SLMs) on consumer hardware such as Apple silicon. For this purpose, we use two new Arcee open-source models distilled from DeepSeek-V3: Virtuoso Lite (10B) and Virtuoso Medium v2 (32B).

First, we download the two models from the Hugging Face Hub with the Hugging Face CLI. Then, we walk through the step-by-step installation of llama.cpp and MLX. Next, we convert and quantize the models to 4-bit precision to shrink their memory footprint and speed up inference. Finally, we run inference and compare performance numbers. So, who’s fastest? Watch and find out!
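For readers who want to try this themselves, here is a minimal sketch of the commands involved, shown for the 10B model (the 32B model follows the same steps). The Hugging Face repository IDs, local directory names, and sample prompt below are assumptions for illustration; check the video and the Arcee model cards for the exact names and settings used there.

```bash
# Download the two models from the Hugging Face Hub
# (repo IDs are assumed; verify them on the Hub)
huggingface-cli download arcee-ai/Virtuoso-Lite --local-dir Virtuoso-Lite
huggingface-cli download arcee-ai/Virtuoso-Medium-v2 --local-dir Virtuoso-Medium-v2
```

One way to install the two inference stacks on an Apple-silicon Mac (the video walks through its own procedure):

```bash
# llama.cpp: build from source; Metal support is enabled by default on Apple silicon
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
pip install -r requirements.txt   # Python dependencies for the GGUF conversion script

# MLX: install the mlx-lm package, which provides the LLM tooling on top of MLX
pip install mlx-lm
```

A possible 4-bit quantization path for each stack: with llama.cpp, the Hugging Face checkpoint is first converted to GGUF and then quantized (Q4_K_M is shown here as one common 4-bit scheme); with MLX, mlx_lm.convert converts and quantizes in a single step.

```bash
# llama.cpp (run from the llama.cpp directory; the relative model path is an assumption)
python convert_hf_to_gguf.py ../Virtuoso-Lite --outfile virtuoso-lite-f16.gguf --outtype f16
./build/bin/llama-quantize virtuoso-lite-f16.gguf virtuoso-lite-q4_k_m.gguf Q4_K_M

# MLX: convert and quantize to 4-bit in one step
mlx_lm.convert --hf-path arcee-ai/Virtuoso-Lite -q --q-bits 4 --mlx-path Virtuoso-Lite-4bit
```

Finally, a minimal generation run on each stack; both print token-throughput statistics that can be compared side by side (the prompt and token budget are arbitrary):

```bash
# llama.cpp: generate 256 tokens and print the timing / tokens-per-second summary
./build/bin/llama-cli -m virtuoso-lite-q4_k_m.gguf \
  -p "Explain model distillation in one paragraph." -n 256

# MLX: generate with the quantized model; mlx_lm.generate reports generation speed
mlx_lm.generate --model Virtuoso-Lite-4bit \
  --prompt "Explain model distillation in one paragraph." --max-tokens 256
```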
