llm-speed

Llama-3.1-8B-Instruct-4bit

2 workload results across 2 hardware configurations.

Fastest local config

130.2 decode tok/s

on M3 Ultra (60-core GPU) + 96GB unified, via mlx

Local runs (2 runs)

Runs submitted from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Each run is signed on the submitter's hardware.

M3 Pro (18-core GPU) + 36GB unified

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | - | 29.20 | 203.3 | 669 ms | r_h0-use1ypnb |

M3 Ultra (60-core GPU) + 96GB unified

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | - | 130.2 | 400.3 | 340 ms | r_v2pbc0rq2l4 |
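
To compare the two runs above in practical terms, the sketch below turns the reported figures into a decode speedup and a rough response-time estimate. It assumes total latency is approximately TTFT plus output tokens divided by decode tok/s, which is a simplification and not something the leaderboard states. The `Run` class and both helper functions are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One leaderboard row (hypothetical container, not a site API)."""
    hardware: str
    decode_tok_s: float   # steady-state generation rate
    prefill_tok_s: float  # prompt processing rate
    ttft_ms: float        # time to first token, in milliseconds


def decode_speedup(fast: Run, slow: Run) -> float:
    """Ratio of decode throughput between two runs."""
    return fast.decode_tok_s / slow.decode_tok_s


def estimated_latency_s(run: Run, n_output_tokens: int) -> float:
    """Rough end-to-end time, ASSUMING latency = TTFT + tokens / decode rate."""
    return run.ttft_ms / 1000.0 + n_output_tokens / run.decode_tok_s


# Values copied from the two chat-short rows above.
M3_PRO = Run("M3 Pro (18-core GPU) + 36GB unified", 29.20, 203.3, 669.0)
M3_ULTRA = Run("M3 Ultra (60-core GPU) + 96GB unified", 130.2, 400.3, 340.0)

if __name__ == "__main__":
    print(f"decode speedup: {decode_speedup(M3_ULTRA, M3_PRO):.2f}x")
    for run in (M3_PRO, M3_ULTRA):
        secs = estimated_latency_s(run, 256)
        print(f"{run.hardware}: ~{secs:.1f} s for 256 output tokens")
```

Under that assumption, the estimate is dominated by decode time for anything longer than a few dozen output tokens, so the decode tok/s column is the more useful figure when comparing hardware for chat-style workloads.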
