llm-speed

Qwen2.5-0.5B-Instruct-4bit

2 workload results across 1 hardware configuration.

Fastest local config

286.5 decode tok/s

on M3 Pro (18-core GPU) + 36GB unified via mlx (4bit)

Local runs (2 runs)

Runs submitted from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama, and signed on the submitter's hardware.

M3 Pro (18-core GPU) + 36GB unified

| Workload   | Backend    | Quant | Decode tok/s | Prefill tok/s | TTFT   | Run           |
|------------|------------|-------|--------------|---------------|--------|---------------|
| chat-short | mlx@0.31.3 | 4bit  | 286.5        | 656.2         | 200 ms | r_akcbpx5vcqa |
| chat-short | mlx@0.31.3 | 4bit  | 282.6        | 668.8         | 196 ms | r_bftqtkilvoe |
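The three metrics above combine into a rough end-to-end latency estimate. A minimal sketch, assuming the common approximation total latency ≈ TTFT + output_tokens / decode tok/s (the function name and token count are illustrative, not part of the leaderboard):

```python
# Hypothetical helper: estimate end-to-end generation latency from
# the leaderboard metrics, assuming latency ≈ TTFT + tokens / decode rate.
def estimated_latency_s(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    return ttft_s + output_tokens / decode_tok_s

# Fastest run above: 200 ms TTFT, 286.5 decode tok/s; 128 output tokens is an example.
latency = estimated_latency_s(0.200, 128, 286.5)
print(f"{latency:.2f} s")  # roughly 0.65 s
```

Prefill tok/s governs how quickly the prompt is processed before the first token, which is why TTFT stays under a quarter second here despite a 600+ tok/s prefill workload being involved.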

Qwen2.5-0.5B-Instruct-4bit on hardware