
Qwen3-32B-4bit

2 workload results across 2 hardware configurations.

Fastest local config

34.4 decode tok/s

on M3 Ultra (60-core GPU) + 96GB unified via mlx (see full run)

Local runs (2 runs)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Each run is signed on the submitter's hardware.
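
To reproduce a local MLX run, something like the following is the usual starting point. A minimal sketch, assuming the mlx-lm Python package; the checkpoint id is an assumption, not taken from the signed runs below.

```python
# Sketch: generate with mlx-lm and print its built-in speed statistics.
# Assumes `pip install mlx-lm`; the checkpoint id is an assumption.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-32B-4bit")

# verbose=True makes mlx-lm report prompt (prefill) and generation (decode)
# tokens-per-second after the run completes.
generate(
    model,
    tokenizer,
    prompt="Explain KV caching in two sentences.",
    max_tokens=128,
    verbose=True,
)
```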

M3 Pro (18-core GPU) + 36GB unified

Workload   | Backend    | Quant | decode tok/s | prefill tok/s | TTFT     | Run
chat-short | mlx@0.31.3 | 4bit  | 7.16         | 25.65         | 4,288 ms | r_pnrrpcdqfo4

M3 Ultra (60-core GPU) + 96GB unified

Workload   | Backend    | Quant | decode tok/s | prefill tok/s | TTFT     | Run
chat-short | mlx@0.31.3 | 4bit  | 34.41        | 95.12         | 1,156 ms | r_anmmc80-aoq
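
For reference, the three metrics in these tables are typically derived from wall-clock timestamps around a streaming generator. A minimal sketch of that arithmetic (not necessarily how llm-speed's harness computes them; `speed_metrics` is a hypothetical helper):

```python
import time
from typing import Iterable, Tuple

def speed_metrics(stream: Iterable, prompt_tokens: int) -> Tuple[float, float, float]:
    """Return (TTFT in seconds, prefill tok/s, decode tok/s) for a token stream.

    `stream` is any iterator that yields one generated token at a time and is
    started just before iteration begins; `prompt_tokens` is the prompt length.
    Assumes the stream yields at least two tokens.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _token in stream:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token arrived: TTFT endpoint
    end = time.perf_counter()

    ttft = first - start
    # Approximation: TTFT covers prefill plus one decode step, so this
    # slightly understates true prefill speed.
    prefill_tps = prompt_tokens / ttft
    # Steady-state generation speed over the tokens after the first one.
    decode_tps = (count - 1) / (end - first)
    return ttft, prefill_tps, decode_tps
```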

Community folklore

61 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.

  • community · confidence 75%

    30.00 tok/s qwen3 32B on RTX 3090 via vllm gptq

    Thank you for sharing! The difference for PP between 3090 vs M3MAX is remarkable. I thought 3090 would reach 30 t/s for TG, especially with tensor parallelism.  Oh actually, can you please share vLLM results? You can just share one data point if you don't…

    source: Reddit · u/MLDataScientist · 2025-05-11

  • community · confidence 75%

    100.0 tok/s qwen 3 32b on 2x RTX 3090 via lmdeploy awq (the source quotes ~100 tok/s for lmdeploy, not vllm)

    I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it allow me to decide use the code or ask better code u…

    source: Reddit · u/Kasatka06 · 2025-05-03

  • community · confidence 75%

    80.00 tok/s qwen 3 32b on 2x RTX 3090 via vllm awq (see the vLLM sketch after this list)

    I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it all…

    source: Reddit · u/Kasatka06 · 2025-05-03

  • community · confidence 65%

    40.00 tok/s Qwen3 32b gptq 4bit with 4x tensor parallelism (the llama.cpp figure in the quote is for Qwen3 235B; llama.cpp does not run GPTQ)

    Note that you will get 40t/s for Qwen3 32b gptq 4bit with 4x tensor parallelism. Qwen3 235B Q4_1 will work with llama.cpp and 5xMI50 at 19t/s initially. But expect that…

    source: Reddit · u/MLDataScientist · 2025-07-22

  • community · confidence 60%

    180.0 tok/s Qwen3-0.6B on M4 via lm-studio (the ~180 tok/s in the source is for the 0.6B micro-model, not Qwen3-32B)

    its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

  • community · confidence 60%

    45.00 tok/s Qwen3-30B-A3B (Unsloth quant), local (the source puts Fireworks-hosted Qwen3-32B at ~55 tok/s; the ~45 tok/s figure is the local 30B-A3B)

    (via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at …

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07
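
As a concrete reading of the 2x RTX 3090 AWQ claims above, a vLLM launch would look roughly like this. A minimal sketch: the checkpoint id Qwen/Qwen3-32B-AWQ and the memory setting are assumptions, not details taken from the source comments.

```python
from vllm import LLM, SamplingParams

# Hypothetical reproduction of the ~80 tok/s 2x RTX 3090 claim; the model id
# and gpu_memory_utilization value are assumptions.
llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",
    quantization="awq",
    tensor_parallel_size=2,       # shard the weights across both 3090s
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Write a quicksort function in Python."], params)
print(outputs[0].outputs[0].text)
```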

See all 61 claims for Qwen3-32B-4bit

Qwen3-32B-4bit on hardware