llm-speed

DeepSeek V3

No benchmarks for V3 yet.


Run the benchmark on your own hardware to populate this page:

$ pipx install llm-speed && llm-speed bench

Community folklore

30 unverified claims extracted from Reddit and HN comments. These carry lower trust than the signed runs above; every row links to its source.

  • community confidence: 75%

    5.00 tok/s · deepseek-V3 on Together AI via hosted API, FP4

    VRAM, higher vRAM might helpful for long-context, PCIe5 might helpful for prefill. DDR4 Epyc or Icelake-SP Xeons with GPU may only reach ~5 t/s (from my deepseek-V3 experience), which is pretty slow. - Serious one: 4x RTX Pro 6000 Blackwell together with modern Xeon or Epyc shoul…

    source: Reddit · u/lly0571 · 2025-09-22
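The ~5 t/s figure for DDR4 Epyc boxes is consistent with a memory-bandwidth back-of-envelope: with experts offloaded to CPU, each decoded token must stream the model's active weights from RAM once. A rough sketch, where the parameter count, quantization overhead, and bandwidth figures are assumptions drawn from public DeepSeek-V3 specs rather than from this claim:

```python
# Back-of-envelope check of the ~5 t/s DDR4-Epyc claim above.
# Assumptions (not from the quoted post): ~37B active parameters per
# token (671B-total MoE), ~0.55 bytes/param at 4-bit quantization with
# overhead, ~200 GB/s theoretical bandwidth for 8-channel DDR4-3200.

def decode_tps(active_params_b, bytes_per_param, mem_bw_gbps, efficiency=0.5):
    """CPU-offloaded decoding is memory-bandwidth bound: each token
    streams the active weights from RAM once, so tok/s is roughly
    achieved_bandwidth / bytes_read_per_token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bw_gbps * 1e9 * efficiency / bytes_per_token

print(round(decode_tps(37, 0.55, 200), 1))  # → 4.9
```

At ~50% of theoretical bandwidth this lands right on the claimed ~5 t/s, which is why the quoted poster treats faster RAM (or more channels) as the main lever.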

  • community confidence: 65%

    10.00 tok/s · DeepSeek V3 on RTX 4090, FP8

    …g fast. Expert weights on CPU (4-bit GGUF): All the huge MoE banks live in system RAM and load as needed. End result: I’m seeing about ~10 tokens/sec with a 32K context window—pretty smooth for local tinkering. KTransformers made it so easy with its Docker image. It handles …

    source: Reddit · u/texasdude11 · 2025-04-27

  • community confidence: 65%

    10.00 tok/s · DeepSeek V3 on RTX 4090, FP8

    Finally got ~10 t/s DeepSeek V3-0324 hybrid (FP8+Q4_K_M) running locally on my RTX 4090 + Xeon with 512GB RAM, KTransformers and 32K context…

    source: Reddit · u/texasdude11 · 2025-04-27

  • community confidence: 65%

    56.00 tok/s · deepseek v3.2 via vLLM, FP16

    …With the vllm-gfx906-mobydick fork, you can also run smaller recent models (as the base is vllm v0.17.1) like **Qwen3.5 27B** (reaching **56 tok/s** at MTP5 and TP4 but it fits also on 1 MI50 32GB with 65k context; maybe later, if you are interested in, I can also make another …

    source: Reddit · u/ai-infos · 2026-04-01

  • community confidence: 60%

    25.00 tok/s · DeepSeek V3 on H100 via vLLM

    …other developers, unlimited tokens. Running DeepSeek V3 (685B) requires 8×H100 GPUs which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until…

    source: HN · u/jrandolf · 2026-04-04

  • community confidence: 60%

    10.85 tok/s · DeepSeek-V3.1 on RTX 5090 via llama.cpp

    …gh the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps. Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulation to cool the RAM…

    source: Reddit · u/sloptimizer · 2026-01-23

  • community confidence: 60%

    151.8 tok/s (prompt processing) · DeepSeek-V3.1 on RTX 5090 via llama.cpp

    …is threaded through the system popping out through a small gap left by RTX 5090). DeepSeek-V3.1-Terminus with context = 37279 tokens: PP = 151.76 tps, TG = 10.85 tps. Some things I discovered running local LLMs: * For water-cooled CPU systems, there is not enough air circulatio…

    source: Reddit · u/sloptimizer · 2026-01-23

  • community confidence: 60%

    10.00 tok/s · DeepSeek V3 on RTX 4090 via llama.cpp

    …DeepSeek V3-0324 Q4_K_M with 512GB DDR5 RAM 4800MHz and one RTX 4090; Prompt eval speed: 40 t/s; generation: ~10 t/s; DDR5 with 8 channels certainly improves the speed but it is not blazing fast. 3. If you have more VRAM, you can try --ubatch-size …

    source: Reddit · u/MLDataScientist · 2025-06-01
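The RTX 4090 reports above all describe the same offloading pattern: dense attention and shared layers stay on the GPU while the MoE expert tensors live in system RAM. A hedged sketch of what such a llama.cpp launch might look like; the model filename, context size, micro-batch value, and expert-tensor regex are illustrative assumptions, not taken from the posts:

```shell
# Illustrative llama.cpp launch for a large MoE model on one 24GB GPU.
#   -ngl 99            offload all layers to the GPU by default
#   --override-tensor  pin the MoE expert FFN tensors back to CPU RAM
#   -c, --ubatch-size  context window and prompt micro-batch size
llama-server \
  -m ./DeepSeek-V3-0324-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  -c 32768 \
  --ubatch-size 512
```

With this split, generation speed is governed by system-RAM bandwidth (as in the DDR4/DDR5 comparisons above), while prompt processing benefits from the GPU, which is why the posters report prompt eval far faster than token generation.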

See all 30 claims for V3