
whichllm: Hardware-Aware LLM Rankings in One Command

Forget parameter count. This open-source CLI auto-detects your GPU, pulls live HuggingFace data, and ranks local models by real benchmark scores weighted for your exact rig.

DevClubHouse Curation
Jun 8, 2026 · 4 min read · 0 comments

Picking a local LLM used to mean consulting a spreadsheet, squinting at VRAM numbers, and hoping the model you grabbed wasn't already a generation stale. whichllm short-circuits that process: give it your hardware (or let it auto-detect), and it returns a ranked list of the best models you can actually run — scored on real evals, not raw parameter count.

The project has reached 3.3k GitHub stars and requires no setup: the entire workflow starts with a single uvx invocation.

uvx whichllm@latest

What's Wrong with "What Fits in My VRAM?"

The tool's README makes the core argument with a concrete example. On an RTX 4090, a naive "biggest model that fits" heuristic would hand you the 32B Qwen3 variant. whichllm ranks the 27.8B Qwen3.6 first:

#1 Qwen/Qwen3.6-27B   27.8B  Q5_K_M  score 92.8   27 t/s
#2 Qwen/Qwen3-32B     32.0B  Q4_K_M  score 83.0   31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B  Q5_K_M  score 82.7  102 t/s

The 27B model is a newer generation and outscores the 32B on benchmarks — size alone doesn't tell you that. The #3 slot is a MoE model running at 102 t/s because whichllm scores speed on active parameters while scoring quality on total parameters, which is the correct split for mixture-of-experts architectures.
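To make the active-versus-total split concrete, here is a minimal sketch of a bandwidth-bound decode estimate. It is not whichllm's code: the function name, the 0.6 efficiency factor, the bits-per-weight figures, and the round 1000 GB/s bandwidth are all illustrative assumptions.

```python
def tokens_per_second(active_params_b: float, bits_per_weight: float,
                      bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Rough decode speed, assuming each token reads every *active* weight once.

    active_params_b : parameters used per token, in billions
    bits_per_weight : average bits per weight for the quantization (assumed)
    bandwidth_gb_s  : memory bandwidth in GB/s (assumed)
    efficiency      : fraction of peak bandwidth actually achieved (assumed)
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Dense 32B model: all 32B parameters are read for every token.
dense = tokens_per_second(active_params_b=32.0, bits_per_weight=4.8,
                          bandwidth_gb_s=1000.0)

# MoE with 30B total but about 3B active: only the active experts are read.
moe = tokens_per_second(active_params_b=3.0, bits_per_weight=5.7,
                        bandwidth_gb_s=1000.0)

print(f"dense: {dense:.1f} t/s, MoE: {moe:.1f} t/s")
```

This toy model only shows the direction of the effect. It overstates the MoE advantage compared with the 102 t/s in the README excerpt, presumably because the real estimate applies further corrections. Quality, by contrast, would still be judged against the full 30B of total parameters.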

Benchmark scores are drawn from a merged pool: LiveBench, Artificial Analysis, Aider, multimodal/vision evaluations, Chatbot Arena ELO, and the Open LLM Leaderboard. Every score is tagged with a confidence grade — direct, variant, base, interpolated, or self-reported — and discounted accordingly. The tool actively rejects fabricated uploader claims and cross-family score inheritance (a small fine-tune borrowing its base model's numbers). Stale leaderboard entries are demoted along each model's lineage so an old 2024 score can't outrank a current-generation one.
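One way to picture the confidence discount is a per-grade multiplier applied before ranking. The sketch below is an assumption-laden illustration: the grade names come from the article, but the multiplier values, the function name, and the sample models are hypothetical and are not taken from whichllm.

```python
# Hypothetical discount factors. The article names the grades but does not
# publish the weights, so these numbers are placeholders.
CONFIDENCE_DISCOUNT = {
    "direct": 1.00,         # score measured on this exact model
    "variant": 0.95,        # score from a close variant
    "base": 0.90,           # score inherited from the base model
    "interpolated": 0.80,   # score estimated from neighbours
    "self-reported": 0.60,  # score claimed by the uploader
}


def discounted_score(raw_score: float, grade: str) -> float:
    """Scale a raw benchmark score by the trust placed in its provenance."""
    if grade not in CONFIDENCE_DISCOUNT:
        raise ValueError(f"unknown confidence grade: {grade!r}")
    return raw_score * CONFIDENCE_DISCOUNT[grade]


# Hypothetical candidates: (model name, raw score, confidence grade).
candidates = [
    ("model-a", 90.0, "self-reported"),
    ("model-b", 85.0, "direct"),
]

ranked = sorted(candidates,
                key=lambda c: discounted_score(c[1], c[2]),
                reverse=True)
print([name for name, _, _ in ranked])
```

Under these placeholder weights, the directly measured 85 outranks the self-reported 90, which is the behaviour the grading scheme is meant to produce.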

VRAM and Speed Modeling

The VRAM calculation isn't a lookup table. It sums weights, GQA KV-cache, activations, and overhead. Speed estimation is bandwidth-bound and accounts for per-quant efficiency, per-backend factors, MoE active/total splits, and whether you're on unified memory (Apple Silicon) versus discrete PCIe. That last distinction matters: the same model can have meaningfully different practical throughput on an M3 Max 36 GB versus a 3090 24 GB even though the M3 Max has more addressable memory.
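The four-term sum can be sketched as follows. This is a simplified illustration, not whichllm's implementation: the function names, the fixed activation and overhead allowances, the 16-bit KV cache, and the example architecture (64 layers, 8 KV heads, head dimension 128) are all assumptions chosen for the arithmetic.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size under grouped-query attention.

    The leading factor of 2 covers keys and values. GQA shrinks the cache
    because n_kv_heads is smaller than the number of query heads.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1e9


def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int,
                     activation_gb: float = 0.5,   # assumed allowance
                     overhead_gb: float = 1.0) -> float:  # assumed allowance
    """Sum weights, KV cache, activations, and runtime overhead in GB."""
    weights_gb = params_b * bits_per_weight / 8
    cache_gb = kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len)
    return weights_gb + cache_gb + activation_gb + overhead_gb


# Hypothetical 27B model at about 5.7 bits per weight with an 8k context.
total = estimate_vram_gb(params_b=27.0, bits_per_weight=5.7,
                         n_layers=64, n_kv_heads=8, head_dim=128,
                         context_len=8192)
print(f"{total:.1f} GB")  # 22.9 GB with these assumed inputs
```

With these inputs the weights dominate (about 19.2 GB) and the KV cache adds roughly 2.1 GB, so the total lands just under a 24 GB card. A longer context or more KV heads would push it over.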

A snapshot from the README (live data will differ):

Hardware         VRAM   Top pick                                  Speed
RTX 5090         32 GB  Qwen3.6-27B · Q6_K · score 94.7           ~40 t/s
RTX 4090 / 3090  24 GB  Qwen3.6-27B · Q5_K_M · score 92.8         ~27 t/s
RTX 4060         8 GB   Qwen3-14B · Q3_K_M · score 71.0           ~22 t/s
Apple M3 Max     36 GB  Qwen3.6-27B · Q5_K_M · score 89.4         ~9 t/s
CPU only         -      gpt-oss-20b (MoE) · Q4_K_M · score 45.2   ~6 t/s

Beyond the Ranking: The Full CLI Surface

whichllm covers several workflows beyond the default recommendation:

  • GPU simulation: whichllm --gpu "RTX 4090" lets you test any card before buying it.
  • Reverse lookup: whichllm plan "llama 3 70b" tells you what GPU you'd need for a specific model.
  • Upgrade comparison: whichllm upgrade "RTX 4090" "RTX 5090" "H100" diffs candidates side by side.
  • One-command chat: whichllm run "qwen 2.5 1.5b gguf" spins up an isolated environment via uv, downloads the model, and drops you into an interactive session. Supports GGUF (via llama-cpp-python), AWQ, and GPTQ.
  • Code generation: whichllm snippet "qwen 7b" prints copy-paste Python for the chosen model.
  • Scripting: --json output makes every command pipeline-friendly.
  • Task profiles: filter results by general, coding, vision, or math.
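For scripting, the --json flag can be consumed from Python with the standard library. The sketch below is hedged in two ways: it assumes --json can be combined with --gpu in one invocation, and it makes no assumption about the JSON schema, which the article does not describe, so it only parses and returns whatever the command prints. The helper names and the injectable runner parameter are hypothetical conveniences for testing.

```python
import json
import subprocess
from typing import Any, Callable, List, Optional, Sequence


def _default_runner(cmd: List[str]) -> str:
    """Run the command and return its stdout, raising on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def run_whichllm_json(args: Sequence[str] = (),
                      runner: Optional[Callable[[List[str]], str]] = None) -> Any:
    """Invoke whichllm with --json and return the parsed output.

    args   : extra CLI arguments, e.g. ["--gpu", "RTX 4090"]
    runner : callable that executes the command; replaceable in tests
    """
    cmd = ["whichllm", *args, "--json"]
    run = runner if runner is not None else _default_runner
    return json.loads(run(cmd))


if __name__ == "__main__":
    # Requires whichllm on PATH. The structure of `data` depends on the tool.
    data = run_whichllm_json(["--gpu", "RTX 4090"])
    print(json.dumps(data, indent=2))
```

Passing a runner keeps the parsing logic testable without the CLI installed. In real use the default runner shells out to whichllm directly.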

Data comes from the HuggingFace API with curated frozen fallbacks for offline or rate-limited environments. The benchmark snapshot date is printed under every ranking, so a stale recommendation is visible rather than silently trusted.

Install via brew install andyyyy64/whichllm/whichllm, pip install whichllm, or uv tool install whichllm. For one-offs, uvx whichllm@latest requires nothing persistent.
