
whichllm: Hardware-Aware LLM Rankings in One Command

Forget parameter count. This open-source CLI auto-detects your GPU, pulls live HuggingFace data, and ranks local models by real benchmark scores weighted for your exact rig.

DevClubHouse Curation

Jun 8, 2026 · 4 min read · 0 comments

Picking a local LLM used to mean consulting a spreadsheet, squinting at VRAM numbers, and hoping the model you grabbed wasn't already a generation stale. whichllm short-circuits that process: give it your hardware (or let it auto-detect), and it returns a ranked list of the best models you can actually run — scored on real evals, not raw parameter count.

The project has reached 3.3k GitHub stars and needs no setup: the entire workflow starts with a single uvx invocation.

uvx whichllm@latest

What's Wrong with "What Fits in My VRAM?"

The tool's README makes the core argument with a concrete example. On an RTX 4090, a naive "biggest model that fits" heuristic would hand you the 32B Qwen3 variant. whichllm ranks the 27.8B Qwen3.6 first:

#1 Qwen/Qwen3.6-27B   27.8B  Q5_K_M  score 92.8   27 t/s
#2 Qwen/Qwen3-32B     32.0B  Q4_K_M  score 83.0   31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B  Q5_K_M  score 82.7  102 t/s

The 27B model is a newer generation and outscores the 32B on benchmarks, which size alone doesn't tell you. The #3 slot is an MoE model running at 102 t/s because whichllm scores speed on active parameters while scoring quality on total parameters. That split suits mixture-of-experts architectures: only the active experts' weights are read for each token, so they set the speed, while the full parameter count reflects the model's overall capacity.

Benchmark scores are drawn from a merged pool: LiveBench, Artificial Analysis, Aider, multimodal/vision evaluations, Chatbot Arena Elo, and the Open LLM Leaderboard. Every score is tagged with a confidence grade (direct, variant, base, interpolated, or self-reported) and discounted accordingly. The tool actively rejects fabricated uploader claims and cross-family score inheritance (a small fine-tune borrowing its base model's numbers). Stale leaderboard entries are demoted along each model's lineage so an old 2024 score can't outrank a current-generation one.
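To make the discounting idea concrete, here is a minimal Python sketch. It is not whichllm's code: the article names the confidence grades but not their weights, so the discount factors, the BenchmarkEntry type, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical discount factors. The grade names come from the article;
# the numeric weights are illustrative assumptions only.
CONFIDENCE_DISCOUNT = {
    "direct": 1.00,         # measured on this exact model
    "variant": 0.95,        # measured on a close variant
    "base": 0.85,           # inherited from the base model
    "interpolated": 0.75,   # estimated from neighbouring models
    "self-reported": 0.50,  # claimed by the uploader
}


@dataclass
class BenchmarkEntry:
    source: str       # e.g. "LiveBench"
    score: float      # assumed to be normalized to 0-100
    confidence: str   # one of the keys in CONFIDENCE_DISCOUNT


def discounted_score(entry: BenchmarkEntry) -> float:
    """Scale a raw score by how much its provenance is trusted."""
    return entry.score * CONFIDENCE_DISCOUNT[entry.confidence]


def best_evidence(entries: list[BenchmarkEntry]) -> BenchmarkEntry:
    """Pick the entry with the highest score after discounting."""
    return max(entries, key=discounted_score)


entries = [
    BenchmarkEntry("LiveBench", 80.0, "direct"),
    BenchmarkEntry("uploader model card", 95.0, "self-reported"),
]
print(best_evidence(entries).source)
```

With these illustrative weights, a directly measured 80 beats a self-reported 95, which is the behaviour the paragraph above describes.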

VRAM and Speed Modeling

The VRAM calculation isn't a lookup table. It sums weights, GQA KV-cache, activations, and overhead. Speed estimation is bandwidth-bound and accounts for per-quant efficiency, per-backend factors, MoE active/total splits, and whether you're on unified memory (Apple Silicon) or a discrete PCIe GPU. That last distinction matters: in the README snapshot, the same Qwen3.6-27B Q5_K_M pick is estimated at roughly 9 t/s on an M3 Max 36 GB versus roughly 27 t/s on a 3090 24 GB, even though the M3 Max has more addressable memory.
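The accounting described above can be sketched in a few lines of Python. This is a simplified illustration, not whichllm's implementation: the function names, the fixed activation and overhead allowances, and the single efficiency factor are all assumptions, and the architecture numbers in the usage lines are made up.

```python
def weight_bytes(params_billions: float, bits_per_weight: float) -> float:
    """Bytes needed to store the given number of parameters at a quant width."""
    return params_billions * 1e9 * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size under grouped-query attention.

    Each layer stores one K and one V tensor of shape
    (context_len, n_kv_heads, head_dim); GQA shrinks the cache because
    n_kv_heads is smaller than the number of query heads.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_vram_gb(total_params_b: float, bits_per_weight: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     context_len: int, activation_gb: float = 0.5,
                     overhead_gb: float = 1.0) -> float:
    """Weights + KV cache + flat (assumed) activation and overhead allowances."""
    weights = weight_bytes(total_params_b, bits_per_weight)
    kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
    return (weights + kv) / 1e9 + activation_gb + overhead_gb


def estimate_tokens_per_sec(active_params_b: float, bits_per_weight: float,
                            bandwidth_gb_s: float,
                            efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode speed.

    Each generated token reads the active weights once, so throughput is
    roughly usable bandwidth divided by bytes read per token. For an MoE
    model, pass the active parameter count, not the total.
    """
    bytes_per_token = weight_bytes(active_params_b, bits_per_weight)
    return efficiency * bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical 30B models at about 5.5 bits per weight on a 1000 GB/s card.
dense = estimate_tokens_per_sec(30.0, 5.5, 1000.0)
moe = estimate_tokens_per_sec(3.0, 5.5, 1000.0)  # only 3B active per token
print(f"dense ~{dense:.0f} t/s, MoE ~{moe:.0f} t/s")
print(f"VRAM ~{estimate_vram_gb(30.0, 5.5, 48, 8, 128, 8192):.1f} GB")
```

The MoE case still needs VRAM for all 30B parameters but decodes about ten times faster in this model, which matches the pattern of the 30B-A3B entry in the ranking above. A real estimator would vary the efficiency factor by quant and backend, as the article says whichllm does.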

A snapshot from the README (live data will differ):

Hardware	VRAM	Top pick	Speed
RTX 5090	32 GB	Qwen3.6-27B · Q6_K · score 94.7	~40 t/s
RTX 4090 / 3090	24 GB	Qwen3.6-27B · Q5_K_M · score 92.8	~27 t/s
RTX 4060	8 GB	Qwen3-14B · Q3_K_M · score 71.0	~22 t/s
Apple M3 Max	36 GB	Qwen3.6-27B · Q5_K_M · score 89.4	~9 t/s
CPU only	—	gpt-oss-20b (MoE) · Q4_K_M · score 45.2	~6 t/s

Beyond the Ranking: The Full CLI Surface

whichllm covers several workflows beyond the default recommendation:

GPU simulation — whichllm --gpu "RTX 4090" lets you test any card before buying it.
Reverse lookup — whichllm plan "llama 3 70b" tells you what GPU you'd need for a specific model.
Upgrade comparison — whichllm upgrade "RTX 4090" "RTX 5090" "H100" diffs candidates side by side.
One-command chat — whichllm run "qwen 2.5 1.5b gguf" spins up an isolated environment via uv, downloads the model, and drops you into an interactive session. Supports GGUF (via llama-cpp-python), AWQ, and GPTQ.
Code generation — whichllm snippet "qwen 7b" prints copy-paste Python for the chosen model.
Scripting — --json output makes every command pipeline-friendly.
Task profiles — filter results by general, coding, vision, or math.
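For scripting, a thin Python wrapper around the CLI might look like the sketch below. The --gpu and --json flags are the ones mentioned in the list above; combining them in a single call is an assumption, and because the article does not describe the JSON schema, the wrapper returns the parsed output without assuming any field names. The helper names build_command and run_ranking are hypothetical.

```python
import json
import subprocess
from typing import Any, Optional


def build_command(gpu: Optional[str] = None) -> list[str]:
    """Assemble the argument list for a one-off uvx run with JSON output."""
    cmd = ["uvx", "whichllm@latest", "--json"]
    if gpu is not None:
        # Simulate a specific card instead of auto-detecting hardware.
        cmd += ["--gpu", gpu]
    return cmd


def run_ranking(gpu: Optional[str] = None) -> Any:
    """Run the CLI and parse stdout as JSON.

    Requires uv to be installed; raises CalledProcessError on a non-zero
    exit. The structure of the returned object is not documented in the
    article, so inspect it before relying on particular keys.
    """
    result = subprocess.run(build_command(gpu), capture_output=True,
                            text=True, check=True)
    return json.loads(result.stdout)


print(build_command("RTX 4090"))
```

Keeping command construction separate from execution makes the wrapper easy to check without network access or a GPU.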

Data comes from the HuggingFace API with curated frozen fallbacks for offline or rate-limited environments. The benchmark snapshot date is printed under every ranking, so a stale recommendation is visible rather than silently trusted.

Install via brew install andyyyy64/whichllm/whichllm, pip install whichllm, or uv tool install whichllm. For one-offs, uvx whichllm@latest requires nothing persistent.
