
You're Running Your Python Type-Checkers on the Wrong Code

With mypy, Pyrefly, Pyright, ty, and Zuban all competing for CI minutes, a new proposal argues that library maintainers have their priorities exactly backwards.

DevClubHouse Curation

Jun 8, 2026 · 4 min read

The Python type-checking ecosystem has quietly fragmented. Where once you had mypy and maybe Pyright, you now have five actively developed checkers (mypy, Pyrefly, Pyright, ty, and Zuban), with more apparently on the way. For application developers that's annoying. For library maintainers it's a genuine crisis: do you gate CI on all five?

A post from the Pyrefly team makes the case that most maintainers are solving the wrong problem entirely.

The backwards pattern

The typical setup: a library runs a type-checker over its source code in CI, and leaves the test suite untyped or ignores it. The Pyrefly post argues this is exactly inverted.

Your internal implementation choices—which formatter you use, how you structure modules, which assertions you prefer in tests—are invisible to your users. What your users do care about is whether your public API works cleanly with their type-checker. And you have zero control over which one they pick.

The prescription:

Test suite: run as many type-checkers as possible
Source code: run at least one (your choice)

The logic is sound. Type-checking your test suite is type-checking your public API surface, exercised as a real caller would use it. Type-checking your internals only verifies code your users never touch directly.

Why five checkers disagree

The Python typing spec defines a shared baseline, but it leaves grey areas—particularly around under-specified annotations. In those gaps, different tools make different calls: some maximize strictness and accept false positives, others stay lenient to support gradual adoption.
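One gap of this kind is easy to reproduce. The sketch below is an illustration, not an example from the post: `parse_port` and `describe` are hypothetical names, and `parse_port` is left unannotated on purpose. With default settings, mypy treats the result of calling an unannotated function as `Any`, while Pyright infers a return type from the function body, so the two tools hold different views of `port` for identical code.

```python
def parse_port(raw):  # hypothetical helper, deliberately left unannotated
    return int(raw)


def describe(raw: str) -> str:
    # mypy (default settings): `port` is Any, because parse_port has no annotations.
    # Pyright: `port` is inferred as int from the body of parse_port.
    port = parse_port(raw)
    return f"port {port}"


print(describe("8080"))
```

Nothing differs at runtime; the difference only shows up in what each tool is willing to report about code that uses `port`.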

That divergence is mostly invisible at the call site. Consider the Polars DataType.__eq__ example from the post. The method returns different types depending on its argument, which requires overloads and clashes with Python's expectation that __eq__ returns bool. Getting it past mypy, Pyrefly, ty, and Pyright simultaneously produces this:

@overload  # type: ignore[override]
def __eq__(  # pyrefly: ignore[bad-override]
    self, other: pl.DataTypeExpr
) -> pl.Expr: ...

@overload
def __eq__(self, other: PolarsDataType) -> bool: ...

def __eq__(self, other: pl.DataTypeExpr | PolarsDataType) -> pl.Expr | bool:
    # ty: ignore[invalid-method-override]
    # pyright: ignore[reportIncompatibleMethodOverride]
    ...  # implementation body not shown in the excerpt

Four separate suppression comments for seven lines of code. Scale that across a real codebase and you end up with an unmaintainable mess of tool-specific annotations that no one wants to own.

Here's the key finding: all five checkers—mypy, Pyrefly, Pyright, ty, and Zuban—accepted the test for that same function without complaint. The checkers argue about how the implementation should be written; they agree on what the public API does. That's the seam you want to exploit.
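The shape of such a test can be sketched without Polars. Everything below is a hypothetical stand-in (`Expr`, `DataTypeExpr`, `DataType`, and the test name are invented for illustration and are not the Polars code): the class carries the contentious overloaded `__eq__`, while the test only states what a caller expects back. The annotated assignments in the test are the part a type-checker verifies.

```python
from __future__ import annotations

from typing import overload


class Expr:
    """Hypothetical stand-in for pl.Expr."""


class DataTypeExpr:
    """Hypothetical stand-in for pl.DataTypeExpr."""


class DataType:
    """Hypothetical stand-in for a Polars data type."""

    def __init__(self, name: str) -> None:
        self.name = name

    # Checkers may flag this override of object.__eq__; that is the
    # implementation-side disagreement, and it stays inside the class.
    @overload
    def __eq__(self, other: DataTypeExpr) -> Expr: ...
    @overload
    def __eq__(self, other: DataType) -> bool: ...
    def __eq__(self, other: DataTypeExpr | DataType) -> Expr | bool:
        if isinstance(other, DataTypeExpr):
            return Expr()
        return self.name == other.name


def test_eq_return_types() -> None:
    int64 = DataType("Int64")
    # Each annotation states the return type a caller relies on.
    as_expr: Expr = int64 == DataTypeExpr()
    as_bool: bool = int64 == DataType("Int64")
    assert isinstance(as_expr, Expr)
    assert as_bool is True


test_eq_return_types()
```

The test function never mentions a suppression comment: it only exercises the two call shapes and records the type each one should produce.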

The Polars case study

Polars is the post's working example. Integrating Pyrefly into Polars' CI against the test suite was described as "relatively painless." Running it against the full source code was a larger lift—Pyrefly is stricter than mypy and surfaced both legitimate issues and a handful of Pyrefly bugs (most of which were fixed in the v1 release). The source-code integration is still being tackled incrementally.

The takeaway isn't that you should never type-check your internals—it's about sequencing. Getting your public API clean across the major checkers delivers immediate value to every user of your library. Cleaning up your internal type hygiene is worthwhile but lower-urgency and can be done with whichever single checker suits your team.

The practical checklist

If you maintain a Python library:

Run type-checkers against your test suite in CI, starting with Pyright and mypy since those have the largest user bases. Add Pyrefly, ty, and Zuban as the ecosystem stabilizes.
Pick one checker for your source code based on where you want to sit on the strict-vs-lenient spectrum. Strict options catch more bugs; lenient ones integrate more easily into existing codebases.
Don't let suppression comments proliferate in your implementation trying to satisfy every tool—fix the tests first, and investigate implementation warnings individually.
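As a starting point for the first item, a small driver script can run several checkers over the test directory. This is a minimal sketch under stated assumptions: the subcommands in `CHECKERS` (`pyrefly check`, `ty check`, `zuban check`) are assumed invocations that should be confirmed against each tool's documentation, and `build_commands` and `run_checkers` are hypothetical helper names.

```python
from __future__ import annotations

import shutil
import subprocess

# Assumed CLI invocations; confirm each against the tool's own documentation.
CHECKERS: dict[str, list[str]] = {
    "pyright": ["pyright"],
    "mypy": ["mypy"],
    "pyrefly": ["pyrefly", "check"],
    "ty": ["ty", "check"],
    "zuban": ["zuban", "check"],
}


def build_commands(target: str, names: list[str]) -> dict[str, list[str]]:
    """Return the command line for each requested checker, in the order given."""
    return {name: [*CHECKERS[name], target] for name in names}


def run_checkers(target: str, names: list[str]) -> dict[str, int | None]:
    """Run each checker found on PATH; None marks a checker that is not installed."""
    results: dict[str, int | None] = {}
    for name, cmd in build_commands(target, names).items():
        if shutil.which(cmd[0]) is None:
            results[name] = None
            continue
        results[name] = subprocess.run(cmd).returncode
    return results
```

Calling `run_checkers("tests", ["pyright", "mypy"])` in CI and failing the job on any non-zero return code covers the two checkers the checklist starts with; the remaining names can be appended to the list as they stabilize.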

The fragmentation isn't going away. The typing spec's grey areas mean tool authors will keep making different calls. But if your tests pass all five checkers, your users are covered regardless of which one they reach for.
