typeshed Definition of `str.split`, `str.rsplit` and `str.splitlines` results in false positive errors

trafficstars

I recently fixed a bug in pyright that caused a false negative when attempting to assign a list[LiteralString] to a list[str]. As we know, the type parameter for list is invariant, so these types are not consistent.

from typing import LiteralString
v1: list[LiteralString] = ["a", "b"]
v2: list[str] = v1  # Should be a type violation

With this bug fix now in place, the definitions for the methods str.split, str.rsplit and str.splitlines are problematic. These three methods are defined using overloaded signatures, and the first overload returns list[LiteralString]. This means the following code now results in a type violation.

def process(x: list[str]) -> str:
    return " ".join(x)

process("1,2,3".split(","))  # Type violation

This is going to generate a lot of new type errors in existing code bases.

I initially attempted to work around this by adding a heuristic in pyright's overload matching logic so it deprioritized the first overload in this case, but this heuristic broke other code and really isn't sound. I've therefore decided to remove the heuristic.

I don't have any good ideas about how to fix this, but I wanted to open the issue to start the discussion to see if anyone had a suggestion. If we're going to fix it, I think the fix needs to be in the typeshed stubs. I don't see a way to fix this in a type checker — at least while remaining true to the type system.

One possible solution is to omit the LiteralString overloads for these three methods. This would eliminate the type violation error in the common case, but it would introduce new errors in cases where callers expect to see a return result of list[LiteralString].

Note that this doesn't currently affect mypy users because mypy has not implemented support for LiteralString.

Oct 14 '23 20:10 erictraut

I've run into similar issues with overloads before, where sometimes you are forced to have a fairly narrow first overload, since it overlaps with a more general overload, but then you run into exactly this type of issue where type inference sometimes gives you too narrow of a type in cases where it doesn't know which overload to pick, so it defaults to the first one.

It would probably make sense to decouple overload order from what's considered the nominal fallback overload if we don't know which overload matches. Something like an optional default parameter on the overload decorator, or a new decorator default_overload. Something like PEP696 but for overloads.

It doesn't come up very often, but when it does, it is always frustrating, because there's no satisfying way to solve it, you always end up having to get rid of the more narrow overload and have to deal with false negatives instead.

Oct 14 '23 21:10 Daverball

Broader questions about how @overload works, and possible ways of improving @overload, are somewhat off-topic for the typeshed issue tracker @Daverball :) https://discuss.python.org/c/typing/32 would be a better place to discuss that kind of thing.

For this specific question, I'd like the opinions of @gbleaney (author of #7725, and co-author of PEP-675) and @pradeep90 (the other co-author of PEP-675).

I'll note that the PEP specifically lists these methods in Appendix C as methods that should be typed with LiteralString, so we'll be technically contravening the specification if we stop using LiteralString for these methods. (I don't think it's that big a deal, though.)

Oct 15 '23 12:10 AlexWaygood

I don't have much to contribute, but note that the underlying problem is users using list[...] as an argument annotation. This is understandable as it's the most obvious and ergonomic approach (and foo: list[str] arguably looks "cleaner" than foo: Iterable[str]), but maybe we could find a way to make it easier for users to use the "correct" annotation. For example allowing foo: iter[str] (although I'm aware that iter() returns an iterator) or foo: seq[str], without needing an import.

Oct 16 '23 10:10 srittau

One possible solution is to omit the LiteralString overloads for these three methods. This would eliminate the type violation error in the common case, but it would introduce new errors in cases where callers expect to see a return result of list[LiteralString].

For now, I suppose I lean towards this solution. Anything else seems too breaking, in my opinion :/

Oct 16 '23 10:10 AlexWaygood

One possible solution is to omit the LiteralString overloads for these three methods. This would eliminate the type violation error in the common case, but it would introduce new errors in cases where callers expect to see a return result of list[LiteralString].

For now, I suppose I lean towards this solution. Anything else seems too breaking, in my opinion :/

I would like to see the impact this would have, compared to the impact of not changing it. There is a reason, the LiteralString overload was introduced, so removing it would most likely negatively affect users as well.

Oct 17 '23 08:10 srittau

I'm inclined to agree with @srittau and say that the underlying problem is the typing on process in @erictraut's example, but I understand that that's not much consolation for anyone who's code is full of LiteralString type errors all of a sudden, when they don't even use LiteralString explicitly anywhere. LiteralString is likely always going to be a niche feature, and it's existence shouldn't be causing pain for those who don't use it (maybe this is my C++ "don't pay for what you don't use" mindset sneaking in).

I have a few concerns with the proposal of omitting the LiteralString overrides for str.split, str.rsplit and str.splitlines:

Where do we draw a principled line between where it makes sense to remove these overrides to benefit existing code type checking, vs where do we leave them for the convenience of future users. For example, it's not uncommon to break a literal string into chars in a list comprehension ([c for c in "string"]), which could create the same problem outlined in this issue. If we do this, I'd like for us to have a clear line for when we remove these overrides vs ask people to change their typing, rather than doing it piecemeal and inconsistently.
These overloads have existed for more than a year, so some people may now rely on them. Are we going to cause type checking pain by deleting these?

My personal preference would be for the overrides to stay, and for fixing/suppressing errors like what @erictraut outlined to be a part of the normal upgrade path. Internally at Meta, when a Pyre upgrade leads to new type errors the we can't automate fixing, we use a Pyre flag to automatically introduce error suppression at the sites of the newly surfaced errors. That lets us keep the type check passing and prevent new issues from being introduced, while not forcing us to block upgrades on massive refactors.

Oct 19 '23 15:10 gbleaney

My view is usage of LiteralString is closer to a strict typing feature vs one codebase new to typing should have to handle. Pyright, mypy, and I'm assuming other type checkers have flags/strict mode to control which rules are enabled. I think LiteralString ideally should have a flag where enableLiteralString (enabled in --strict like modes) would treat it as separate type and use rules documented in PEP, while enableLiteralString: False (default) would instead treat it as an alias to str.

This would not be first typing feature that has different behavior depending on flag. override PEP explicitly specifies that type checkers should offer strict enforcement as, Strict enforcement should be opt-in.

Oct 20 '23 07:10 hmc-cs-mdrissi

override PEP explicitly specifies that type checkers should offer strict enforcement

That's different. The override decorator doesn't affect type consistency rules. I don't think type consistency rules should change based on the strictness mode of a type checker.

Oct 20 '23 15:10 erictraut

I commented on this briefly where it came up for someone in the python discord's typing channel. I believe that this is neither the typeshed being wrong nor pyright being wrong. That isn't to say this can't be improved for users though.

It essentially comes from that Literals (and LiteralStr) are refinement types (academic theory, not an established term within python's type system) without any rules other than subtyping to go off of. I think this is worth examining in more detail as part of the spec. It should be okay to promote a refinement type entirely implicitly to a less specific type in some cases, even those involving invariant typevars, and remain type safe, but the rules around this would need review and specification. This would also only ever be safe going from the literal to the type the literal belongs to.

I can help with exploring the ramifications of changing this after other things on my todo list, it may or may not be the best option for python specifically.

Jan 19 '24 04:01 mikeshardmind

typeshed typeshed copied to clipboard

Definition of `str.split`, `str.rsplit` and `str.splitlines` results in false positive errors

typeshed
typeshed copied to clipboard