typing
typing copied to clipboard
Possible to distinguish between Sequence[str]/Iterable[str] and str?
If a function expects an iterable of strings, is it possible to forbid passing in a string, since strings are iterable? Seems like there are many cases where this would be an error, but I don't see an obvious way to check 't','h','i','s'.
Since str
is a valid iterable of str
this is tricky. Various proposals have been made but they don't fit easily in the type system.
I think type should never lie, even if it is a white lie. Either we should remove str.__iter__
(or make it yield something else than strs), or we should allow passing 'abc' into a function expecting Iterable[str]
. Of course, I'm for second option.
Or we should have a special type name for "iterable of strings that is not a string". Strings are already special, as AnyStr shows. But although AnyStr is able to be represented using more primitive operations, I think it's too early to introduce a "type difference" operation in general. E.g. co(ntra)variance seems weird in that case.
The problem I have with allowing Sequence[str]
or Iterable[str]
to be satisfied by str
is that the problem of passing a str
in where a sequence of (generally non single character) str
s is really intended is a common API misuse that a type checker needs to be able to catch.
People can over-specify their APIs by requiring List[str]
or Tuple[str]
as input instead of the more general sequence or iterable but this is unnatural when teaching people how to type annotate. We'd prefer to just tell everyone to always prefer Iterable
or Sequence
on input.
Random thought: Would it be possible for our "magic" Text type to lose it's __iter__
? So that Iterable[Text]
works as desired and forbids a lone str
argument?
It requires more work on the part of API authors, but one option that might be less of a lie is to be able to delete an overload. C++ has a similar problem, where a type being passed in might "work" but you want to forbid it. So they implemented a special overload that, if matched, causes an error. See e.g. this SO thread.
Then one could define the API for Iterable[str]
, and delete the overload for str
.
Mypy doesn't currently have a way to remove methods in a subclass, because it would fail Liskov. But there's a hack possible. You can change the signature of a method override in a way that violates Liskov, and then add a # type: ignore
to prevent mypy from complaining. Mypy will then check uses according to the override! So maybe something like this (untested) could be made to work:
class Text(Sequence[str]):
def __iter__(self) -> None: ... # type: ignore
It actually doesn't work. Because currently there is a rule in mypy: "nominal first" (for various important reasons), if something works using nominal subtyping, then mypy just uses it. In this case Text
is still a nominal subtype of Sequence[str]
.
Hm... Maybe Text could be a Protocol that has the same methods as Sequence except for one?
Maybe, TBH I am still not sure what are the costs/benefits here. I am afraid making such big changes in typeshed can break many existing code. But on the other hand if someone wants to do this "locally" it should be a fine solution.
A relatively simple approach would be to special case str
vs. Iterable[str]
/ Sequence[str]
compatibility in a type checker. This behavior could be enabled through a strictness option. This issue seems quite specific to str
(and unicode
) so anything more drastic may not be worth it.
If we assume the type checker has reasonable good dead code analysis capabilities, we could get a solution that's pretty similar to the one C++ has for free by combining @overload
and NoReturn
. For example:
from typing import Iterable, overload, NoReturn
@overload
def foo(x: str) -> NoReturn: ...
@overload
def foo(x: Iterable[str]) -> int: ...
def foo(x):
if isinstance(x, str):
raise Exception()
else:
return 1
def main() -> None:
x = foo(["hello", "world"])
reveal_type(x)
# Maybe report dead code?
y = foo("hello")
reveal_type(y)
That said, idk if any type checkers actually do handle this case gracefully. Mypy, for example, will just silently ignore the last reveal_type
(and warn that y
needs an annotation).
Maybe to help this analysis, we could add some sort of ShouldNeverBeEncountered
type? Type checkers could add a special-case that reports an error whenever they see some function call evaluates to this type, but otherwise treat it as being identical to NoReturn
.
It's not a perfect solution since there's still no definitive way of telling if an Iterable[str]
is a str
or not, but it'd at least give library authors a way to catch some of the more obvious misuses w/o requiring their users to use a special Text-like protocol.
Given the norm for most APIs is to accept the iterable and never want plain str
we should aim to support that as a trivial annotation that doesn't involve multiple defs and overloading. the oddball situation where someone wants to accept the iterable and plain str should be the complicated one if complexity is needed.
I think we're trying to expand type hints beyond their original purpose, and it shows. If I say a_string.rstrip('abc'), the function is going to work perfectly. It will, according to its specification, produce a "copy" of a_string, from which all as, bs and cs are removed at the end. There isn't going to be any "hidden type errors", "accidental mechanisms" or "unintended consequences" that the type hints are usually trying to prevent. Mypy has nothing to do here. 'abc' is just a compact way to write an iterable of strs, that yields 'a', 'b' and 'c' in that order, and then stope. What you're now trying to do is go beyond "do types match" (they do, absolutely) into "did the caller really intend to write this". And that is a dangerous crossing of responsibility boundaries. Yes, there is a sentence in PEP 484 about mypy being "a powerful linter", but I really think noone wanted mypy to take over all responsibilities of a linter. At least I hope so. In short: is passing a str as an Iterable[str] a common error? I don't know (in my experience it is not, but of course you have more experience). Does it need to be flagged by a linter? Probably. Are type hints the right way to catch it? NO.
Maybe Text could be a Protocol that has the same methods as Sequence except for one?
Unfortunately this would make Text
incompatible with str
and would generally break typeshed and existing annotations.
I like the idea of special-casing strings in the tool rather than in the type system, since as @gvanrossum notes, str is an iterable of str (turtles all the way!). Also this sort of type-aware linting is a neat idea, and could be done relatively easily within the typechecker because we have all the information at hand.
all the information at hand
Do we? When I see a function that takes an Iterable[str]
or Sequence[str]
-- how do we know it is meant to exclude str
? Or do we just assume it is always excluded?
I was thinking always excluded; I've run into problems in both python and other languages where a function expecting an iterable was passed a string, and never (that I can think of) actually wanted a generic iterable to treat a string as an iterable of chars. That should hold even more strongly if the function specifies Iterable[str]; it is a good hint that str is being viewed as an atomic type there.
Hm, I guess you could add it back explicitly by saying Union[str, Iterable[str]]
.
Would this extend to e.g. Iterable[AnyStr]
?
I think so, yes; I want to say that str|bytes|unicode should not satisfy Iterable[anything] if the flag is passed in.
In short: is passing a str as an Iterable[str] a common error?
Yes. We have seen this specific bug multiple independent times at work. Unfortunately more than once after deployment in production.
I consider it a motivating anti-pattern for a type checker to help avoid. No other tool can validate this, it requires type information. Lets not be purists here. It improves developer productivity and code maintainability to flag this and we have a way to explicitly annotate the less common APIs that want to accept both. :)
Requiring such APIs to specify Union[str, Iterable[str]]
is a good example of explicit is better than implicit.
We have seen this specific bug multiple independent times at work.
I recall about how Rob Pike (who famously has just 'r' as his username) once got spammed when some script that sent email invoked an email-sending API with a single email address instead of a list.
If we're going to go EIBTI route, why not be explicit where it counts?
for char in a_string.chars():
It would also help in distinguishing iterating through combined characters (graphemes), and be almost analogous to iterating through words with .split() and lines with .splitlines().
While we're at it, I would be very happy with for line in a_file.lines()
, again giving the ability to be explicit with a_file.records(sep=...) or a_file.blocks(size=...).
Yes, I know what the response is going to be. But if we really don't want to change the language, maybe it really is not the problem of the language as a whole, but of a specific API. And there I don't see any problem with writing
if isinstance(an_arg, str):
raise ValueError('an_arg is supposed to be an iterable of non-single-character strings')
Again, it's not the type that's wrong (although you can raise TypeError above if you want:). Analogy: there are many functions that declaratively accept int, but in fact work only with nonnegative numbers. In fact, I think there are more such functions than the ones that work out of the box with negative integers. Are we going to redefine that an annotation n: int
really means a nonnegative integer, and require people who want int
to mean int to jump through hoops?
I found this thread because I am looking for a way to annotate some code like below:
from typing import Iterable, overload, Optional
USER_IDS = {'Alice': 1, 'Bob': 2, 'Carol': 3}
@overload
def get_user_id(name: str) -> Optional[int]: ...
@overload
def get_user_id(name: Iterable[str]) -> Iterable[Optional[int]]: ...
def get_user_id(name):
if isinstance(name, str):
return USER_IDS.get(name)
else:
return [USER_IDS.get(item) for item in name]
Currently, mypy (v0.730) gives error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Not sure if anyone suggested this before, perhaps we can add a "negative" or "difference" type. Similar to Union
that is an analogy to the set operator |
, Diff[A, B]
corresponds to the -
operator, which matches anything that is type A
but not type B
.
Having the Diff
type, we can annotate the above code as:
from typing import Diff, Iterable, overload, Optional
USER_IDS = {'Alice': 1, 'Bob': 2, 'Carol': 3}
@overload
def get_user_id(name: str) -> Optional[int]: ...
@overload
def get_user_id(name: Diff[Iterable[str], str]) -> Iterable[Optional[int]]: ...
def get_user_id(name):
if isinstance(name, str):
return USER_IDS.get(name)
else:
return [USER_IDS.get(item) for item in name]
I ended up here looking for a way to handle a case almost identical to the above, trying to specify different overloads for str
vs Sequence[str]
.
A trivial example:
def foo(value):
if isinstance(value, str):
return len(value)
return [len(x) for x in value]
How can I annotate such a function such that
foo("abc") + 1
max(foo(["a", "bcdefg"]))
are both valid? As far as I can tell, I have to give up and say def foo(value: Sequence[str]) -> Any
.
It's worth noting explicitly that this is distinct from the case in which we want to write
def bar(xs: Sequence[str]):
return max([len(x) for x in xs] or [0])
and forbid xs="abc"
.
I'm not trying to use type checking to forbid using a string -- I'm trying to correctly describe how the types of arguments map to the types of potential return values.
Generalizing beyond strings, it seems like what's wanted is a way of excluding a type from an annotation which would otherwise cover it.
I'd love to see
def foo(value: str) -> int: ...
def foo(value: Excluding[str, Sequence[str]]) -> List[int]: ...
or even for this to be deduced from overloads based on their ordering:
def foo(value: str) -> int: ...
def foo(value: Sequence[str]) -> List[int]: ...
with the meaning that the first annotation takes precedence.
I've also run into this issue. In my case, I have functions that iterate over the given parameter assuming said parameter is a List or Tuple of strings. If a string were passed in by accident, things will break. Ex:
def check_users_exist(users: Union[List[str], Tuple[str]]) -> Tuple[bool, List[Tuple[str, int]]]:
results = []
for user in users:
results.append((user, request.urlopen(f'{addr}/users/{user}').status)) # This breaks if `users` is eg. "username", as we ask for `u`, `s`, etc.
return all([199 < x[1] < 300 for x in results]), results
# Granted, an `isinstance` check would save us here, but it would be great if the type checker could alert us, too.
The annotation I'm using, Union[List[str], Tuple[str]]
, does work for my needs, but it's clunky and goes against the recommendation to use a more generic Sequence[str]
or Iterable[str]
type. Indeed, this recommendation might actually encourage bugs like this example as programmers comply with it, not remembering that str
counts as both Sequence
and Iterable
.
It's possible to hide the problem with a type alias, I suppose, eg. NonStrSequence = Union[List[str], Tuple[str]]
. Or, one could artificially impose a mutability requirement (thereby excluding str
) with MutableSequence
, and this would work if one is fine with also excluding tuples.
Maybe, since it doesn't seem possible to differentiate str
from other Sequence
types by protocol, typing
could have some kind of Not[T]
generic? Like, Union[Sequence[str], Not[str]]
? It's not much nicer to type than what I'm already using, but does at least allow any Sequence
subtype to be used while still ensuring a str
can't be passed in--and not requiring the programmer to explicitly enumerate each and every possible subtype of Sequence
, including unknown custom subtypes.
Unfortunately, I'm not knowledgeable enough to know if this isn't a dumb idea. I just know it would make my life a little easier if it existed 😃
I'm yet another user overspecializing my type annotations to avoid this bug because I was using the more generic Iterable[str]
and then accidentally passed a string. Please give us some way to specify "Iterable of string but not string". Whether it's a new type annotation or applies to Iterable[str]
or Iterable[Text]
doesn't matter much to me, but the fact that this is still impossible 5 years later makes me sad
After re-reading some of the older comments...
I agree that the fundamental problem is that str
is defined in terms of itself. Every element of a str
is always str
. "Turtles all the way down," as someone already observed. This is unique to str
since any element of any other Sequence[T]
is T
, not Sequence[T]
, even if T
happens to implement the Sequence protocols because Sequence
isn't a concrete type like list
, tuple
, or str
. Same for Iterable[T]
.
Maybe this can be leveraged in a hacky sort of way: str
is the only Sequence type in Python that must be homogeneous. Every other Sequence type is fundamentally Sequence[Any]
; homogeneity can only be enforced by agreement. Afaik, the only non-str
Iterable
in the stdlib with enforced homogeneity is array.array
. So, perhaps instead of removing __iter__
from Text
as has been suggested, we add some new protocol to Text
and str
(and array
) to indicate they are a container which may only contain one type. (Edit: I realized this is certainly possible, but won't solve the problem because all it does is make str
a more specialized Sequence
, but it's still a Sequence
.)
I think/agree the real solution is to break the recursion in the identity of str
by making it a container of some other type like char
(or rune
if you like Go's enigmatic name choices). Many languages do this, so I don't think it's an insurmountable technical problem... But I do recognize that it would be a pretty major change to a lot of code in CPython (and every other python implementation), and one which the community is likely to reject. Hell, I don't like this idea because never caring about the subtleties of a string really being an array of integers is one of my favorite things about Python. But, I think anything less is going to be a hack that just hides the problem "most" of the time.
I think it's safe to say that no one seriously thinks that we need python4 right now. And hopefully not for a long time yet.
If this is going to be solved, it will be by improvement to the type system, not a change in language behaviors.
This is unique to
str
since any element of any otherSequence[T]
isT
, notSequence[T]
, even ifT
happens to implement the Sequence protocols becauseSequence
isn't a concrete type likelist
,tuple
, orstr
. Same forIterable[T]
.
I can construct custom types which look similar to str
. Here's a silly example:
class InfiniDict(typing.Mapping[str, "InfiniDict"]):
def __init__(self):
self._data = {}
def __len__(self):
return len(self._data)
def __iter__(self):
return iter(self._data)
def __getitem__(self, key):
if key not in self._data:
self._data[key] = InfiniDict()
return self._data[key]
str
is the most obvious and natural case, but there may be others.
Looking for a good example of a type system which implements negation, here's what I found:
- Rust has a negative trait implementation notation. Not the same as python because traits imply real implementations, but similar.
- Typescript has a complex behavior called Distributive Conditional Types. Among other things, it can implement a filter on a union type.
As far as I can tell, Kotlin, Go, Scala, and Haskell have no relevant analogues to this behavior.
FWIW pytype just gained a new behavior:
""" As of early August 2021, Pytype introduced a check that forbids matching str against the following types to prevent a common accidental bug:
Iterable[str] Sequence[str] Collection[str] Container[str] """
- https://github.com/google/pytype/blob/master/docs/faq.md
Because this is what users mean most of the time.
-gps
On Thu, Aug 19, 2021 at 4:34 PM Stephen Rosen @.***> wrote:
I think it's safe to say that no one seriously thinks that we need python4 right now. And hopefully not for a long time yet.
If this is going to be solved, it will be by improvement to the type system, not a change in language behaviors.
This is unique to str since any element of any other Sequence[T] is T, not Sequence[T], even if T happens to implement the Sequence protocols because Sequence isn't a concrete type like list, tuple, or str. Same for Iterable[T].
I can construct custom types which look similar to str. Here's a silly example:
class InfiniDict(typing.Mapping[str, "InfiniDict"]): def init(self): self._data = {}
def __len__(self): return len(self._data) def __iter__(self): return iter(self._data) def __getitem__(self, key): if key not in self._data: self._data[key] = InfiniDict() return self._data[key]
str is the most obvious and natural case, but there may be others.
Looking for a good example of a type system which implements negation, here's what I found:
- Rust has a negative trait implementation notation https://github.com/rust-lang/rfcs/blob/master/text/0019-opt-in-builtin-traits.md#default-and-negative-impls. Not the same as python because traits imply real implementations, but similar.
- Typescript has a complex behavior called Distributive Conditional Types https://www.typescriptlang.org/docs/handbook/release-notes/typescript-2-8.html#distributive-conditional-types. Among other things, it can implement a filter on a union type.
As far as I can tell, Kotlin, Go, Scala, and Haskell have no relevant analogues to this behavior.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/python/typing/issues/256#issuecomment-902320133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAQXC6OI2FCIYMKEHMX3ZTT5WIHHANCNFSM4CLCTH5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .
Yeah, this was a pytype intern project, and I was pleasantly surprised with how the result turned out. We simply added a special rule to reject matching str
against the specific types Greg listed there and instructed users to annotate things as Union[str, WhateverContainer[str]]
if they really wanted to accept a single str. In the two weeks since launch, this project has flagged ~400 potential type errors, ~30 of which have already been verified to be genuine bugs, and we've gotten 0 user complaints about this new behavior being confusing/undesirable.
The Zen of Python: There should be one-- and preferably only one --obvious way to do it.
Sequence Iterable Collection Tuple ...