typing icon indicating copy to clipboard operation
typing copied to clipboard

Possible to distinguish between Sequence[str]/Iterable[str] and str?

Open jtatum opened this issue 7 years ago • 61 comments

If a function expects an iterable of strings, is it possible to forbid passing in a string, since strings are iterable? Seems like there are many cases where this would be an error, but I don't see an obvious way to check 't','h','i','s'.

jtatum avatar Jul 29 '16 23:07 jtatum

Since str is a valid iterable of str this is tricky. Various proposals have been made but they don't fit easily in the type system.

gvanrossum avatar Jul 29 '16 23:07 gvanrossum

I think type should never lie, even if it is a white lie. Either we should remove str.__iter__ (or make it yield something else than strs), or we should allow passing 'abc' into a function expecting Iterable[str]. Of course, I'm for second option.

Or we should have a special type name for "iterable of strings that is not a string". Strings are already special, as AnyStr shows. But although AnyStr is able to be represented using more primitive operations, I think it's too early to introduce a "type difference" operation in general. E.g. co(ntra)variance seems weird in that case.

vedgar avatar Aug 28 '16 05:08 vedgar

The problem I have with allowing Sequence[str] or Iterable[str] to be satisfied by str is that the problem of passing a str in where a sequence of (generally non single character) strs is really intended is a common API misuse that a type checker needs to be able to catch.

People can over-specify their APIs by requiring List[str] or Tuple[str] as input instead of the more general sequence or iterable but this is unnatural when teaching people how to type annotate. We'd prefer to just tell everyone to always prefer Iterable or Sequence on input.

Random thought: Would it be possible for our "magic" Text type to lose it's __iter__? So that Iterable[Text] works as desired and forbids a lone str argument?

gpshead avatar Nov 20 '18 20:11 gpshead

It requires more work on the part of API authors, but one option that might be less of a lie is to be able to delete an overload. C++ has a similar problem, where a type being passed in might "work" but you want to forbid it. So they implemented a special overload that, if matched, causes an error. See e.g. this SO thread.

Then one could define the API for Iterable[str], and delete the overload for str.

ssbr avatar Nov 20 '18 20:11 ssbr

Mypy doesn't currently have a way to remove methods in a subclass, because it would fail Liskov. But there's a hack possible. You can change the signature of a method override in a way that violates Liskov, and then add a # type: ignore to prevent mypy from complaining. Mypy will then check uses according to the override! So maybe something like this (untested) could be made to work:

class Text(Sequence[str]):
    def __iter__(self) -> None: ...  # type: ignore

gvanrossum avatar Nov 20 '18 20:11 gvanrossum

It actually doesn't work. Because currently there is a rule in mypy: "nominal first" (for various important reasons), if something works using nominal subtyping, then mypy just uses it. In this case Text is still a nominal subtype of Sequence[str].

ilevkivskyi avatar Nov 20 '18 20:11 ilevkivskyi

Hm... Maybe Text could be a Protocol that has the same methods as Sequence except for one?

gvanrossum avatar Nov 20 '18 20:11 gvanrossum

Maybe, TBH I am still not sure what are the costs/benefits here. I am afraid making such big changes in typeshed can break many existing code. But on the other hand if someone wants to do this "locally" it should be a fine solution.

ilevkivskyi avatar Nov 20 '18 21:11 ilevkivskyi

A relatively simple approach would be to special case str vs. Iterable[str] / Sequence[str] compatibility in a type checker. This behavior could be enabled through a strictness option. This issue seems quite specific to str (and unicode) so anything more drastic may not be worth it.

JukkaL avatar Nov 20 '18 21:11 JukkaL

If we assume the type checker has reasonable good dead code analysis capabilities, we could get a solution that's pretty similar to the one C++ has for free by combining @overload and NoReturn. For example:

from typing import Iterable, overload, NoReturn

@overload
def foo(x: str) -> NoReturn: ...
@overload
def foo(x: Iterable[str]) -> int: ...
def foo(x):
    if isinstance(x, str):
        raise Exception()
    else:
        return 1

def main() -> None:
    x = foo(["hello", "world"])
    reveal_type(x)

    # Maybe report dead code?
    y = foo("hello")
    reveal_type(y)

That said, idk if any type checkers actually do handle this case gracefully. Mypy, for example, will just silently ignore the last reveal_type (and warn that y needs an annotation).

Maybe to help this analysis, we could add some sort of ShouldNeverBeEncountered type? Type checkers could add a special-case that reports an error whenever they see some function call evaluates to this type, but otherwise treat it as being identical to NoReturn.

It's not a perfect solution since there's still no definitive way of telling if an Iterable[str] is a str or not, but it'd at least give library authors a way to catch some of the more obvious misuses w/o requiring their users to use a special Text-like protocol.

Michael0x2a avatar Nov 20 '18 22:11 Michael0x2a

Given the norm for most APIs is to accept the iterable and never want plain str we should aim to support that as a trivial annotation that doesn't involve multiple defs and overloading. the oddball situation where someone wants to accept the iterable and plain str should be the complicated one if complexity is needed.

gpshead avatar Nov 21 '18 01:11 gpshead

I think we're trying to expand type hints beyond their original purpose, and it shows. If I say a_string.rstrip('abc'), the function is going to work perfectly. It will, according to its specification, produce a "copy" of a_string, from which all as, bs and cs are removed at the end. There isn't going to be any "hidden type errors", "accidental mechanisms" or "unintended consequences" that the type hints are usually trying to prevent. Mypy has nothing to do here. 'abc' is just a compact way to write an iterable of strs, that yields 'a', 'b' and 'c' in that order, and then stope. What you're now trying to do is go beyond "do types match" (they do, absolutely) into "did the caller really intend to write this". And that is a dangerous crossing of responsibility boundaries. Yes, there is a sentence in PEP 484 about mypy being "a powerful linter", but I really think noone wanted mypy to take over all responsibilities of a linter. At least I hope so. In short: is passing a str as an Iterable[str] a common error? I don't know (in my experience it is not, but of course you have more experience). Does it need to be flagged by a linter? Probably. Are type hints the right way to catch it? NO.

vedgar avatar Nov 21 '18 06:11 vedgar

Maybe Text could be a Protocol that has the same methods as Sequence except for one?

Unfortunately this would make Text incompatible with str and would generally break typeshed and existing annotations.

JukkaL avatar Nov 21 '18 10:11 JukkaL

I like the idea of special-casing strings in the tool rather than in the type system, since as @gvanrossum notes, str is an iterable of str (turtles all the way!). Also this sort of type-aware linting is a neat idea, and could be done relatively easily within the typechecker because we have all the information at hand.

martindemello avatar Nov 21 '18 21:11 martindemello

all the information at hand

Do we? When I see a function that takes an Iterable[str] or Sequence[str] -- how do we know it is meant to exclude str? Or do we just assume it is always excluded?

gvanrossum avatar Nov 21 '18 21:11 gvanrossum

I was thinking always excluded; I've run into problems in both python and other languages where a function expecting an iterable was passed a string, and never (that I can think of) actually wanted a generic iterable to treat a string as an iterable of chars. That should hold even more strongly if the function specifies Iterable[str]; it is a good hint that str is being viewed as an atomic type there.

martindemello avatar Nov 21 '18 21:11 martindemello

Hm, I guess you could add it back explicitly by saying Union[str, Iterable[str]].

Would this extend to e.g. Iterable[AnyStr]?

gvanrossum avatar Nov 21 '18 22:11 gvanrossum

I think so, yes; I want to say that str|bytes|unicode should not satisfy Iterable[anything] if the flag is passed in.

martindemello avatar Nov 21 '18 22:11 martindemello

In short: is passing a str as an Iterable[str] a common error?

Yes. We have seen this specific bug multiple independent times at work. Unfortunately more than once after deployment in production.

I consider it a motivating anti-pattern for a type checker to help avoid. No other tool can validate this, it requires type information. Lets not be purists here. It improves developer productivity and code maintainability to flag this and we have a way to explicitly annotate the less common APIs that want to accept both. :)

Requiring such APIs to specify Union[str, Iterable[str]] is a good example of explicit is better than implicit.

gpshead avatar Nov 21 '18 22:11 gpshead

We have seen this specific bug multiple independent times at work.

I recall about how Rob Pike (who famously has just 'r' as his username) once got spammed when some script that sent email invoked an email-sending API with a single email address instead of a list.

gvanrossum avatar Nov 21 '18 22:11 gvanrossum

If we're going to go EIBTI route, why not be explicit where it counts?

for char in a_string.chars():

It would also help in distinguishing iterating through combined characters (graphemes), and be almost analogous to iterating through words with .split() and lines with .splitlines().

While we're at it, I would be very happy with for line in a_file.lines(), again giving the ability to be explicit with a_file.records(sep=...) or a_file.blocks(size=...).


Yes, I know what the response is going to be. But if we really don't want to change the language, maybe it really is not the problem of the language as a whole, but of a specific API. And there I don't see any problem with writing

if isinstance(an_arg, str):
    raise ValueError('an_arg is supposed to be an iterable of non-single-character strings')

Again, it's not the type that's wrong (although you can raise TypeError above if you want:). Analogy: there are many functions that declaratively accept int, but in fact work only with nonnegative numbers. In fact, I think there are more such functions than the ones that work out of the box with negative integers. Are we going to redefine that an annotation n: int really means a nonnegative integer, and require people who want int to mean int to jump through hoops?

vedgar avatar Nov 22 '18 05:11 vedgar

I found this thread because I am looking for a way to annotate some code like below:

from typing import Iterable, overload, Optional


USER_IDS = {'Alice': 1, 'Bob': 2, 'Carol': 3}


@overload
def get_user_id(name: str) -> Optional[int]: ...


@overload
def get_user_id(name: Iterable[str]) -> Iterable[Optional[int]]: ...


def get_user_id(name):
    if isinstance(name, str):
        return USER_IDS.get(name)
    else:
        return [USER_IDS.get(item) for item in name]

Currently, mypy (v0.730) gives error: Overloaded function signatures 1 and 2 overlap with incompatible return types

Not sure if anyone suggested this before, perhaps we can add a "negative" or "difference" type. Similar to Union that is an analogy to the set operator |, Diff[A, B] corresponds to the - operator, which matches anything that is type A but not type B.

Having the Diff type, we can annotate the above code as:

from typing import Diff, Iterable, overload, Optional


USER_IDS = {'Alice': 1, 'Bob': 2, 'Carol': 3}


@overload
def get_user_id(name: str) -> Optional[int]: ...


@overload
def get_user_id(name: Diff[Iterable[str], str]) -> Iterable[Optional[int]]: ...


def get_user_id(name):
    if isinstance(name, str):
        return USER_IDS.get(name)
    else:
        return [USER_IDS.get(item) for item in name]

yhlam avatar Jan 14 '20 09:01 yhlam

I ended up here looking for a way to handle a case almost identical to the above, trying to specify different overloads for str vs Sequence[str]. A trivial example:

def foo(value):
    if isinstance(value, str):
        return len(value)
    return [len(x) for x in value]

How can I annotate such a function such that

foo("abc") + 1
max(foo(["a", "bcdefg"]))

are both valid? As far as I can tell, I have to give up and say def foo(value: Sequence[str]) -> Any.

It's worth noting explicitly that this is distinct from the case in which we want to write

def bar(xs: Sequence[str]):
    return max([len(x) for x in xs] or [0])

and forbid xs="abc".

I'm not trying to use type checking to forbid using a string -- I'm trying to correctly describe how the types of arguments map to the types of potential return values.


Generalizing beyond strings, it seems like what's wanted is a way of excluding a type from an annotation which would otherwise cover it.

I'd love to see

def foo(value: str) -> int: ...
def foo(value: Excluding[str, Sequence[str]]) -> List[int]: ...

or even for this to be deduced from overloads based on their ordering:

def foo(value: str) -> int: ...
def foo(value: Sequence[str]) -> List[int]: ...

with the meaning that the first annotation takes precedence.

sirosen avatar Jul 01 '20 22:07 sirosen

I've also run into this issue. In my case, I have functions that iterate over the given parameter assuming said parameter is a List or Tuple of strings. If a string were passed in by accident, things will break. Ex:

def check_users_exist(users: Union[List[str], Tuple[str]]) -> Tuple[bool, List[Tuple[str, int]]]:
    results = []
    for user in users:
        results.append((user, request.urlopen(f'{addr}/users/{user}').status))  # This breaks if `users` is eg. "username", as we ask for `u`, `s`, etc.
    return all([199 < x[1] < 300 for x in results]), results
# Granted, an `isinstance` check would save us here, but it would be great if the type checker could alert us, too.

The annotation I'm using, Union[List[str], Tuple[str]], does work for my needs, but it's clunky and goes against the recommendation to use a more generic Sequence[str] or Iterable[str] type. Indeed, this recommendation might actually encourage bugs like this example as programmers comply with it, not remembering that str counts as both Sequence and Iterable.

It's possible to hide the problem with a type alias, I suppose, eg. NonStrSequence = Union[List[str], Tuple[str]]. Or, one could artificially impose a mutability requirement (thereby excluding str) with MutableSequence, and this would work if one is fine with also excluding tuples.

Maybe, since it doesn't seem possible to differentiate str from other Sequence types by protocol, typing could have some kind of Not[T] generic? Like, Union[Sequence[str], Not[str]]? It's not much nicer to type than what I'm already using, but does at least allow any Sequence subtype to be used while still ensuring a str can't be passed in--and not requiring the programmer to explicitly enumerate each and every possible subtype of Sequence, including unknown custom subtypes.

Unfortunately, I'm not knowledgeable enough to know if this isn't a dumb idea. I just know it would make my life a little easier if it existed 😃

MajorDallas avatar Apr 14 '21 17:04 MajorDallas

I'm yet another user overspecializing my type annotations to avoid this bug because I was using the more generic Iterable[str] and then accidentally passed a string. Please give us some way to specify "Iterable of string but not string". Whether it's a new type annotation or applies to Iterable[str] or Iterable[Text] doesn't matter much to me, but the fact that this is still impossible 5 years later makes me sad

GMNGeoffrey avatar Jul 22 '21 23:07 GMNGeoffrey

After re-reading some of the older comments...

I agree that the fundamental problem is that str is defined in terms of itself. Every element of a str is always str. "Turtles all the way down," as someone already observed. This is unique to str since any element of any other Sequence[T] is T, not Sequence[T], even if T happens to implement the Sequence protocols because Sequence isn't a concrete type like list, tuple, or str. Same for Iterable[T].

Maybe this can be leveraged in a hacky sort of way: str is the only Sequence type in Python that must be homogeneous. Every other Sequence type is fundamentally Sequence[Any]; homogeneity can only be enforced by agreement. Afaik, the only non-str Iterable in the stdlib with enforced homogeneity is array.array. So, perhaps instead of removing __iter__ from Text as has been suggested, we add some new protocol to Text and str (and array) to indicate they are a container which may only contain one type. (Edit: I realized this is certainly possible, but won't solve the problem because all it does is make str a more specialized Sequence, but it's still a Sequence.)

I think/agree the real solution is to break the recursion in the identity of str by making it a container of some other type like char (or rune if you like Go's enigmatic name choices). Many languages do this, so I don't think it's an insurmountable technical problem... But I do recognize that it would be a pretty major change to a lot of code in CPython (and every other python implementation), and one which the community is likely to reject. Hell, I don't like this idea because never caring about the subtleties of a string really being an array of integers is one of my favorite things about Python. But, I think anything less is going to be a hack that just hides the problem "most" of the time.

MajorDallas avatar Aug 19 '21 20:08 MajorDallas

I think it's safe to say that no one seriously thinks that we need python4 right now. And hopefully not for a long time yet.

If this is going to be solved, it will be by improvement to the type system, not a change in language behaviors.

This is unique to str since any element of any other Sequence[T] is T, not Sequence[T], even if T happens to implement the Sequence protocols because Sequence isn't a concrete type like list, tuple, or str. Same for Iterable[T].

I can construct custom types which look similar to str. Here's a silly example:

class InfiniDict(typing.Mapping[str, "InfiniDict"]):
    def __init__(self):
        self._data = {}

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        if key not in self._data:
            self._data[key] = InfiniDict()
        return self._data[key]

str is the most obvious and natural case, but there may be others.


Looking for a good example of a type system which implements negation, here's what I found:

As far as I can tell, Kotlin, Go, Scala, and Haskell have no relevant analogues to this behavior.

sirosen avatar Aug 19 '21 23:08 sirosen

FWIW pytype just gained a new behavior:

""" As of early August 2021, Pytype introduced a check that forbids matching str against the following types to prevent a common accidental bug:

Iterable[str] Sequence[str] Collection[str] Container[str] """

  • https://github.com/google/pytype/blob/master/docs/faq.md

Because this is what users mean most of the time.

-gps

On Thu, Aug 19, 2021 at 4:34 PM Stephen Rosen @.***> wrote:

I think it's safe to say that no one seriously thinks that we need python4 right now. And hopefully not for a long time yet.

If this is going to be solved, it will be by improvement to the type system, not a change in language behaviors.

This is unique to str since any element of any other Sequence[T] is T, not Sequence[T], even if T happens to implement the Sequence protocols because Sequence isn't a concrete type like list, tuple, or str. Same for Iterable[T].

I can construct custom types which look similar to str. Here's a silly example:

class InfiniDict(typing.Mapping[str, "InfiniDict"]): def init(self): self._data = {}

def __len__(self):
    return len(self._data)

def __iter__(self):
    return iter(self._data)

def __getitem__(self, key):
    if key not in self._data:
        self._data[key] = InfiniDict()
    return self._data[key]

str is the most obvious and natural case, but there may be others.

Looking for a good example of a type system which implements negation, here's what I found:

As far as I can tell, Kotlin, Go, Scala, and Haskell have no relevant analogues to this behavior.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/python/typing/issues/256#issuecomment-902320133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAQXC6OI2FCIYMKEHMX3ZTT5WIHHANCNFSM4CLCTH5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

gpshead avatar Aug 20 '21 01:08 gpshead

Yeah, this was a pytype intern project, and I was pleasantly surprised with how the result turned out. We simply added a special rule to reject matching str against the specific types Greg listed there and instructed users to annotate things as Union[str, WhateverContainer[str]] if they really wanted to accept a single str. In the two weeks since launch, this project has flagged ~400 potential type errors, ~30 of which have already been verified to be genuine bugs, and we've gotten 0 user complaints about this new behavior being confusing/undesirable.

rchen152 avatar Aug 20 '21 01:08 rchen152

The Zen of Python: There should be one-- and preferably only one --obvious way to do it.

Sequence Iterable Collection Tuple ...

iperov avatar Sep 10 '21 12:09 iperov