Should a string of length N match a list pattern of that length?
For example,
match "ab":
    case [a, b]: print("Two values:", a, b)
    case c: print("One value:", c)
Which branch should it take? My current prototype chooses case c. My reasoning is that many people have complained over the years that strings being iterable is a pain in various contexts, and I suspect this is one of those contexts.
Yes, I would also want my pattern matcher to choose case c. Having given it some thought, I would find it strange if "ab" matched [a, b]. Strings are not only sequences but also atomic values (their special role can also be seen in things like s = s[0], which for strings of length one ends up being circular).
In principle, we could allow "reverse concatenation" for string patterns, if that is needed/desired. For instance, case "a" + x + "c": would match any string that starts with "a" and ends in "c". On the other hand, we already have very strong string matching capabilities with regular expressions, so I believe we can just leave strings as atomic values.
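For what it's worth, the hypothetical case "a" + x + "c": pattern can already be expressed today with a regular expression (the helper name below is mine, just for illustration):

```python
import re

def match_a_x_c(s):
    """Sketch: emulate the hypothetical pattern "a" + x + "c" with a regex.

    Returns the captured middle part, or None if the string doesn't
    start with "a" and end with "c".
    """
    m = re.fullmatch(r"a(.*)c", s, flags=re.DOTALL)
    return m.group(1) if m else None
```

So match_a_x_c("abc") captures "b", while a non-matching string yields None.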
The only reason I bring this up is that in iterable unpacking,
[a, b] = "ab"
does work (and leaves a = 'a' and b = 'b').
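For reference, a minimal demonstration of the existing unpacking behavior being contrasted here:

```python
# Iterable unpacking treats a string as a sequence of characters.
[a, b] = "ab"
assert (a, b) == ("a", "b")

# Tuple-style targets behave the same way.
x, y = "cd"
assert (x, y) == ("c", "d")
```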
Also, let's not do anything for string matching. We've shown time and again that you can't beat regular expressions even though they are nearly universally loathed.
Were I making a language today, I would not make strings implicitly iterable. Rather, I would have explicit .chars(), .words(), .codepoints() iteration methods.
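As an illustration only (these method names are hypothetical, describing a language that doesn't exist, not a proposal for Python), explicit iteration helpers might look like:

```python
def chars(s):
    # Explicit per-character iteration, instead of strings being
    # implicitly iterable.
    return iter(s)

def words(s):
    # Explicit whitespace-separated word iteration.
    return iter(s.split())

assert list(chars("ab")) == ["a", "b"]
assert list(words("hello world")) == ["hello", "world"]
```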
I think we’re all in agreement. Let’s just make sure we have a clearly argued case for doing things differently than for unpacking assignments.
I noticed that the PEP also includes bytes and bytearray in this restriction. I can see the reasoning for str, but are we sure that restricting these two here has any benefit (in particular, bytearray)?
I think it's probably okay if the answer is just that it's common to lump these three together!
(FWIW I wrote that in the PEP. :-)
Since b'ab' == bytearray(b'ab') it would be odd to include one but not the other, and the reason to exclude str is just as valid for bytes, so I think they should all three be excluded. (Honestly I think we should probably exclude memoryview too, but I could go either way on that.)
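A quick demonstration of the equalities being relied on here:

```python
# bytes and bytearray compare equal when their contents match,
# so excluding one but not the other would be surprising.
assert b"ab" == bytearray(b"ab")

# memoryview also compares equal to the underlying bytes, which is
# why it gets mentioned in the same breath.
assert memoryview(b"ab") == b"ab"
```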
Okay, that's good enough for me!
Regarding the actual spec:
Note that to match a sequence pattern the target must be an instance of collections.abc.Sequence, and it cannot be any kind of string (str, bytes, bytearray). It cannot be an iterator.
I think requiring inheritance from / registering with Sequence is sort of an odd requirement, and makes it non-trivial for C extension authors to get matching behavior. It also complicates our implementation.
How do we feel about changing it slightly? Something fast and simple, like:
Note that to match a sequence pattern the target must have a __getitem__ method, and cannot inherit from dict, str, bytes, or bytearray. It cannot be an iterator.
In C:
if (
    !PySequence_Check(target)
    || PyIter_Check(target)
    || PyObject_TypeCheck(target, &PyUnicode_Type)
    || PyObject_TypeCheck(target, &PyBytes_Type)
    || PyObject_TypeCheck(target, &PyByteArray_Type)
) {
    // No sequence match.
}
else {
    // Do match using PyObject_GetIter(target).
}
In Python:
if (
    not hasattr(target, "__getitem__")
    or hasattr(target, "__next__")
    or isinstance(target, (dict, str, bytes, bytearray))
):
    ...  # No sequence match.
else:
    ...  # Do match using iter(target).
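A quick sketch of how that Python-level check would classify common targets (the helper name is mine, not part of the proposal):

```python
def matches_sequence_pattern(target):
    # Sketch of the proposed check: has __getitem__, is not an
    # iterator, and is not one of the excluded builtin types.
    return (
        hasattr(target, "__getitem__")
        and not hasattr(target, "__next__")
        and not isinstance(target, (dict, str, bytes, bytearray))
    )

assert matches_sequence_pattern([1, 2])          # list: matches
assert matches_sequence_pattern((1, 2))          # tuple: matches
assert not matches_sequence_pattern("ab")        # str: excluded
assert not matches_sequence_pattern({"a": 1})    # dict: excluded
assert not matches_sequence_pattern(iter([1]))   # iterator: excluded
```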
Regarding the actual spec:
Note that to match a sequence pattern the target must be an instance of collections.abc.Sequence, and it cannot be any kind of string (str, bytes, bytearray). It cannot be an iterator.
I think requiring inheritance from / registering with Sequence is sort of an odd requirement, and makes it non-trivial for C extension authors to get matching behavior. It also complicates our implementation.
That's not a strong argument -- Python generally goes out of its way to make things do the right thing even if it is harder to implement.
How do we feel about changing it slightly? Something fast and simple, like:
Note that to match a sequence pattern the target must have a __getitem__ method, and cannot inherit from dict, str, bytes, or bytearray. It cannot be an iterator.
This would still do the wrong thing for mappings that don't inherit from dict. In some places a check for keys is added. But checking for collections.abc.Sequence vs. collections.abc.Mapping is the right thing to do.
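A sketch of the ABC-based check, showing how it handles a mapping that doesn't inherit from dict (which the __getitem__-based check above would wrongly accept):

```python
from collections.abc import Mapping, Sequence

def matches_sequence_pattern(target):
    # ABC-based check: a Sequence that is not any kind of string.
    return (
        isinstance(target, Sequence)
        and not isinstance(target, (str, bytes, bytearray))
    )

class MyMapping(Mapping):
    # A mapping that does not inherit from dict. It has __getitem__,
    # but it does not register as a Sequence, so it is correctly
    # rejected by the ABC-based check.
    def __init__(self, data):
        self._data = dict(data)
    def __getitem__(self, key):
        return self._data[key]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

assert matches_sequence_pattern([1, 2])
assert not matches_sequence_pattern("ab")
assert not matches_sequence_pattern(MyMapping({"a": 1}))
```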
Python generally goes out of its way to make things do the right thing even if it is harder to implement... checking for collections.abc.Sequence vs. collections.abc.Mapping is the right thing to do.
Yep, and implementation aside, the concept is easy to explain. Great points.
I've got this working in my most recent commit. I'm not sure the best way to do this, but I figure storing lazily-imported references to Sequence and Mapping on the PyInterpreterState struct is a straightforward solution. Do you mind looking it over? This is really starting to get into the belly of the beast, and I have no idea if there will be some weird interaction with subinterpreters or something:
https://github.com/python/cpython/commit/f3c513d3647360e32ec4009d29c7bed4a78c1455
Sequence seems to be loading lazily, caching, and working correctly.
On first blush that looks fine. Usually another core dev eventually fixes inefficiencies in my C code (e.g. Serhiy) so don't worry too much about it.