awkward
awkward copied to clipboard
Iterating over Records and `ak.from_iter`
Currently, ak.Record.__iter__
returns an iterator over its field names, which may be what a user expects if a Record were a Mapping/MutableMapping like a dict, but it's not. It's probably for the best that it's not, because we don't want keys
, values
, items
methods, particularly since we're trying to consolidate the nomenclature of field names as "fields", rather than "keys". And we especially don't want the non-vectorized __eq__
and __ne__
that Mapping has.
To throw a little more confusion in, ak.Record.__len__
always returns 1 and not the number of fields (the length of what it iterates over). That's a weird enough choice to be considered a bug.
Since v2 is coming up and we can make little backward-incompatible changes like this, perhaps we should just ensure that v2 ak.Record
has no __iter__
and __len__
methods. If they're really needed for something, we can be driven by the use-case as to whether ak.Record
iteration should be over field names, field-value pairs, or something else. At the very least, __len__
should return the length of the __iter__
iterable.
Related to this is how ak.Record
is treated by ak.from_iter
: it naively iterates, which means it gets the field names:
>>> import awkward as ak
>>> ak._v2.from_iter(ak._v2.Record({"a": 1, "b": 2}))
<Array ['a', 'b'] type='2 * string'>
>>> ak._v2.from_iter([ak._v2.Record({"a": 1, "b": 2})])
<Array [['a', 'b']] type='1 * var * string'>
Although iteration is a slow way to build an array and coming from a collection of ak.Record
is suggestive that it's being used when a faster way is possible, the above is almost certainly not what the user had in mind. Changing v2 ak.from_iter
to recognize ak.Record
and build an equivalent thing is an improvement that would be independent of fixing the ak.Record.__iter__
and __len__
behavior, but both should be done.
Edit: also check ArrayBuilder.append
; make sure it's consistent with ak.from_iter
(it might be how ak.from_iter
is implemented...)
I am inclined to agree that we should not push the record-as-mapping protocol here. We'd start getting into the weeds with how NumPy defines e.g. __contains__
and how a pure-Python mapping should etc. So, :+1: from me on making Record
less clever.
The way I read this, this issue contained 2 things: disabling the iteration over records (done now) and fixing the way arrays are built from collections of records while iterating.
I would suggest that we reopen as the second one is not fixed.
@chernals you're correct; this issue did raise two closely related bugs / behavioural problems, and #1725 itself only addressed one of them. However, the building of records during iteration has also been fixed by #1721, which replaced v1 with our v2 submodule. The PR that closed this issue landed after #1721, so it's fixed in main.
There is an argument to be made for backporting this fix (and the separate array-of-records fix in #1725) to the v2 submodule in 1.10.x, but I don't know yet what our criteria are for doing this. In fact, your reply is a good opportunity for @jpivarski to clarify that for me :)
@chernals could you clarify the version of Awkward that you are using? Are you using the v2 submodule from our latest minor 1.10 release?
This change was sufficient to fix both issues in v2 (this is from main
/2.0.0rc1):
>>> import awkward as ak # which is to say, v2
>>> ak.from_iter(ak.Record({"a": 1, "b": 2}))
<Record {a: 1, b: 2} type='{a: int64, b: int64}'>
>>> ak.from_iter([ak.Record({"a": 1, "b": 2})])
<Array [{a: 1, b: 2}] type='1 * {a: int64, b: int64}'>
>>> ak.from_iter([[ak.Record({"a": 1, "b": 2})]])
<Array [[{a: 1, b: 2}]] type='1 * var * {a: int64, b: int64}'>
The question, then, is whether it should be or can be backported into v1. Our stance is that we're only making essential bug-fixes in v1 now. A change in behavior like this might be considered a bug-fix (though the iteration over field names had once been intentional—it just had unforeseen consequences), but it's "not essential" in the sense that there's a work-around.
In fact, we've been making very few changes to v1 in practice for about a year now. (It would be interesting to make a plot of how often the v1 directories were changed in git relative to the v2 directories.)
Thanks for confirming that this is all available in v2 RC, I'll have a look.
We don't yet have an RC out for v2. If you've used pipx
and have docker
installed, you can clone main
and run
# Change XX for e.g. 39 (Python 3.9)
pipx run cibuildwheel . --only cpXX-manylinux_x86_64
Here, cpXX-manylinux_x86_64
is the wheel tag to build.