awkward Iterating over Records and `ak.from

Currently, ak.Record.__iter__ returns an iterator over its field names, which may be what a user expects if a Record were a Mapping/MutableMapping like a dict, but it's not. It's probably for the best that it's not, because we don't want keys, values, items methods, particularly since we're trying to consolidate the nomenclature of field names as "fields", rather than "keys". And we especially don't want the non-vectorized __eq__ and __ne__ that Mapping has.

To throw a little more confusion in, ak.Record.__len__ always returns 1 and not the number of fields (the length of what it iterates over). That's a weird enough choice to be considered a bug.

Since v2 is coming up and we can make little backward-incompatible changes like this, perhaps we should just ensure that v2 ak.Record has no __iter__ and __len__ methods. If they're really needed for something, we can be driven by the use-case as to whether ak.Record iteration should be over field names, field-value pairs, or something else. At the very least, __len__ should return the length of the __iter__ iterable.

Related to this is how ak.Record is treated by ak.from_iter: it naively iterates, which means it gets the field names:

>>> import awkward as ak
>>> ak._v2.from_iter(ak._v2.Record({"a": 1, "b": 2}))
<Array ['a', 'b'] type='2 * string'>
>>> ak._v2.from_iter([ak._v2.Record({"a": 1, "b": 2})])
<Array [['a', 'b']] type='1 * var * string'>

Although iteration is a slow way to build an array and coming from a collection of ak.Record is suggestive that it's being used when a faster way is possible, the above is almost certainly not what the user had in mind. Changing v2 ak.from_iter to recognize ak.Record and build an equivalent thing is an improvement that would be independent of fixing the ak.Record.__iter__ and __len__ behavior, but both should be done.

Edit: also check ArrayBuilder.append; make sure it's consistent with ak.from_iter (it might be how ak.from_iter is implemented...)

Aug 10 '22 15:08 jpivarski

I am inclined to agree that we should not push the record-as-mapping protocol here. We'd start getting into the weeds with how NumPy defines e.g. __contains__ and how a pure-Python mapping should etc. So, :+1: from me on making Record less clever.

Aug 11 '22 12:08 agoose77

The way I read this, this issue contained 2 things: disabling the iteration over records (done now) and fixing the way arrays are built from collections of records while iterating.

I would suggest that we reopen as the second one is not fixed.

Sep 26 '22 09:09 chernals

@chernals you're correct; this issue did raise two closely related bugs / behavioural problems, and #1725 itself only addressed one of them. However, the building of records during iteration has also been fixed by #1721, which replaced v1 with our v2 submodule. The PR that closed this issue landed after #1721, so it's fixed in main.

There is an argument to be made for backporting this fix (and the separate array-of-records fix in #1725) to the v2 submodule in 1.10.x, but I don't know yet what our criteria are for doing this. In fact, your reply is a good opportunity for @jpivarski to clarify that for me :)

@chernals could you clarify the version of Awkward that you are using? Are you using the v2 submodule from our latest minor 1.10 release?

Sep 26 '22 09:09 agoose77

This change was sufficient to fix both issues in v2 (this is from main/2.0.0rc1):

>>> import awkward as ak   # which is to say, v2
>>> ak.from_iter(ak.Record({"a": 1, "b": 2}))
<Record {a: 1, b: 2} type='{a: int64, b: int64}'>
>>> ak.from_iter([ak.Record({"a": 1, "b": 2})])
<Array [{a: 1, b: 2}] type='1 * {a: int64, b: int64}'>
>>> ak.from_iter([[ak.Record({"a": 1, "b": 2})]])
<Array [[{a: 1, b: 2}]] type='1 * var * {a: int64, b: int64}'>

The question, then, is whether it should be or can be backported into v1. Our stance is that we're only making essential bug-fixes in v1 now. A change in behavior like this might be considered a bug-fix (though the iteration over field names had once been intentional—it just had unforeseen consequences), but it's "not essential" in the sense that there's a work-around.

In fact, we've been making very few changes to v1 in practice for about a year now. (It would be interesting to make a plot of how often the v1 directories were changed in git relative to the v2 directories.)

Sep 26 '22 13:09 jpivarski

Thanks for confirming that this is all available in v2 RC, I'll have a look.

Sep 28 '22 11:09 chernals

We don't yet have an RC out for v2. If you've used pipx and have docker installed, you can clone main and run

# Change XX for e.g. 39 (Python 3.9)
pipx run cibuildwheel . --only cpXX-manylinux_x86_64

Here, cpXX-manylinux_x86_64 is the wheel tag to build.

Sep 28 '22 11:09 agoose77

awkward
awkward copied to clipboard

Iterating over Records and `ak.from_iter`

awkward awkward copied to clipboard

Iterating over Records and `ak.from_iter`

awkward
awkward copied to clipboard