awkward icon indicating copy to clipboard operation
awkward copied to clipboard

transform, but not caring about broadcasting regular dimensions

Open ivirshup opened this issue 1 year ago • 10 comments

Description of new feature

The request

I would like to do recursive iteration over multiple arrays without having to have aligned regular dimensions (especially the first).

An example

I would like to take multiple arrays, and subset their records to a common set of keys. Here's a quick example of how I would expect to do this:

from operator import and_
from functools import reduce

import awkward._v2 as ak

def intersect(layouts: list[ak.contents.Content], **kwargs):
    if all(l.is_RecordType for l in layouts):
        common_keys = list(reduce(and_, (set(l.fields) for l in layouts)))
        return [l[common_keys] for l in layouts]

This works since the arrays are of similar length:

awk_a = ak.Array([
    {"a": [1, 2, 3], "b": [1, 2]},
    {"a": [4, 5], "b": [3, 4]},
    {"a": [6], "b": [5]},
])

awk_b = ak.Array([
    {"a": [1, 2, 3]}, 
    {"a": [4, 5]},
    {"a": [6]},
])

# works with these arrays
ak.transform(intersect, awk_a, awk_b)
# (<Array [{a: [1, 2, 3]}, {...}, {a: [6]}] type='3 * {a: var * int64}'>,
#  <Array [{a: [1, 2, 3]}, {...}, {a: [6]}] type='3 * {a: var * int64}'>)

This does not work if the first dimension for the arrays is not the same size.

awk_c = ak.Array([
    {"a": [1, 2, 3], "b": [1, 2]},
    {"a": [4, 5], "b": [3, 4]},
    {"a": [6], "b": [5]},
])

awk_d = ak.Array([
    {"a": [4, 5]},
    {"a": [6]},
])

ak.transform(intersect, awk_c, awk_d)
ValueError: while calling (from /var/folders/bd/43q20k0n6z15tdfzxvd22r7c0000gn/T/ipykernel_76685/4019137656.py, line 13)
...
Error details: cannot broadcast RegularArray of size 2 with RegularArray of size 3 in ak._v2.transform

ivirshup avatar Sep 05 '22 16:09 ivirshup

I'm following what it is you're trying to do from the other thread. But you don't want to use this tool (ak.transform) here: the fundamental idea is that it broadcasts the arrays so that element-by-element operations are possible, and you don't want to do an element-by-element operation. You do want to walk down both array trees (awk_a and awk_b), but it's the fields of the RecordArrays you'll be aligning, i.e. the columns, not the rows.

Maybe the easier way to do this is to do conventional recursion (as discussed in the other thread, with handlers for every Content subclass).

jpivarski avatar Sep 05 '22 18:09 jpivarski

@ivirshup could you elaborate on what your constraints are here? Do you need the dimensions up to the record content to match? (i.e. can you have var * {'x': ...} and var * var * {'x': ...}?) If not, you can just take the intersection of all of their fields with set.intersection(*[set(array.fields) for array in (awk_a, awk_b, ...)])

agoose77 avatar Sep 05 '22 18:09 agoose77

@jpivarski

I'm not so sure it's different. If I can broadcast when the second dimension has a variable length, why should the first dimension have to have the same length?

This raised another question for me. I could wrap each array in an outer array of length one containing a variable length array. But this throws an error about not being able to handle nested lists. In some cases this could probably be flattened, then unflattened, but I'm not sure I understand why only one variable length dimension would be allowed.

More practically, I do want to be recursing across multiple arrays, but I don't think we had an example of this in a previous discussion. I do want the logic that says recursing into a Record and entries of a list are different and should be an error, but don't necessarily want to broadcast the values.


@agoose77, I would like the dimensions to be matched up, which is why I've been thinking of it in terms of the transformation.

This is mainly about the discussion:

  • https://github.com/scikit-hep/awkward/discussions/1647

ivirshup avatar Sep 05 '22 19:09 ivirshup

In this case, it's complaining because the outer length is different, but it would complain at every level about every list in awk_a that does not match the length of the corresponding list in awk_b. That synchronization of list lengths is intrinsic to what ak.transform does, having an option that opts out of it would be like a new function.

We haven't had a need to do this type of recursion yet, so it's too early to try to make it into a generalized function, similar to (or accessed through an option within) ak.transform. That came from generalizing several similar use-cases, and this new one of not broadcasting is the only of its kind. We should start by writing a recursive function that does just this one thing. Then if we see similar cases, see what's similar about them and abstract from there.

jpivarski avatar Sep 05 '22 19:09 jpivarski

@ivirshup so, same dimensions, but we don't care about the sizes of those dimensions? This sounds like a custom recursion, as Jim suggests :)

agoose77 avatar Sep 05 '22 20:09 agoose77

For this case, I think I just don't care about the size of the first dimension. I would care about the sizes of all the following dimensions.

Similarly, if I wanted to concatenate some awkward arrays which had a regular second dimension along that dimension, I wouldn't care what the shape of the second dimension was for those arrays. I would care that all the other dimensions matched though.

I'm not very into the idea of implementing this myself and outside awkward array since I suspect there are edge cases I'm not expecting (like #1672).

ivirshup avatar Sep 05 '22 20:09 ivirshup

OK. It sounds like you're trying to predict the outcome of ak.concatenate before applying it? I think Jim's suggestion that we fix these bugs so that you don't have to deal with them is the best bet. I'm sure that we can make a pre-release once these fixes are in to enable you to keep moving forward.

agoose77 avatar Sep 05 '22 21:09 agoose77

I'm not very into the idea of implementing this myself

I wasn't suggesting that; I think we'll be able to help you with this function (if it's needed).

Similarly, if I wanted to concatenate some awkward arrays which had a regular second dimension along that dimension, I wouldn't care what the shape of the second dimension was for those arrays. I would care that all the other dimensions matched though.

Broadcasting will complain if variable-length lists are not all equal to each other. If I'm understanding the purpose of this function (to avoid unions when concatenating), you might want the regular dimensions to match, but not every list of the irregular dimensions. In fact, if the first dimension (length) doesn't match, it's not even possible to compare all the lengths of lists in two arrays (ak.num(a) == ak.num(b) is not possible to check if len(a) != len(b)).


Thinking back to the larger problem of wanting to give all records the same set of fields so that they concatenate without unions, what about a procedure like the following?

  1. Get the full set of fields with set(ak.fields(a) + ak.fields(b) + ...).
  2. Recurse over each array, a, b, ..., independently and ensure that each one has the full set of fields.
  3. To add a field to a RecordArray, associate the field name with a new IndexedOptionArray of EmptyArray (i.e. type is optional[unknown]). The index of the IndexedOptionArray has the same length as the RecordArray and is filled with -1 (i.e. all values are None).

Step 2 can be done with ak.transform, with only one array as the argument. It can ignore all types except for RecordArray. Upon finding a RecordArray, it can return a new RecordArray with missing fields added as None (step 3).

Step 1 is only searching down to the first level of records—if there are records inside of records, then this whole procedure has to recurse (calling ak.transform from within the transformation function). That complicates things, but it doesn't require more knowledge of Awkward corner-cases: even the nested procedure only needs to recognize RecordArray.

jpivarski avatar Sep 05 '22 21:09 jpivarski

Broadcasting will complain if variable-length

This is starting to make sense to me. I thing what I was going for was really broadcasting the types of the dimensions, not the values. However, I'm still a little unsure which "kind of types" I would want to be looking at.

Thinking back to the larger problem of wanting to give all records the same set of fields so that they concatenate without unions

I am aiming to be able to do both the intersection and union of fields. But intersection is generally easier.

what about a procedure like the following?

I'm not sure this would work since dimension matching isn't being accounted for. For instance, this procedure could take n * var * {a: int64} and m * var * var * {a: int64, b: string} and give n * var * {a: int64, b: ?string}, m * var * var * {a: int64, b: ?string}. But concatenating these arrays would still form a union because the mismatched var dimensions.

Also, I think these functions should satisfy the record handling, just not the dimension handling. Would be called like ak.concatenate(union_arrays(arrays))

def union_records(arrays):
    fields = reduce(or_, (set(a.fields) for a in arrays))
    out_arrays = []
    for a in arrays:
        for field in fields.difference(a.fields):
            a = ak.with_field(a, None, field)
        out_arrays.append(a)
    return out_arrays

def intersect_records(arrays):
    fields = list(reduce(and_, (set(a.fields) for a in arrays)))
    return [a[fields] for a in arrays]

ivirshup avatar Sep 06 '22 10:09 ivirshup

This is what I meant by the procedure. The key thing is that we're applying the transformation to each array at a time. They're not getting broadcasted, but they're getting information about all the arrays because we pass it into the transformation function.

Here's an a and b that would concatenate to union type.

>>> import awkward._v2 as ak
>>> a = ak.Array([{"a": 1}, {"a": 2}])
>>> b = ak.Array([{"b": 1.1}, {"b": 2.2}])

For the sake of argument, I'm going to take a union of their fields (the final concatenation will be an outer join).

>>> fields = ak.fields(a) + ak.fields(b)
>>> fields
['a', 'b']

Here's a transformation function that adds an empty content (all None) for each field name that's not already in the RecordArray. It is only applied to one array, but information about the other arrays comes in through the fields variable (closed-over in this function, but it could be passed in as a context).

def add_fields(layout, **kwargs):
    if layout.is_RecordType:
        asdict = dict(zip(layout.fields, layout.contents))
        for field in fields:
            if field not in asdict:
                asdict[field] = ak.contents.IndexedOptionArray(
                    ak.index.Index64(np.full(len(layout), -1, np.int64)), ak.contents.EmptyArray()
                )
        return ak.contents.RecordArray(asdict.values(), asdict.keys(), length=len(layout))

Now we apply it individually to both arrays and after concatenation, we get option-type fields for the ones that aren't in all inputs (because X type-unified with option[unknown] is option[X]), but not a union of records.

>>> a2 = ak.transform(add_fields, a)
>>> b2 = ak.transform(add_fields, b)
>>> a2
<Array [{a: 1, b: None}, {a: 2, ...}] type='2 * {a: int64, b: ?unknown}'>
>>> b2
<Array [{b: 1.1, a: None}, {b: 2.2, ...}] type='2 * {b: float64, a: ?unknown}'>
>>> ak.concatenate([a2, b2]).show(type=True)
type: 4 * {
    a: ?int64,
    b: ?float64
}
[{a: 1, b: None},
 {a: 2, b: None},
 {a: None, b: 1.1},
 {a: None, b: 2.2}]

This transformation goes down through all layers, even if there are multiple nested lists and option-types, until it finds a RecordArray. It does not continue through that RecordArray to any RecordArrays hidden within it. That's because the transformation function returns an array-type at the level of the RecordArray. (Returning an array-type from a transoformation function is how you say, "stop descending through the tree and use this as output.")

>>> c = ak.Array([[[None, {"a": 1}]]])
>>> c2 = ak.transform(add_fields, c)
>>> c2.show(type=True)
type: 1 * var * var * ?{
    a: int64,
    b: ?unknown
}
[[[None, {a: 1, b: None}]]]

So that's why this gets more complicated if you want to continue the procedure into nested records, because you'd have to prepare fields at each level before processing each array, and then launch a deeper ak.transform inside the transformation function. It's do-able; we have some functions like that. I just don't know how deeply you want/need this to go.

jpivarski avatar Sep 06 '22 17:09 jpivarski

I believe this is closed by #2365!

agoose77 avatar May 05 '23 15:05 agoose77