dask-awkward Using `.` as a delimiter in field names of `ak.Records` leads to wrong report of necessary columns

Using `.` within field names of `ak.Record`s is valid. `.` is also commonly used in Parquet schemas and ROOT TBranch names, which are translated directly into field names of RecordArrays.

Currently, dask-awkward reports the same necessary columns for the following two cases:

import awkward as ak
import dask_awkward as dak

# nested record
ak_array = ak.zip({"pt": ak.zip({"foo": [10, 20, 30, 40]})})
dak_array = dak.from_awkward(ak_array, 1)
dak.necessary_columns(dak_array.pt.foo)
# >> {'from-awkward-9982974575212ac0cbf053decf1d9e0e': frozenset({'pt.foo'})}

# field name with "."
ak_array = ak.zip({"pt.foo": [10, 20, 30, 40]})
dak_array = dak.from_awkward(ak_array, 1)
dak.necessary_columns(dak_array["pt.foo"])
# >> {'from-awkward-6750f2c89e6822a055048307a7430a02': frozenset({'pt.foo'})}

Alternatively, maybe the frozenset should contain ("pt", "foo") in the first case ("nested record") and ("pt.foo",) in the second case ("field name with ."), so that these cases are no longer ambiguous?
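To illustrate the difference, here is a minimal sketch that uses plain nested dicts as a stand-in for records. It is not how dask-awkward computes its report; `column_paths` is a hypothetical helper written only for this example.

```python
def column_paths(record, prefix=()):
    """Yield one tuple of field names per leaf column of a nested dict."""
    for name, value in record.items():
        if isinstance(value, dict):
            # A nested record: extend the path and recurse.
            yield from column_paths(value, prefix + (name,))
        else:
            # A leaf column: the path ends here.
            yield prefix + (name,)


nested = {"pt": {"foo": [10, 20, 30, 40]}}  # nested record
dotted = {"pt.foo": [10, 20, 30, 40]}       # field name with "."

nested_paths = frozenset(column_paths(nested))
dotted_paths = frozenset(column_paths(dotted))

# Joining the paths with "." reproduces the collision reported above.
nested_joined = frozenset(".".join(path) for path in nested_paths)
dotted_joined = frozenset(".".join(path) for path in dotted_paths)

print(nested_paths)  # frozenset({('pt', 'foo')})
print(dotted_paths)  # frozenset({('pt.foo',)})
print(nested_joined == dotted_joined)  # True
```

As tuples the two layouts stay distinct; only the "."-joined strings collide.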

Nov 20 '24 16:11 pfackeldey

Yep, it's a problem. The fields should probably always be tuples of strings everywhere. @jpivarski suggested using "'pt.foo'" (or backticks, which I can't do here) where the API allows for normalised string input.

The complexity comes in how we can pass these to the backend column selection, and if/where "*" is allowed. For example, pyarrow does accept and expect "." delimited strings. All of this complexity needs to be captured in how whatever the backend gives us is encoded into form keys and decoded back again (at load selection time). We care about Parquet, ROOT and JSON, each of which has its own quirks.
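As a rough sketch of what normalised string input with backticks could look like: `parse_column` and `format_column` below are hypothetical helpers, not part of dask-awkward, Awkward or pyarrow, and the quoting rule is only an assumption based on the suggestion above. Field names that themselves contain a backtick are not handled.

```python
def parse_column(spec):
    """Split a column string on ".", except inside backtick-quoted segments."""
    parts, current, quoted = [], [], False
    for ch in spec:
        if ch == "`":
            # Toggle quoting; the backtick itself is not part of the name.
            quoted = not quoted
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError(f"unbalanced backtick in {spec!r}")
    parts.append("".join(current))
    return tuple(parts)


def format_column(path):
    """Inverse of parse_column: quote any segment that contains a "."."""
    return ".".join(f"`{part}`" if "." in part else part for part in path)


print(parse_column("pt.foo"))         # ('pt', 'foo')
print(parse_column("`pt.foo`"))       # ('pt.foo',)
print(parse_column("muon.`pt.foo`"))  # ('muon', 'pt.foo')
print(format_column(("pt.foo",)))     # `pt.foo`
```

Internally everything would stay a tuple of strings; the quoted form would only appear at the API boundary, and each backend would still need its own translation from the tuple.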

Nov 20 '24 16:11 martindurant

(It's worth mentioning that most parquet frameworks disallow field names containing various special characters, but this restriction is not actually part of the spec.)

Nov 20 '24 16:11 martindurant

A good convention could be to set form keys to a tuple representing a path from root to each node (easy to set up in a recursive algorithm), with strings representing RecordArray field names, integers for tuple and UnionArray slots, and None for all other nesting (lists, option-types, indexed arrays). The path wouldn't encode the specifics of which type of nodes are at each level—you have the form for that—but it would indicate which branch you took. Form keys have to be strings, so you could take the repr, which can be reconstituted with ast.literal_eval.

>>> import ast
>>> tuple_of_things = ("fieldname", 2, None, "another")
>>> form_key = repr(tuple_of_things)
>>> form_key
"('fieldname', 2, None, 'another')"
>>> tuple_again = ast.literal_eval(form_key)
>>> tuple_again
('fieldname', 2, None, 'another')
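The recursive setup could be sketched roughly like this. To stay self-contained it walks a simplified, hand-written dict that only mimics the shape of a form (the "class", "fields", "contents" and "content" keys are assumptions of this example), and `assign_path_keys` is a hypothetical name, not an existing Awkward function.

```python
import ast


def assign_path_keys(node, path=()):
    """Return a copy of `node` whose "form_key" is repr() of the path to it."""
    out = dict(node)
    out["form_key"] = repr(path)
    kind = node["class"]
    if kind == "RecordArray":
        fields = node.get("fields")
        # Strings for named fields, integers for tuple slots (fields is None).
        steps = fields if fields is not None else range(len(node["contents"]))
        out["contents"] = [
            assign_path_keys(child, path + (step,))
            for step, child in zip(steps, node["contents"])
        ]
    elif kind == "UnionArray":
        # Integers for union slots.
        out["contents"] = [
            assign_path_keys(child, path + (index,))
            for index, child in enumerate(node["contents"])
        ]
    elif "content" in node:
        # None for every other kind of nesting (lists, option-types, ...).
        out["content"] = assign_path_keys(node["content"], path + (None,))
    return out


leaf = {"class": "NumpyArray", "primitive": "int64"}
form = {
    "class": "RecordArray",
    "fields": ["pt", "pt.foo", "pair"],
    "contents": [
        {"class": "RecordArray", "fields": ["foo"], "contents": [leaf]},
        {"class": "ListOffsetArray", "content": leaf},
        {"class": "RecordArray", "fields": None, "contents": [leaf, leaf]},
    ],
}

keyed = assign_path_keys(form)
nested_key = keyed["contents"][0]["contents"][0]["form_key"]
dotted_key = keyed["contents"][1]["content"]["form_key"]
slot_key = keyed["contents"][2]["contents"][1]["form_key"]

print(nested_key)                    # ('pt', 'foo')
print(dotted_key)                    # ('pt.foo', None)
print(slot_key)                      # ('pair', 1)
print(ast.literal_eval(nested_key))  # ('pt', 'foo')
```

With keys like these, the nested record and the dotted field name end up with different form keys, and the path can be read back from the key without searching the form.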

If you only cared about uniqueness, there's already a utility for that in Awkward: Content.form_with_key. But this came up because @pfackeldey wants to go from a specific node to the nested field names that lead to that node. If the form keys were only unique and the form is available, it would be possible to find that path, but only by doing a recursive tree-search every time. It's nice to have this path encoded in the form key itself so that you don't have to do a search.

Speaking of which, that might be better as an Awkward utility, alongside Content.form_with_key.

Nov 20 '24 17:11 jpivarski

Maybe this would be useful? https://github.com/scikit-hep/awkward/pull/3311

If so, it needs some tests.

Nov 20 '24 17:11 jpivarski