Connecting with Awkward Array
Via @jpivarski (and @martindurant) I learned about the Awkward Array project. Development has been mainly driven by high energy physics use cases, but the need for variable-length data structures and other "awkward" structures is common to other domains, and work in Awkward Array could inform if/how Zarr might support some of these features. A recent update from @jpivarski:
I've finished a lot of development on my side, and Awkward is ready to use (front page; Doxygen C++ and Sphinx Python documentation is done; tutorials are not). I remember that the interaction between regular-sized arrays and variable-sized arrays was important for your data, so RegularArrays (C++, Python) are worth taking a look at.
I'd also like to know if Zarr is taking a more columnar approach to ragged arrays. Even if it's not, I could write a C++ function to de-interlace list sizes from list contents, which could then be exposed to the Python layer for Zarr → Awkward for analysis. (There's an awkward1._io extension module for these sorts of things, including some special cases for ROOT.)
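To make the "de-interlace" idea concrete, here is a minimal Python sketch. It assumes a hypothetical flat buffer in which each list is stored as its length followed by its items; this is an illustrative layout, not a description of Zarr's actual on-disk encoding, and `deinterlace` is a made-up name rather than anything in `awkward1._io`. The Python loop is only for readability; the suggestion above is to do this step in C++.

```python
import numpy as np

def deinterlace(buffer):
    """Split [n0, items..., n1, items..., ...] into (offsets, content).

    Assumed (hypothetical) layout: each list's length is stored
    immediately before that list's items in one flat buffer.
    """
    buffer = np.asarray(buffer)
    counts = []
    pieces = []
    pos = 0
    while pos < len(buffer):
        n = int(buffer[pos])                      # length of the next list
        counts.append(n)
        pieces.append(buffer[pos + 1 : pos + 1 + n])
        pos += 1 + n                              # skip the length and its items
    # Columnar form: one array of list boundaries, one array of all items.
    offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int64)
    content = np.concatenate(pieces) if pieces else buffer[:0]
    return offsets, content

# Three lists: [10, 20, 30], [], [40, 50]
interleaved = [3, 10, 20, 30, 0, 2, 40, 50]
offsets, content = deinterlace(interleaved)
```

The result is the columnar pair (`offsets`, `content`), where list `i` is `content[offsets[i]:offsets[i + 1]]`.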
Development has been mainly driven by high energy physics use cases
The original awkward already showed how you could write loopy custom code and run at C speeds over deeply nested structures of lists and maps. Such "json-like" data is very different from the N-D arrays we usually think of, but it is perfectly possible that you could have an N-D array of such structs, or some other combination. The point is that each leaf node of the struct has its data stored in (some chunks of) homogeneous arrays, and the nested structure is defined by corresponding arrays of offsets.
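As a rough illustration of the offsets idea, here is a plain NumPy sketch. It does not use the Awkward Array API, and the variable and function names are just for this example. A ragged list of lists is held as one flat homogeneous array plus an array of offsets marking where each list starts and ends.

```python
import numpy as np

# Ragged data [[1, 2, 3], [], [4, 5]] stored in columnar form:
content = np.array([1, 2, 3, 4, 5])   # all leaf values, flattened
offsets = np.array([0, 3, 3, 5])      # list i spans content[offsets[i]:offsets[i + 1]]

def get_list(i):
    """Return the i-th variable-length list as a view into content."""
    return content[offsets[i]:offsets[i + 1]]

# Per-list lengths fall straight out of the offsets.
counts = np.diff(offsets)

# Per-list sums without a Python loop, using a prefix sum over the
# flat content; empty lists correctly come out as 0.
prefix = np.concatenate([[0], np.cumsum(content)])
sums = prefix[offsets[1:]] - prefix[offsets[:-1]]
```

Because `content` and `offsets` are both ordinary homogeneous arrays, each of them could in principle be chunked and stored like any other array, which is what makes this layout interesting for a chunked store.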
IIUC they are giving a SciPy talk this year. As everything has gone digital, it should be on YouTube pretty quickly. There's then a Q&A session after the videos go up where one can ask the speakers questions.
That's true—I've recorded it and everything. I think the talks go live on July 5 and there's a moderated discussion on July 7 at 2:30‒3:45pm U.S. Central time (schedule; title is "Awkward Array: Manipulating JSON-like Data with NumPy-like Idioms").
I think you'll be there, too, right? If so, see you then!
Looking forward to it 🙂