[WIP] Add support for `ragged` arrays
This adds client and backend support for reading/writing irregular arrays using the ragged package. Because ragged is more or less a wrapper around awkward, this PR reuses implementations from that structure family, or adds similar ones (e.g. serialization).
Implements #801.
Checklist
- [ ] Add a Changelog entry
- [x] Add the ticket number which this PR closes to the comment section
- [x] Writing ragged data from client (file storage)
- Reading ragged data to client (file storage)
  - [x] in full
  - [ ] sliced
  - [ ] from block
  - [ ] from block, sliced
  - [ ] with variable chunks
- [ ] Writing `[None]` shaped data from Bluesky/TiledWriter into SQL storage
- [x] Reading data from `RaggedAdapter` returned by `SQLAdapter` (SQL storage)
- Reading ragged data to client (SQL storage)
  - [ ] in full
  - [ ] sliced
  - [ ] from block
  - [ ] from block, sliced
  - [ ] with variable chunks
- Serialization
  - [x] JSON
  - [x] Arrow
  - [x] Parquet
  - [ ] others?
Awesome!
It looks like you found all the modules that need to be touched to add this.
The aspect that will need the most careful thought is the structure description and the HTTP APIs. These are designed to be used not only from the built-in Python client, but also from curl with tools like jq, browser-based applications, maybe Julia or Rust someday....
The Awkward form is quite complex. I suspect that only Python and C++ based clients, with access to awkward / awkward-cpp libraries, will be able to parse the form and engage with Tiled's Awkward structures in detail. (Unless, that is, IRIS-HEP builds Awkward libraries in other languages.) Clients without knowledge of Awkward can still get the data—exporting it to JSON, for example—but they probably cannot introspect or slice it in sophisticated ways.
If we were willing to similarly restrict ragged to clients with access to an awkward implementation, we wouldn't even really need to add a new structure family. We could implement it fully client-side, as a wrapper of the awkward client. But I see advantages in using the comparative simplicity of ragged to make it more accessible to simple clients.
This form construct is more flexible than ragged requires:
{'class': 'ListOffsetArray',
 'offsets': 'i64',
 'content': {'class': 'NumpyArray',
             'primitive': 'int64',
             'inner_shape': [],
             'parameters': {},
             'form_key': 'node1'},
 'parameters': {},
 'form_key': 'node0'}
A ragged form is always composed of one numpy "content" array and some number of "offset" arrays—full stop. It can be described thus (from #801):
class RaggedStructure(ArrayStructure):
    shape: Tuple[None | int, ...]  # overrides the base class, which declares this as Tuple[int, ...]
I'm not sure whether ragged always stores offset arrays with an int64 dtype. If other integer types may be needed, then we will need a supplemental `offset_datatype`, similar to the supplemental `coord_datatype` in sparse structures.
https://github.com/bluesky/tiled/blob/f6a9509f213824a32a385f3277d320bd85fd2370/tiled/structures/sparse.py#L19-L23
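To make the "one content array plus offset arrays" layout concrete, here is a minimal sketch for the two-dimensional case. It uses plain Python lists as stand-ins for the numpy arrays, and the helper names `flatten_ragged` and `ragged_shape` are hypothetical, not part of ragged or Tiled.

```python
from typing import List, Optional, Tuple


def flatten_ragged(rows: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Split a 2-D ragged list into a flat content list and an offsets list."""
    content: List[int] = []
    offsets: List[int] = [0]
    for row in rows:
        content.extend(row)
        # Each offset marks where the next row starts in the flat content.
        offsets.append(len(content))
    return content, offsets


def ragged_shape(offsets: List[int]) -> Tuple[Optional[int], ...]:
    """Describe the shape, with None standing for the variable-length dimension."""
    return (len(offsets) - 1, None)


rows = [[1, 2, 3], [], [4, 5]]
content, offsets = flatten_ragged(rows)
shape = ragged_shape(offsets)
```

A client that only understands this layout needs nothing beyond the content dtype, the offsets dtype, and the shape to rebuild the rows.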
Although reusing the awkward form keeps things simple, assuming your client already consumes awkward, I think a custom, much more constrained structure JSON is worthwhile to make ragged arrays a more portable and accessible concept.
Currently using flattened numpy streams over the wire, and for file storage, which I think is working well for full arrays.
For fetching data, I need to rethink either slicing or server-to-client serialization of sliced data, partly because ragged.reshape is not yet implemented, mostly because the new shape must be known by the client. I'm also testing determining the new shape by applying the slice to the known structure shape property.
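As a rough illustration of deriving the new shape from the structure's `shape` alone, the sketch below applies a tuple of integers and slices to a shape that may contain `None`. The function name `sliced_shape` is hypothetical, and the sketch assumes `None` can be reused to mean "length not knowable without the offsets", which is also what an integer index in front of a ragged dimension leaves behind.

```python
from typing import Optional, Tuple, Union

Index = Union[int, slice]
Shape = Tuple[Optional[int], ...]


def sliced_shape(shape: Shape, key: Tuple[Index, ...]) -> Shape:
    """Predict the shape after indexing, using only the structure's shape."""
    if len(key) > len(shape):
        raise IndexError("too many indices for this shape")
    result = []
    for dim, index in zip(shape, key):
        if isinstance(index, int):
            continue  # an integer index drops this dimension
        if dim is None:
            result.append(None)  # ragged: the length depends on the offsets
        else:
            # Regular dimension: count the positions the slice selects.
            result.append(len(range(*index.indices(dim))))
    # Dimensions not covered by the key are carried over unchanged.
    result.extend(shape[len(key):])
    return tuple(result)


first_two_rows = sliced_shape((3, None), (slice(0, 2),))
strided = sliced_shape((10, None, 4), (slice(None, None, 3), slice(1, None), 0))
```

This only covers basic integer and slice indexing; any case where the result comes back as `None` still needs the offsets from the server to be resolved.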
One likely solution is to return a JSON blob of the form:
{
"sliced_shape": [ <int>, <int|null>, ... ],
"sliced_offsets": [ [ <int>, ... ], ... ],
"stream": <flattened octet-stream>
}
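A minimal round-trip sketch of such a blob, for one ragged dimension of int64 values, might look like the following. The function names `encode_slice` and `decode_slice` are hypothetical, and base64 for the `stream` field is an assumption made here because JSON cannot carry raw bytes; a multipart response would avoid that overhead.

```python
import base64
import json
from array import array
from typing import List


def encode_slice(rows: List[List[int]]) -> str:
    """Server side: pack sliced rows into the proposed JSON blob."""
    content = array("q")  # signed 64-bit integers, native byte order
    offsets = [0]
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    payload = {
        "sliced_shape": [len(rows), None],
        "sliced_offsets": [offsets],  # one list per ragged dimension
        # Assumption: base64 text stands in for the octet-stream.
        "stream": base64.b64encode(content.tobytes()).decode("ascii"),
    }
    return json.dumps(payload)


def decode_slice(blob: str) -> List[List[int]]:
    """Client side: rebuild the rows from the flat stream and the offsets."""
    payload = json.loads(blob)
    content = array("q")
    content.frombytes(base64.b64decode(payload["stream"]))
    (offsets,) = payload["sliced_offsets"]  # this sketch handles one level only
    return [list(content[start:stop]) for start, stop in zip(offsets, offsets[1:])]


rows = [[1, 2, 3], [], [4, 5]]
blob = encode_slice(rows)
restored = decode_slice(blob)
```

The client never has to guess the new shape here, since it arrives alongside the data.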
For the short term, we could also just use to_json/from_json, which would preserve the structure at the cost of bandwidth.