Possible future use of offset references in indexing ROOT files

Open jpivarski opened this issue 3 years ago • 0 comments

Reposting from my email to @martindurant:

This fsspec-reference-maker, which indexes byte positions in HDF5 files and serves the HDF5 files as Zarr data—such a thing could be useful for serving particle physics data. The ROOT files contain arrays at byte positions that can be pre-scanned and put in some sort of database. I did a test of this once with SkyHook, a UCSC research project (part of Ceph). Their preferred format for these byte positions was FlatBuffers.
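As a minimal sketch of the pre-scanning idea, the snippet below builds an index of byte positions for a toy in-memory container and then fetches one chunk by its recorded range. The container layout, the `branch/N` key names, and the `read_chunk` helper are all hypothetical; this is not the ROOT format, only a stand-in for "arrays at byte positions".

```python
import io

# Toy container: a short header followed by variable-length chunks.
# Hypothetical layout, used only to illustrate indexing by byte position.
chunks = [b"alpha", b"bravo-bravo", b"charlie"]
header = b"HDR0"

buf = io.BytesIO()
buf.write(header)

# The "pre-scan": record where each chunk starts and how long it is.
index = {}  # key -> (offset, length)
for i, payload in enumerate(chunks):
    index[f"branch/{i}"] = (buf.tell(), len(payload))
    buf.write(payload)


def read_chunk(fileobj, key):
    """Fetch one chunk by seeking to its recorded byte range."""
    offset, length = index[key]
    fileobj.seek(offset)
    return fileobj.read(length)


second = read_chunk(buf, "branch/1")
print(index)
print(second)
```

Once the index exists, a reader never has to parse the container again: any store that can serve byte ranges (local file, HTTP, object storage) is enough.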

https://github.com/diana-hep/uproot-skyhook

I'd be interested in revisiting this project (and possibly involving people who are more closely related to physics data management here). The complications are that (a) the data are compressed with zlib, lzma, lz4, or zstandard and (b) most data have nested structures, so it would be a transform-to-Awkward like the one you recently developed. The second point might force this to wait for interpretation-aware Zarr extensions. Issue (c) might be that some ROOT files can have astonishingly small chunks, like tens of kilobytes sometimes, which requires a very compact handling of the byte position metadata. (They're not predictably distributed, either: it has to be a list of numbers.)
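To give a rough feel for point (c), here is a sketch comparing a JSON listing of irregular byte ranges with the same numbers packed into fixed-width arrays. The chunk count, the size range, the `file.root` name, and the `branch/N` keys are made-up assumptions for illustration, not measurements from real ROOT files.

```python
import json
import random
from array import array

random.seed(0)

# Hypothetical file: 100,000 chunks of irregular size (tens of kilobytes).
n_chunks = 100_000
lengths = [random.randint(10_000, 40_000) for _ in range(n_chunks)]

# Offsets are the running sum of the lengths; they follow no fixed stride,
# so they have to be stored as an explicit list of numbers.
offsets = []
position = 0
for length in lengths:
    offsets.append(position)
    position += length

# Option 1: one JSON entry per chunk, key -> [url, offset, length].
as_json = json.dumps(
    {f"branch/{i}": ["file.root", off, ln]
     for i, (off, ln) in enumerate(zip(offsets, lengths))}
)

# Option 2: two packed arrays of unsigned integers.
packed_offsets = array("Q", offsets)  # unsigned 64-bit
packed_lengths = array("I", lengths)  # unsigned int, platform-dependent width
packed_size = (len(packed_offsets) * packed_offsets.itemsize
               + len(packed_lengths) * packed_lengths.itemsize)

print(f"JSON:   {len(as_json):>10,d} bytes")
print(f"packed: {packed_size:>10,d} bytes")
```

The packed form drops the repeated key and URL strings entirely, which is where most of the JSON size goes when chunks are small and numerous.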

Martin's response:

Yes, we may well get to supporting binary storage for a set of offset chunks - I wasn't initially thinking of the millions of keys case. JSON is nice, though, because everyone will support it. So it may well be that, as well as a bunch of scanners for various file formats, there will also be a range of storage options, possibly including zarr itself.

a) the set of compressors doesn't sound like a problem, they are all supported by zarr (https://numcodecs.readthedocs.io/en/latest/index.html#contents) and should work on any non-python implementations too

b) right: if zarr is your machine, then any structure not natively understood by zarr would need to be in an extension (v3) or informal convention (v2)

c) we can face that when we get to it. For POC, JSON should work. The ROOT stack is sufficiently deep and complex that if you can expose some examples via zarr/fsspec alone, that might be attractive, and then we can face the likely bottlenecks later. I quite like the idea of using zarr to store the offsets, if it's the intended loader anyway. The fsspec implementation takes either a JSON file or a dict/mapping, so you can already provide something dict-like which generates offsets or stores them in a compact way.
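The "something dict-like" idea could look roughly like the sketch below: a read-only mapping that keeps offsets and lengths in packed arrays and only builds a `[url, offset, length]` entry when a key is requested. The class name `PackedReferences`, the `prefix/N` key layout, and the example URL are hypothetical, and the snippet deliberately does not call fsspec, so whether a particular fsspec version accepts such an object unchanged is an assumption to verify.

```python
from array import array
from collections.abc import Mapping


class PackedReferences(Mapping):
    """Read-only mapping: 'prefix/N' -> [url, offset, length].

    Hypothetical helper for illustration; entries are generated on
    lookup instead of being stored as one Python list per chunk.
    """

    def __init__(self, url, prefix, offsets, lengths):
        if len(offsets) != len(lengths):
            raise ValueError("offsets and lengths must have equal length")
        self.url = url
        self.prefix = prefix
        self.offsets = array("Q", offsets)  # unsigned 64-bit
        self.lengths = array("Q", lengths)

    def __getitem__(self, key):
        if not isinstance(key, str):
            raise KeyError(key)
        head, _, tail = key.rpartition("/")
        # Accept only canonical keys such as 'muon_pt/12'.
        if head != self.prefix or not (tail.isascii() and tail.isdigit()):
            raise KeyError(key)
        i = int(tail)
        if str(i) != tail or i >= len(self.offsets):
            raise KeyError(key)
        return [self.url, self.offsets[i], self.lengths[i]]

    def __iter__(self):
        return (f"{self.prefix}/{i}" for i in range(len(self.offsets)))

    def __len__(self):
        return len(self.offsets)


refs = PackedReferences(
    "https://example.org/data.root",  # placeholder URL
    "muon_pt",                        # placeholder array name
    offsets=[100, 350, 900],
    lengths=[250, 550, 120],
)
print(refs["muon_pt/1"])
print(list(refs))
```

Because it subclasses `collections.abc.Mapping`, the object gets `in`, `.get()`, `.keys()`, and `.items()` for free, which is usually what "dict-like" consumers rely on.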

PS: this conversation could be copied to the fsspec-reference-maker repo too

Context: Awkward Array can be expressed in Zarr, but a conversion is needed. @martindurant did this here: https://github.com/martindurant/awkward_extras/tree/main/awkward_zarr

jpivarski, Nov 19 '20 15:11