Usage example: DMR++ in a requester-pays bucket
I just did an experiment using DMR++ files stored in a requester-pays bucket.
At a high level, it took ~40 sec to lazily open one subgroup in one of the HDF5 files using virtualizarr, and ~3 sec to open the full dataset described by one of the DMR++ files (I know that's not apples to apples, but ¯\_(ツ)_/¯ ). I wanted to drop a link over here in case it serves as a useful jumping-off point for someone else: https://gist.github.com/jsignell/9e5683cd114fb34d4a0b92fc85ffc951
Running the notebook requires having AWS credentials set as environment variables. If you have access, it does run on https://hub.openveda.cloud
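For anyone who wants the gist of the two calls without opening the notebook, here is a minimal sketch. It assumes virtualizarr's `open_virtual_dataset` API (with its DMR++ reader) and s3fs requester-pays support; the bucket, key, and subgroup names are placeholders, not the ones from the gist, and newer virtualizarr releases may expose this through a parser-based API instead.

```python
# Sketch only: placeholder paths, assumed reader_options keys.
from virtualizarr import open_virtual_dataset

# fsspec/s3fs options for a requester-pays bucket; your AWS creds must be
# available (e.g. as environment variables).
reader_options = {"storage_options": {"anon": False, "requester_pays": True}}

# ~3 sec path: the .dmrpp sidecar already aggregates the chunk metadata,
# so opening it needs only a single small read.
vds_dmrpp = open_virtual_dataset(
    "s3://example-bucket/granule.hdf5.dmrpp",
    filetype="dmrpp",
    indexes={},
    reader_options=reader_options,
)

# ~40 sec path: parsing the HDF5 file itself issues many small GETs to
# discover each variable's chunk layout; here we open just one subgroup.
vds_hdf5 = open_virtual_dataset(
    "s3://example-bucket/granule.hdf5",
    group="/science/grids/data",  # hypothetical subgroup name
    indexes={},
    reader_options=reader_options,
)
```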
This is expected, right? The DMR++ file effectively already has all the metadata pre-aggregated for us, whereas parsing the HDF5 currently requires lots of small GETs.
Oh yeah it's expected! Just trying to summarize the scale of the difference in case it's useful for people.
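To make the "metadata pre-aggregated" point concrete, here is a rough illustration (not from the gist): the `.dmrpp` sidecar is one small XML document that already lists every chunk's byte offset and length, so a single GET recovers what would otherwise take many ranged reads against the HDF5 file. The bucket/key are placeholders, and the namespace URI is the one OPeNDAP's DMR++ documents normally declare, so verify it against your own files.

```python
# Sketch only: placeholder paths; namespace URI assumed from standard DMR++.
import xml.etree.ElementTree as ET
import s3fs

fs = s3fs.S3FileSystem(anon=False, requester_pays=True)

# One GET fetches the whole sidecar.
dmrpp_xml = fs.cat("s3://example-bucket/granule.hdf5.dmrpp")

root = ET.fromstring(dmrpp_xml)
ns = {"dmrpp": "http://xml.opendap.org/dap/dmrpp/1.0.0#"}
chunks = root.findall(".//dmrpp:chunk", ns)
print(f"{len(chunks)} chunk references recovered from a single request")
```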
The other factor that should be documented regarding starting from DMR++ / Kerchunk is the impact on the last_updated_at_checksum, since you're starting from references that do not store information about their own validity. It's a trade-off between performance and guaranteed accuracy.
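One hedged way to compensate, sketched below under assumed names: record the source object's ETag / LastModified (via a HEAD request) at the time the references are generated, since the DMR++/Kerchunk references themselves carry no such information, and compare again before trusting them later. `source_fingerprint` and the bucket/key are illustrative, not an existing API.

```python
# Sketch only: placeholder bucket/key, illustrative helper name.
import boto3

s3 = boto3.client("s3")

def source_fingerprint(bucket: str, key: str) -> dict:
    """Capture the metadata needed to check later whether references are stale."""
    head = s3.head_object(Bucket=bucket, Key=key, RequestPayer="requester")
    return {"etag": head["ETag"], "last_modified": head["LastModified"].isoformat()}

fingerprint = source_fingerprint("example-bucket", "granule.hdf5")
# Store `fingerprint` next to the virtual references; before reusing them,
# repeat the HEAD request and compare the values.
```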