Usage example: DMR++ in requestor pays bucket

Open jsignell opened this issue 2 months ago • 3 comments

I just did an experiment using DMR++ files stored in a requestor pays bucket.

At a high level it took 40sec to lazily open one subgroup in one of the hdf5 files using virtualizarr and it took ~3sec to open the full dataset described by one of the dmr++ files (I know that's not apples to apples, but ¯\_(ツ)_/¯ ). I wanted to drop a link over here in case it serves as a useful jumping-off point for someone else: https://gist.github.com/jsignell/9e5683cd114fb34d4a0b92fc85ffc951

Running the notebook requires you to have AWS creds as env vars. If you have access it does run on https://hub.openveda.cloud
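For anyone trying to reproduce this, a quick pre-flight check along these lines might save a confusing failure later. It is only a sketch: `missing_aws_env_vars` is a hypothetical helper (not part of the notebook or of VirtualiZarr), and it assumes the credentials are supplied through the standard AWS SDK variable names.

```python
import os

# Standard AWS SDK environment variable names. AWS_SESSION_TOKEN is also
# needed when using temporary credentials, but it is not checked here.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(env=None):
    """Return the required AWS credential variables that are unset or empty.

    `env` defaults to os.environ; pass a dict to check some other mapping.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Missing AWS credentials:", ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

This only confirms the variables are present; it does not verify that the credentials are valid or that they grant access to the bucket.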

jsignell, Sep 30 '25 16:09

> At a high level it took 40sec to lazily open one subgroup in one of the hdf5 files using virtualizarr and it took ~3sec to open the full dataset described by one of the dmr++ files (I know that's not apples to apples, but ¯\_(ツ)_/¯ ).

This is expected right? The DMR++ file effectively already has all the metadata pre-aggregated for us, whereas parsing the hdf5 currently requires lots of small GETs.
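A back-of-the-envelope model of why the request count dominates, as a rough sketch only. The function name `estimated_fetch_seconds` and every number below are made up for illustration; none of them are measurements from the notebook, and the model ignores transfer time and parsing cost.

```python
import math


def estimated_fetch_seconds(n_requests, latency_s, concurrency=1):
    """Rough wall time for n_requests GETs dominated by round-trip latency.

    Assumes each batch of `concurrency` requests costs one round trip.
    """
    if n_requests < 0 or latency_s < 0 or concurrency < 1:
        raise ValueError("n_requests and latency_s must be >= 0, concurrency >= 1")
    round_trips = math.ceil(n_requests / concurrency)
    return round_trips * latency_s


if __name__ == "__main__":
    # Hypothetical values: many small metadata reads vs. one sidecar file read.
    many_small_gets = estimated_fetch_seconds(n_requests=500, latency_s=0.05)
    single_sidecar_read = estimated_fetch_seconds(n_requests=1, latency_s=0.05)
    print(f"many small GETs: ~{many_small_gets:.2f}s")
    print(f"one pre-aggregated read: ~{single_sidecar_read:.2f}s")
```

Under this simple model the gap scales with the number of sequential round trips, which is why pre-aggregated metadata opens so much faster.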

TomNicholas, Sep 30 '25 16:09

Oh yeah it's expected! Just trying to summarize the scale of the difference in case it's useful for people.

jsignell, Sep 30 '25 17:09

The other factor that should be documented when starting from DMR++ / Kerchunk is the impact on the last_updated_at_checksum: you're starting from references that do not store information about their own validity, so there is nothing to check them against. It's a trade-off between performance and guaranteed accuracy.
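To make the trade-off concrete, here is a conceptual sketch of what a validity check could look like. `ChunkRef` and `check_ref` are hypothetical names invented for this example, not the API of VirtualiZarr, Kerchunk, or any store; the assumption is simply that a reference may or may not carry a timestamp recorded when it was created.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical byte-range reference into a source file."""
    path: str
    offset: int
    length: int
    # None models references that carry no information about their validity.
    last_updated_at: Optional[datetime] = None


def check_ref(ref: ChunkRef, object_last_modified: datetime) -> str:
    """Classify a reference against the object's current modification time."""
    if ref.last_updated_at is None:
        return "unverified"  # nothing to compare against
    if object_last_modified > ref.last_updated_at:
        return "stale"  # object changed after the reference was recorded
    return "valid"


if __name__ == "__main__":
    recorded = datetime(2025, 1, 1, tzinfo=timezone.utc)
    modified = datetime(2025, 6, 1, tzinfo=timezone.utc)
    print(check_ref(ChunkRef("s3://bucket/file.h5", 0, 1024, recorded), modified))
    print(check_ref(ChunkRef("s3://bucket/file.h5", 0, 1024), modified))
```

References without a recorded timestamp can only ever come back as "unverified", which is the accuracy cost of the faster path.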

maxrjones, Oct 03 '25 17:10