zarr-python
Big data support
Hello,
We plan to generate a super large data array (lat, lon, time) from the 56,000 Sentinel-2 tiles (per year). The longitude variable would hold about 3 million coordinates, and the array would have about 2 billion chunks.
- What would the metadata size be for that? Is there an explicit mapping to each chunk, or is it computed from the index?
- Would the coordinates size be around 8 MB?
- How fast would a query be resolved? Is Zarr-python efficient with such a large index?
Thank you very much for your support.
Thanks for sharing. This sounds like a perfect application for Zarr.
- What would the metadata size be for that? Is there an explicit mapping to each chunk, or is it computed from the index?
The metadata size is independent of the array size, as you can see from the spec. Arbitrarily large arrays can be stored in Zarr. This is a fundamental goal of the project.
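For illustration, here's a minimal sketch (assuming the zarr-python 2.x API with the default in-memory store; the shape and chunking are made up) showing that the array metadata is a small, fixed-size JSON document:

```python
import zarr

# Hypothetical shape and chunks; no chunk data is written here,
# so this is cheap even for an enormous logical array.
z = zarr.zeros((3_000_000, 3_000_000, 365), chunks=(1000, 1000, 365),
               dtype="f4")

# The .zarray document records shape, chunks, dtype, compressor, etc.
# It stays a few hundred bytes regardless of how many chunks exist.
print(z.store[".zarray"].decode())
```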
- Would the coordinates size be around 8 MB?
Zarr has no concept of coordinates. Just groups and arrays. Perhaps you're thinking of Xarray?
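To make that concrete, a tiny sketch (names made up) of the only two primitives Zarr provides:

```python
import zarr

root = zarr.group()  # a group is just a container, like a directory
data = root.zeros("data", shape=(100, 100, 10), chunks=(10, 10, 10),
                  dtype="f4")

# Nothing here knows about lat/lon/time. Labelled coordinates are a
# convention xarray layers on top of plain groups and arrays.
```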
- How fast would a query be resolved? Is Zarr-python efficient with such a large index?
Can you clarify what you mean by "query"? Zarr-python supports accessing arrays via numpy-style indexing as described in the docs. The speed at which data are returned will likely depend entirely on your storage system.
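For example (hypothetical shape and chunking), a sketch of what that indexing looks like:

```python
import zarr

# Hypothetical (lat, lon, time) array.
z = zarr.zeros((100_000, 100_000, 365), chunks=(1000, 1000, 10),
               dtype="f4")

z[10, 20, 30]          # reads exactly one chunk
z[0:500, 0:500, 0:5]   # reads only the chunks intersecting the slice
```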
about 2 billion chunks.
This has me worried. There are very few storage media that are happy with so many files / objects. How do you plan to store your data?
More details would be helpful. What are the explicit lat, lon, time dimensions and chunk sizes you have in mind?
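For reference, the chunk count is easy to estimate up front; this back-of-envelope sketch uses made-up numbers to show how strongly the chunk shape drives the object count:

```python
from math import ceil, prod

shape  = (3_000_000, 3_000_000, 365)   # hypothetical (lat, lon, time)
chunks = (1000, 1000, 365)             # hypothetical chunk shape

# Ceil-division per dimension, multiplied together.
n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
print(f"{n_chunks:,}")  # 9,000,000 objects with this layout
```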
Thanks for your answer. I probably was not clear in my questions, as I mixed concepts from zarr and from xarray.
- We store geospatial data in a chunked multidimensional array (e.g. data) composed of three dimensions. Do you confirm that zarr can map from array index to chunk index purely as a function of the chunk shape? It does not use an internal mapping index?
- I don't think Amazon S3 is limited in the number of objects, as they already host many missions (and in any case, we may increase the chunk size to reduce their number). So I suppose my concern is how xarray reacts when reading a 3-million-record array which is used to map coordinates to the index of another array.
- Do you confirm that zarr can map from array index to chunk index purely as a function of the chunk shape?
Yes. If you have a 3D Zarr Array and use numpy indexing to retrieve a value by position, e.g. data[10, 20, 30], Zarr will figure out which chunks need to be read. I don't know what you mean by "internal mapping index".
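In other words, the lookup is pure arithmetic on the chunk shape stored in the metadata; there is no lookup table. A sketch of the idea (chunk shape made up):

```python
index  = (10, 20, 30)        # element position requested by the user
chunks = (1000, 1000, 10)    # chunk shape recorded in the metadata

# Integer-divide each coordinate by the chunk size along that dimension.
chunk_coords = tuple(i // c for i, c in zip(index, chunks))
chunk_key = ".".join(map(str, chunk_coords))  # "0.0.3" in the v2 layout
```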
- I suppose my concern is how xarray reacts when reading a 3-million-record array which is used to map coordinates to the index of another array.
I don't understand what you mean by "map coordinates". Can you clarify? Why do you think xarray will have to read 3 million records? Can you say more about the access pattern you have in mind?
If your data are on an irregular grid (or if for any other reason you need to look up values by something other than their position in the array), you'll need to use xarray, which can read from a zarr array with particular metadata. IIRC it stores its coordinate indices in the zarr metadata, i.e. JSON, so depending on how often it has to deserialise ~24 MB of coordinates, there might be some issues there. If your data aren't on a grid at all, I don't think zarr or xarray can help you.
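If it helps, this is roughly what the xarray route looks like (the store path and coordinate names are hypothetical): xarray reads the coordinate variables eagerly to build its index, while the data variables stay lazy until you slice them.

```python
import xarray as xr

# Hypothetical store; reading from S3 additionally needs s3fs installed.
ds = xr.open_zarr("s3://my-bucket/sentinel2.zarr")

# Label-based selection: xarray translates coordinate values into
# positional indices, then zarr fetches only the chunks it needs.
subset = ds.sel(lat=slice(40.0, 41.0), lon=slice(-3.0, -2.0))
```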