Better documentation
This library is supposed to have an API so small it's practically non-existent (everything being done through xarray instead), but we still probably need documentation that's more than just the Readme.
For example, a detailed explanation of when to use `combine_nested` vs `combine_by_coords`, or how to use `preprocess` to order the datasets using the names of the files.
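To illustrate the `preprocess` recipe, here's a minimal sketch (the `data_YYYY-MM.nc` naming scheme is hypothetical): `combine_by_coords` orders datasets by coordinate values, so `preprocess` can derive a coordinate from the filename, while `combine_nested` just trusts the order of the paths you pass in.

```python
import os
from glob import glob

import numpy as np
import xarray as xr


def add_time_from_filename(ds):
    """Derive a time coordinate from a (hypothetical) 'data_YYYY-MM.nc' filename."""
    path = ds.encoding["source"]  # xarray records the source file path here
    stamp = os.path.basename(path)[len("data_"):-len(".nc")]  # e.g. '2020-01'
    return ds.expand_dims(time=[np.datetime64(stamp)])


# combine_by_coords sorts the datasets by the derived coordinate,
# regardless of the order in which the files were globbed:
ds = xr.open_mfdataset("data_*.nc", preprocess=add_time_from_filename,
                       combine="by_coords")

# combine_nested instead concatenates in exactly the order given:
ds = xr.open_mfdataset(sorted(glob("data_*.nc")), combine="nested",
                       concat_dim="time")
```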
A good start would just be to move each section of the readme into a separate page of a Read the Docs build.
I'm thinking about how the documentation of this package should be structured. It's tricky because there are multiple concepts, it requires understanding xarray's model pretty well, and not everything works yet.
We need narrative documentation explaining:
- [x] What Kerchunk reads from files
- [x] Storage of these byte ranges in a chunk manifest
- [x] Wrapping of these in a ManifestArray
- [x] Wrapping of those inside xarray objects
- [x] Concatenation via xarray without indexes (#52)
- [x] Concatenation via xarray with indexes (#52)
- [ ] Doing all the above in one go using open_mfdataset
- [x] Writing out to Kerchunk
- [x] Reading using fsspec
- [x] Eventually writing out to a Zarr manifest
This can be adapted from the existing content in the readme.
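To make that narrative arc concrete, here's a rough end-to-end sketch (file names are hypothetical, and exact signatures may differ between versions; passing `indexes={}` tells `open_virtual_dataset` not to create in-memory indexes):

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# open each file as a "virtual" dataset of chunk manifests (no array data loaded)
vds1 = open_virtual_dataset("day1.nc", indexes={})
vds2 = open_virtual_dataset("day2.nc", indexes={})

# concatenate without indexes, trusting the order given (#52)
combined = xr.concat([vds1, vds2], dim="time",
                     coords="minimal", compat="override")

# write the combined manifest out as kerchunk references
combined.virtualize.to_kerchunk("combined.json", format="json")

# read the references back lazily via fsspec's reference filesystem
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(fs.get_mapper(), engine="zarr", consolidated=False)
```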
We also need some example-based docs that can be treated more as recipes. Those should cover:
- [ ] Concatenation along one dimension
- [ ] Concatenation along multiple dimensions
- [ ] Concatenating staggered grids?
- [ ] Concatenating in order determined by information in file names/attributes
- [ ] Reading the written references using fsspec
- [ ] Adjusting metadata in references (e.g. altering paths to be S3 URLs rather than local paths if you move the archival files to the cloud; see the sketch below)
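For that last recipe, a minimal sketch assuming kerchunk's version-1 JSON reference format, where each chunk key under `"refs"` maps to a `[path, offset, length]` entry (the local prefix and bucket name here are made up):

```python
import json

with open("combined.json") as f:
    refs = json.load(f)

for key, entry in refs["refs"].items():
    # chunk entries are [path, offset, length]; inline data is stored as plain strings
    if isinstance(entry, list) and entry[0].startswith("/local/archive/"):
        entry[0] = entry[0].replace("/local/archive/", "s3://my-bucket/", 1)

with open("combined_s3.json", "w") as f:
    json.dump(refs, f)
```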
We also need some developer docs:
- [x] API reference, even for a lot of internals
- [x] Relation to Kerchunk, and motivation for this being a separate project to Kerchunk
- [ ] Discussion about Kerchunk file format
- [ ] Discussion of virtual concatenation in Zarr as an intended way to eventually concatenate arrays with different codecs
- [ ] Discussion of the status of variable-length chunking
Some of those might be better placed under an FAQ page.
I would add somewhere that "only local and s3 file protocols are currently supported with fsspec".
Getting the `NotImplementedError` caught me by surprise. I was hoping the use of fsspec would guarantee support for all of its filesystem implementations.
@okz which implementation were you trying to use?
@TomNicholas It was Azure (az or abfs). The exception was so specific that I assumed there was a real requirement for s3, but looking at `open_virtual_dataset_from_v3_store` I couldn't see anything specific to s3. Are you aware of any?
Sorry to sneak this one in here, but I was planning to use VirtualiZarr to concatenate "many" files uploaded from instruments in real time. The "manifests and kerchunk sidecar" ideas would have simplified concurrent read/write issues. My assumption is that once kerchunk references and Zarr stores can be read as virtual datasets (#118 and #63), this is possible. Do you see any obvious issues with this approach?
Sorry @okz I totally forgot to reply to you.
A lot has changed since your comment. Your approach makes sense, though now we have Icechunk to replace the kerchunk format. It sounds like you might need to wait for Icechunk to support Azure first (see https://github.com/earth-mover/icechunk/issues/602). Feel free to raise other issues.
(Going back to the purpose of this issue) the docs should have a separation between conceptual content and just "how do I use this" content.