kerchunk
Feature/simple cli for chunking local or remote NetCDF files
Hello, thanks for this lib !
I ended up rewriting the scan and consolidate parts several times, following your tutorial. I thought this small CLI might be of interest when working outside a notebook! Happy to hear your view on this.
Usage example:

```
$ kerchunk-nc -i s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc -i s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc
INFO:kercli:Scanning s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Scanning s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Data loaded from json/mydataset : 2 found
INFO:kercli:Consolidating to zarr/mydataset.zarr ...
```
This will result in:

```
$ tree json/ zarr/
json/
└── mydataset
    └── s3:
        └── era5-pds
            └── 2020
                ├── 01
                │   └── data
                │       └── air_pressure_at_mean_sea_level.json
                └── 02
                    └── data
                        └── air_pressure_at_mean_sea_level.json
zarr/
└── mydataset.zarr
```
The help looks like:

```
$ kerchunk-nc --help
Usage: kerchunk-nc [OPTIONS]

  Cli for ker-chunking local or remote NetCDF files

Options:
  --name TEXT           Dataset name  [default: mydataset]
  -i, --input TEXT      Input file url, readable by fsspec  [required]
  --input-format TEXT
  --input-fs-args TEXT  Arguments that will be passed to fsspec.open()
                        [default: {'anon': True}]
  --json-dir TEXT       Where to store scan output as json
  --zarr-output TEXT    Output of fully merged kerchunk zarr file
  --force-scan          Force scanning input file, even if json file exists
  -v, --verbose
  --help                Show this message and exit.
```
I wonder, are you aware of pangeo-forge? It provides a recipe-runner abstraction for reading various xarray-supported file types and converting them for storage. That conversion can go via kerchunk to produce JSON files like you are doing. The target is mostly automatic running of recipes on various cloud backends, so very large datasets; but you can execute a recipe locally in a way that is probably quite similar to the CLI here. I am not saying that I am opposed to the CLI, but if pangeo-forge is simple enough to use for the same purpose, it seems better not to duplicate effort. Would you mind having a look to see whether what is there makes sense to you and whether you can easily reach the same workflow?
If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).
Also, before I forget: the auto_dask function also does so much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.
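For context, the "parallelised tree reduction" mentioned here can be illustrated with a stdlib-only sketch. This is a generic illustration of the pattern, not kerchunk's actual implementation — auto_dask itself uses dask, and `combine` below is a hypothetical stand-in for merging two reference sets (which kerchunk would do with MultiZarrToZarr):

```python
import functools
from concurrent.futures import ThreadPoolExecutor

def combine(a, b):
    # Hypothetical placeholder for merging two reference sets;
    # kerchunk would apply MultiZarrToZarr logic here instead.
    return {**a, **b}

def tree_reduce(items, fn, batch_size=2):
    """Reduce a list level by level; batches at each level merge in parallel."""
    with ThreadPoolExecutor() as pool:
        while len(items) > 1:
            batches = [items[i:i + batch_size]
                       for i in range(0, len(items), batch_size)]
            # Each batch is folded independently, so a level's merges
            # can run concurrently before the next level starts.
            items = list(pool.map(lambda b: functools.reduce(fn, b), batches))
    return items[0]

refs = [{"a": 1}, {"b": 2}, {"c": 3}, {"d": 4}]
merged = tree_reduce(refs, combine)
print(merged)  # → {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```

The point of the tree shape is that n inputs need only about log2(n) sequential levels instead of n-1 sequential pairwise merges.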
I had a quick look at pangeo-forge before this PR, but it seems very cloud-oriented to me and a little bit "the-big-thing" for what I want.
My use case was really to tackle the simple case, easy to demonstrate and to explain, where everything goes well and there is no need to write any Python.
I understood that pangeo-forge's target is to cover all use cases (hence the need to write some Python in recipe.py) and to provide a cloud-ready CI (which is a really, really great job!!!).
Possible ways to go on:
- Continue this CLI in kerchunk, covering the simple use case directly from this lib, without external libs
- Make a simple-use-case CLI in pangeo-forge-recipes
Happy to get your view on this.
> Also, before I forget: the auto_dask function also does so much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.
Thanks, I totally missed this function! For sure I will use it if we go ahead.
> If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).
I currently started with NetCDF files, but yes, I need to cover GRIB as well.
Would it make sense to make this a subcommand? Then there could be another subcommand for combining reference JSON files?
> Would it make sense to make this a sub command?
I have no preference between subcommands and passing extra arguments, it's just a matter of style.
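To make the subcommand option concrete, here is a minimal sketch of what such a layout could look like. All command and option names are hypothetical, and it uses stdlib argparse purely for illustration (the later discussion leans towards Typer); the scan/combine split mirrors the two steps the thread describes:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="kerchunk")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical `scan` subcommand: one input file -> one reference JSON.
    scan = sub.add_parser("scan", help="Scan one input file into a reference JSON")
    scan.add_argument("-i", "--input", required=True,
                      help="Input file URL, readable by fsspec")
    scan.add_argument("--input-format", default="netcdf",
                      help="e.g. netcdf, grib2")

    # Hypothetical `combine` subcommand: many reference JSONs -> one store.
    combine = sub.add_parser("combine", help="Combine reference JSONs")
    combine.add_argument("jsons", nargs="+", help="Reference JSON files to merge")
    combine.add_argument("--output", default="combined.json")
    return parser

args = build_parser().parse_args(["scan", "-i", "s3://bucket/file.nc"])
print(args.command, args.input)  # → scan s3://bucket/file.nc
```

The equivalent single-command design would instead multiplex on extra flags, which is indeed mostly a matter of style, as noted above.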
Hello, what is the status of this "feature"?
> Hello, what is the status of this "feature"?
Hello, currently I don't have time to work on this. Happy if someone wants to take over.
I am working-out something over at https://github.com/NikosAlexandris/rekx.
Step by step, I have some very DRAFT code, without tests, at https://github.com/NikosAlexandris/rekx/tree/main/rekx in a 'works-for-me' state. @martindurant any interest in seeing this growing?
> any interest in seeing this growing?
I wouldn't use it personally, but it seems that some in here would, so I'd be happy to include something like this.
I am working on it: https://github.com/NikosAlexandris/rekx#examples -- these are just a small part of what rekx can already crunch. In time I will add examples for kerchunking massive datasets.
@NikosAlexandris , I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.
I'd appreciate some guidance on all matters about kerchunk and, of course, I'd be grateful for suggestions to eventually make this effort meaningful beyond my own needs. Some examples:
- https://nikosalexandris.github.io/rekx/how_to/kerchunk_to_json/
- https://nikosalexandris.github.io/rekx/how_to/kerchunk_to_parquet/#verify which is a reminder to self to work further on https://github.com/fsspec/kerchunk/issues/345#issuecomment-1809384846.
Maybe we can better shape it before asking for exposure?
ps- A larger tutorial using SARAH3 products is on its way, thanks also to the good people at the German Weather Service (DWD) who actually produce these data.
@martindurant And of course, in case I wasn't clear: I am happy with either scenario if this goes well, integrating it directly here or linking to it. Whatever works better.
I have a slight preference to integrate it into kerchunk, using the command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages or, if being executable is useful, Pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html).
> I have a slight preference to integrate it into kerchunk, using the command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages, or (if executable is useful), pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html).
Would you prefer a rather clean kerchunking interface (i.e. kerchunk reference, kerchunk combine, and more from what is in the kerchunk API and makes sense to expose to the command line)? Or would you accept also keeping, in some form, some of the inspect, shapes, select/read-performance and rechunk-generator commands?
> would you accept keeping also
Yes, I think it's fine to have all those commands - they can be helpful shortcuts in some places.
I am working on rekx further, as it serves my work. The idea is to bring it into a cleaner shape before integrating it into kerchunk; I feel some important bits are currently rather messy. My main concern is to achieve a clean and logical correspondence between commands (based on Typer, currently defined in https://github.com/NikosAlexandris/rekx/blob/main/rekx/cli.py), which consume CLI modules (e.g. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/inspect.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/shapes.py), which in turn consume something like an API (i.e. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/netcdf_metadata.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/diagnose.py).
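The three-layer correspondence described here (command, CLI module, API function) can be sketched minimally. The function names and the chunk shape below are entirely hypothetical illustrations of the layering, not rekx's actual code:

```python
# API layer: pure functions with no CLI concerns
# (cf. rekx/netcdf_metadata.py, rekx/diagnose.py in the links above).
def get_chunking_shape(path: str) -> tuple:
    # Hypothetical stand-in for reading chunk shapes from a NetCDF file.
    return (1, 721, 1440)

# CLI-module layer: calls the API and formats results for the terminal
# (cf. rekx/shapes.py, rekx/inspect.py).
def shapes_command(path: str) -> str:
    shape = get_chunking_shape(path)
    return f"{path}: chunks {shape}"

# Command layer: a Typer (or argparse) app would register shapes_command
# so that it runs as e.g. `rekx shapes data/example.nc`.
print(shapes_command("data/example.nc"))
```

The benefit of keeping the API layer free of CLI concerns is that kerchunk (or any other caller) can reuse it directly, while the command layer stays a thin registration shim.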
Ah, and testing... of course! This front needs some love.
It would be good, however, to start a discussion on the integration (overall requirements, dependencies, things to do and things not to do) at some point and formalise the to-do tasks (?).
I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.
> on the integration, requirements overall, dependencies, things to do and things not to do
Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.
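One stdlib-only way to produce a reasonable error message when an optional file-type reader is missing is a small lazy-import helper; the package names and wording here are illustrative, not an existing kerchunk API:

```python
import importlib

def require(package: str, purpose: str):
    """Import an optional dependency, or exit with an actionable message."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise SystemExit(
            f"{purpose} requires the optional package '{package}'. "
            f"Install it with: pip install {package}"
        ) from exc

# Example: a GRIB2 reader would only pull in its dependency when used,
# e.g. cfgrib = require("cfgrib", "Scanning GRIB2 files")
```

Only the subcommand that actually needs the reader pays the import cost, and a missing package surfaces as an install hint instead of a raw traceback.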
> I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.
I hope I can make it.
> > on the integration, requirements overall, dependencies, things to do and things not to do
>
> Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.
You are right, I will try to contribute useful things while I expect to learn a lot from the interaction and the experience.
See https://discourse.pangeo.io/t/kerchunk-planning/4002/2 for the kerchunk planning thread.