
Feature/simple cli for chunking local or remote NetCDF files

Open steph-ben opened this issue 2 years ago • 24 comments

Hello, thanks for this lib !

I ended up rewriting the scan and consolidate parts from your tutorial several times. I thought this small CLI would be of interest when working outside a notebook! Happy to hear your view on this.

Usage example :

$ kerchunk-nc -i s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc -i s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc
INFO:kercli:Scanning s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Scanning s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Data loaded from json/mydataset : 2 found
INFO:kercli:Consolidating to zarr/mydataset.zarr ...

This results in:

$ tree json/ zarr/
json/
└── mydataset
    └── s3:
        └── era5-pds
            └── 2020
                ├── 01
                │   └── data
                │       └── air_pressure_at_mean_sea_level.json
                └── 02
                    └── data
                        └── air_pressure_at_mean_sea_level.json
zarr/
└── mydataset.zarr
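
The layout above suggests that each input URL maps to one JSON reference file under the JSON directory and dataset name. A minimal sketch of that mapping, assuming the defaults `json` and `mydataset` seen in the example; the function name `reference_path` is hypothetical and is not taken from the PR:

```python
from pathlib import PurePosixPath


def reference_path(url, json_dir="json", name="mydataset"):
    """Map an input URL to the JSON reference path shown in the tree above.

    Hypothetical helper: the directory defaults are assumptions read off
    the example output, not documented defaults of the CLI.
    """
    # Drop empty segments so "s3://bucket/key" becomes "s3:", "bucket", "key".
    parts = [part for part in url.split("/") if part]
    # Swap the input file extension for ".json".
    return PurePosixPath(json_dir, name, *parts).with_suffix(".json")


print(reference_path("s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc"))
```

Keeping the full URL in the path avoids name clashes between files that share a basename, such as the two monthly files in the example.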

Help looks like :

$ kerchunk-nc --help
Usage: kerchunk-nc [OPTIONS]

  Cli for ker-chunking local or remote NetCDF files

Options:
  --name TEXT           Dataset name  [default: mydataset]
  -i, --input TEXT      Input file url, readable by fsspec  [required]
  --input-format TEXT
  --input-fs-args TEXT  Arguments that will be passed to fsspec.open()
                        [default: {'anon': True}]
  --json-dir TEXT       Where to store scan output as json
  --zarr-output TEXT    Output of fully merged kerchunk zarr file
  --force-scan          Force scanning input file, even if json file exists
  -v, --verbose
  --help                Show this message and exit.
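
For readers who want to see how such an option set could be declared, here is a rough sketch using only the standard library's `argparse`. It is not the PR's implementation. Parsing `--input-fs-args` as JSON and collecting repeated `-i` values into a list named `inputs` are assumptions made for illustration.

```python
import argparse
import json


def build_parser():
    """Sketch of the option set shown in the help text above (argparse version)."""
    parser = argparse.ArgumentParser(
        prog="kerchunk-nc",
        description="Cli for ker-chunking local or remote NetCDF files",
    )
    parser.add_argument("--name", default="mydataset", help="Dataset name")
    # Repeating -i accumulates several input URLs, as in the usage example.
    parser.add_argument("-i", "--input", dest="inputs", action="append",
                        required=True, help="Input file url, readable by fsspec")
    parser.add_argument("--input-format")
    # Assumption: the value is given as a JSON object on the command line.
    parser.add_argument("--input-fs-args", type=json.loads, default={"anon": True},
                        help="Arguments that will be passed to fsspec.open()")
    parser.add_argument("--json-dir", help="Where to store scan output as json")
    parser.add_argument("--zarr-output",
                        help="Output of fully merged kerchunk zarr file")
    parser.add_argument("--force-scan", action="store_true",
                        help="Force scanning input file, even if json file exists")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


args = build_parser().parse_args(["-i", "a.nc", "-i", "b.nc"])
print(args.inputs, args.name, args.input_fs_args)
```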

steph-ben avatar Mar 15 '23 09:03 steph-ben

I wonder, are you aware of pangeo-forge? It provides a recipe-runner abstraction for reading various xarray-supported file types and converting them for storage. That conversion can go via kerchunk to produce JSON files, as you are doing. It is mostly aimed at running recipes automatically on various cloud backends, so at very large datasets, but you can execute a recipe locally in a way that is probably quite similar to the CLI here. I am not opposed to the CLI, but if pangeo-forge is simple enough to use for the same purpose, it seems better not to duplicate effort. Would you mind having a look to see whether what is there makes sense to you and whether you can easily reach the same workflow?

If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).

martindurant avatar Mar 15 '23 13:03 martindurant

Also, before I forget: the auto_dask function also does much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.
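
To illustrate the general idea of scanning files in parallel and combining the results in stages, here is a generic sketch using only the standard library. It does not call kerchunk and is not how auto_dask is implemented. `scan` and `combine` are hypothetical stand-ins for producing and merging reference sets.

```python
from concurrent.futures import ThreadPoolExecutor


def scan(url):
    """Hypothetical stand-in for scanning one file into a reference set."""
    return {"files": [url]}


def combine(reference_sets):
    """Hypothetical stand-in for merging several reference sets into one."""
    merged = []
    for refs in reference_sets:
        merged.extend(refs["files"])
    return {"files": merged}


def scan_and_combine(urls, n_batches=2):
    """Scan every URL in parallel, combine per batch, then combine the batches."""
    with ThreadPoolExecutor() as pool:
        scanned = list(pool.map(scan, urls))
        # Ceiling division, with a floor of 1 so an empty input is harmless.
        size = max(1, -(-len(scanned) // n_batches))
        batches = [scanned[i:i + size] for i in range(0, len(scanned), size)]
        partial = list(pool.map(combine, batches))
    # Final reduction step over the per-batch results.
    return combine(partial)


print(scan_and_combine(["a.nc", "b.nc", "c.nc"], n_batches=2))
```

Combining in batches first keeps each merge step small, which matters when there are many input files.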

martindurant avatar Mar 15 '23 13:03 martindurant

I had a quick look at pangeo-forge before this PR, but it seems very cloud-oriented to me, and a bit too much of a "big thing" for what I want.

My use case is really the simple one that is easy to demonstrate and explain, where everything goes well and there is no need to write any Python.

As I understand it, pangeo-forge aims to cover all use cases (hence the need to write some Python in recipe.py) and to provide a cloud-ready CI (which is really, really great work!).

Possible ways forward:

  • Continue this CLI in kerchunk, covering simple use cases directly from this lib, without external libs
  • Make a simple-use-case CLI in pangeo-forge-recipes

Happy to get your view on this.

steph-ben avatar Mar 16 '23 09:03 steph-ben

Also, before I forget: the auto_dask function also does much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.

Thanks, I totally missed this function! I will certainly use it if we go ahead.

If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).

I started with NetCDF files for now, but yes, I need to cover GRIB as well.

steph-ben avatar Mar 16 '23 09:03 steph-ben

Would it make sense to make this a sub command? Then there could be another sub command for combining ref json files?
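
A subcommand layout along these lines could be sketched with `argparse` subparsers. The command names `scan` and `combine`, their arguments, and the function name are hypothetical, chosen only to illustrate the suggestion.

```python
import argparse


def build_subcommand_parser():
    """Sketch of a CLI with separate scan and combine subcommands (hypothetical names)."""
    parser = argparse.ArgumentParser(prog="kerchunk")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Scan input files into reference JSON files.
    scan = subcommands.add_parser("scan", help="Scan input files into reference JSON")
    scan.add_argument("inputs", nargs="+", help="Input file urls")

    # Combine existing reference JSON files into one reference set.
    combine = subcommands.add_parser("combine", help="Combine reference JSON files")
    combine.add_argument("references", nargs="+", help="Reference JSON files")
    combine.add_argument("-o", "--output", required=True, help="Combined output path")
    return parser


args = build_subcommand_parser().parse_args(
    ["combine", "a.json", "b.json", "-o", "combined.json"]
)
print(args.command, args.references, args.output)
```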

abkfenris avatar Mar 17 '23 14:03 abkfenris

Would it make sense to make this a sub command?

I have no preference between subcommands and passing extra arguments, it's just a matter of style.

martindurant avatar Mar 18 '23 19:03 martindurant

Hello, what is the status of this "feature"?

NikosAlexandris avatar Oct 18 '23 20:10 NikosAlexandris

Hello, what is the status of this "feature"?

Hello, currently I don't have time to work on this. Happy if someone wants to take over.

steph-ben avatar Nov 16 '23 09:11 steph-ben

Hello, what is the status of this "feature"?

Hello, currently I don't have time to work on this. Happy if someone wants to take over.

I am working-out something over at https://github.com/NikosAlexandris/rekx.

NikosAlexandris avatar Nov 19 '23 18:11 NikosAlexandris

Step-by-step, I have some Very DRAFT without tests at https://github.com/NikosAlexandris/rekx/tree/main/rekx in a 'works-for-me' state. @martindurant any interest in seeing this growing ?

NikosAlexandris avatar Dec 14 '23 00:12 NikosAlexandris

any interest in seeing this growing?

I wouldn't use it personally, but it seems that some in here would, so I'd be happy to include something like this.

martindurant avatar Dec 22 '23 19:12 martindurant

I am working on it : https://github.com/NikosAlexandris/rekx#examples -- these are just a small part of what rekx can already crunch. In time I will add examples for Kerchunking massive datasets.

NikosAlexandris avatar Dec 30 '23 23:12 NikosAlexandris

@NikosAlexandris , I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.

martindurant avatar Jan 04 '24 16:01 martindurant

@NikosAlexandris , I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.

I'd appreciate some guidance on all matters Kerchunk and, of course, I'd be grateful for suggestions to eventually make this effort meaningful beyond my own needs. Some examples:

  • https://nikosalexandris.github.io/rekx/how_to/kerchunk_to_json/
  • https://nikosalexandris.github.io/rekx/how_to/kerchunk_to_parquet/#verify which is a reminder to self to work further on https://github.com/fsspec/kerchunk/issues/345#issuecomment-1809384846.

Maybe we can better shape it before asking for exposure ?

PS: A larger tutorial using SARAH3 products is on its way, also thanks to the good people at the German weather service (DWD) who actually produce these data.

NikosAlexandris avatar Jan 05 '24 00:01 NikosAlexandris

@martindurant And of course, in case I wasn't clear: if this goes well, I am fine with either scenario, whether integrating it directly here or linking to it. Whatever works better.

NikosAlexandris avatar Jan 05 '24 12:01 NikosAlexandris

I have a slight preference to integrate it into kerchunk, using command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages, or (if executable examples are useful) pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html ).

martindurant avatar Jan 12 '24 15:01 martindurant

I have a slight preference to integrate it into kerchunk, using command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages, or (if executable examples are useful) pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html ).

Would you prefer a rather clean Kerchunking interface (i.e. kerchunk reference, kerchunk combine and more from what is in the Kerchunk API and makes sense to expose to the command line) ? Or would you accept keeping also, in some form, some of the inspect, shapes, select/read-performance and rechunk-generator commands too ?

NikosAlexandris avatar Jan 14 '24 23:01 NikosAlexandris

would you accept keeping also

Yes, I think it's fine to have all those commands - they can be helpful shortcuts in some places.

martindurant avatar Jan 15 '24 14:01 martindurant

I am working further on rekx, as it serves my own work. The idea is to bring it into cleaner shape before integrating it into Kerchunk, since I feel some important bits are currently rather messy. My main concern is to achieve a clean and logical correspondence between three layers: commands (based on Typer, currently defined in https://github.com/NikosAlexandris/rekx/blob/main/rekx/cli.py), which consume CLI modules (e.g. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/inspect.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/shapes.py), which in turn consume something like an API (i.e. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/netcdf_metadata.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/diagnose.py).

Ah, and testing... of course! This front needs some love.

It would be good however to start a discussion on the integration (requirements overall, dependencies, things to do and things not to do) at some point and formalise the tasks to-do (?).

NikosAlexandris avatar Jan 18 '24 10:01 NikosAlexandris

I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.

on the integration, requirements overall, dependencies, things to do and things not to do

Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.
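
One way a CLI could produce such messages is to check for the optional packages before dispatching to a reader. A minimal sketch follows; the `EXTRAS` mapping, its package names, and both function names are illustrative assumptions, not kerchunk's declared extras.

```python
import importlib.util

# Hypothetical mapping from file type to the importable modules its reader needs.
# The entries are illustrative assumptions, not kerchunk's packaging metadata.
EXTRAS = {"netcdf": ["h5py"], "grib2": ["cfgrib"]}


def missing_packages(file_type, extras=EXTRAS):
    """Return the modules needed for file_type that cannot be imported."""
    return [name for name in extras[file_type]
            if importlib.util.find_spec(name) is None]


def require(file_type, extras=EXTRAS):
    """Exit with a readable message instead of a raw ImportError traceback."""
    missing = missing_packages(file_type, extras)
    if missing:
        raise SystemExit(
            f"Reading {file_type} files needs extra packages: "
            f"{', '.join(missing)}. Please install them and retry."
        )
```

Calling `require("grib2")` at the start of a command would then stop early with a one-line hint rather than failing deep inside a reader.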

martindurant avatar Jan 24 '24 15:01 martindurant

I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.

I hope I can join.

on the integration, requirements overall, dependencies, things to do and things not to do

Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.

You are right, I will try to contribute useful things while I expect to learn a lot from the interaction and the experience.

NikosAlexandris avatar Jan 24 '24 15:01 NikosAlexandris

https://discourse.pangeo.io/t/kerchunk-planning/4002/2 for the kerchunk planning thread

martindurant avatar Jan 30 '24 15:01 martindurant