
Upcoming major improvements

Open martindurant opened this issue 3 years ago • 19 comments

Stuff that would be cool to get done and is well within our capacity. Please vote if you have favourites here!

  • [x] automagically run combine in dask tree reduction mode #255
  • [x] split to subchunk utility for uncompressed data. #251
  • [ ] concatenate from subchunks for irregular inputs and primitive sharding support
  • [x] coords generation (e.g., FITS WCS, geoTIFF bounding boxes) for flexible xarray indexes (partial: #192 )
  • [ ] parquet storage (from preffs) #277 with
    • [ ] lazy loading
    • [ ] sorted key partitioning
  • [ ] subsampling and multiscale pyramid generation
  • [ ] checksums or other UUIDs for remote files, to check for changes
  • [ ] linked zarr/numcodecs: allow for read context to be passed to codec and storage layer, so that we can, for instance, apply a different scale to a cube at each pane https://github.com/zarr-developers/zarr-python/pull/1131
  • [x] consolidate (nearly) adjoining reads in ReferenceFileSystem as is done in fsspec.parquet https://github.com/fsspec/filesystem_spec/pull/1063
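For context on several of these items (parquet storage, consolidated reads, checksums), a kerchunk reference set today is a JSON mapping where each key is either inline content or a `[url, offset, length]` triple into a remote file. A minimal sketch -- all paths, offsets, and lengths here are made up:

```python
import json

# Hypothetical "version 1" reference set: metadata keys hold inline
# strings; chunk keys hold [url, offset, length] byte ranges.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/0.0": ["s3://example-bucket/data.nc", 20480, 4096],
        "temp/1.0": ["s3://example-bucket/data.nc", 24576, 4096],
    },
}

# The whole set round-trips through JSON; for very large datasets this
# single blob is what the storage-format discussion tries to replace.
loaded = json.loads(json.dumps(refs))
assert loaded["refs"]["temp/0.0"] == ["s3://example-bucket/data.nc", 20480, 4096]
```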

martindurant avatar Aug 08 '22 13:08 martindurant

I would vote for "coords generation" especially with FITS.

nabobalis avatar Aug 17 '22 18:08 nabobalis

  1. "coords generation" for GeoTIFF
  2. handling netcdf files with different scale_factor and add_offset (not on original list, but...)
  3. parquet

rsignell-usgs avatar Aug 18 '22 16:08 rsignell-usgs

automagically run combine in dask tree reduction mode for me please!

emfdavid avatar Sep 09 '22 15:09 emfdavid

"parquet storage (from preffs)" this sounds nifty, but I'll add for what it's worth, I did discuss with @jakirkham what it would look like to use zarr for the storage of kerchunk itself :smile:

joshmoore avatar Nov 10 '22 15:11 joshmoore

I would certainly love your ideas, and the thought had certainly occurred to me.

In favour of parquet:

  • the data is essentially tabular, no benefit from chunking on higher dimensions
  • most of the data (keys and embedded chunks or metadata) are str/bytes. OTOH, the keys (required for every entry) could be fixed-string
  • partitioning into unequal-sized pieces, allowing, for instance, all of the references of one variable to live together and only be loaded at need; the column min/max values in the metadata also help with this
  • in preffs, each key may appear multiple times, to indicate concatenation of subchunks. Parquet could maybe also achieve this with variable-length lists of references. I'm unconvinced that this is a good idea, but zarr doesn't have the capability.
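To illustrate the tabular point above: a reference dict flattens naturally into fixed columns. This is only a sketch of the idea, not kerchunk's actual parquet schema:

```python
# Flatten a reference dict into columnar form: path/offset/size for
# remote chunks, raw for inline content. Paths here are made up.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "x/0": ["s3://bucket/f.nc", 100, 50],
    "x/1": ["s3://bucket/f.nc", 150, 50],
}
columns = {"key": [], "path": [], "offset": [], "size": [], "raw": []}
for key, val in sorted(refs.items()):  # sorted keys help min/max pruning
    columns["key"].append(key)
    if isinstance(val, list):
        path, offset, size = val
        columns["path"].append(path)
        columns["offset"].append(offset)
        columns["size"].append(size)
        columns["raw"].append(None)
    else:
        columns["path"].append(None)
        columns["offset"].append(0)
        columns["size"].append(0)
        columns["raw"].append(val)

assert columns["offset"] == [0, 100, 150]
```

Writing these columns with pyarrow/fastparquet, partitioned so each variable's references land in their own row groups, is then straightforward.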

In favour of zarr:

  • it is already a requirement. Parquet also brings in pandas as a requirement.
  • the parquet references would be stored in memory as a dataframe (right?), which has significantly slower indexing compared to dicts from JSON. Raw numpy arrays from zarr might be more efficient. It is worth noting that keys ought to be ASCII.
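A small illustration of the fixed-width/ASCII point -- raw numpy (and hence zarr) storage could hold keys as fixed-length byte strings, and numpy's str-to-bytes conversion assumes ASCII:

```python
import numpy as np

# Reference keys as a fixed-width byte-string array, the layout raw
# zarr/numpy storage would use; non-ASCII keys would fail to encode.
keys = np.array(["temp/0.0", "temp/1.0", ".zgroup"], dtype="S32")
assert keys[1] == b"temp/1.0"
```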

martindurant avatar Nov 10 '22 16:11 martindurant

Moving to parquet or zarr sounds like a great idea. I am having some success with HRRR. I will try to share results next week.

emfdavid avatar Nov 10 '22 17:11 emfdavid

@emfdavid can you give us an update here? I'm hitting memory issues trying to generate/use kerchunk on the NWM 1km gridded CONUS dataset from https://registry.opendata.aws/nwm-archive/. Creating/loading the consolidated JSON for just 10 years of this 40 year dataset takes 16GB of RAM.

rsignell-usgs avatar Dec 14 '22 15:12 rsignell-usgs

@rsignell-usgs , are you using tree reduction? Since there is a lot of redundancy between the individual files, that should need less peak memory.

martindurant avatar Dec 14 '22 15:12 martindurant

@martindurant , yes, there are 100,000+ individual JSONs that cover the 40 year period. I use 40 workers that each consolidate a single year. Access to the individual single year JSON (which takes 1.5GB memory) is shown here: https://nbviewer.org/gist/26f42b8556bf4cab5df81fc924342d5d

I don't have enough memory on the ESIP qhub to combine the 40 JSONs into a single JSON. :(

rsignell-usgs avatar Dec 14 '22 15:12 rsignell-usgs

You might still be able to tree further: try combining in batches of 5 or 8, and then combining those?
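The batching suggestion is just a tree reduction. A toy sketch of the pattern -- `combine_refs` is a hypothetical stand-in for the real combine step (in kerchunk, `MultiZarrToZarr(...).translate()`), and here it just merges dicts:

```python
# Combine reference sets in batches of k, then combine the intermediate
# results, until one set remains. Peak memory per call is ~k sets rather
# than all of them at once.
def tree_combine(ref_sets, combine_refs, k=5):
    while len(ref_sets) > 1:
        ref_sets = [
            combine_refs(ref_sets[i:i + k])
            for i in range(0, len(ref_sets), k)
        ]
    return ref_sets[0]

merged = tree_combine(
    [{"a": 1}, {"b": 2}, {"c": 3}],
    lambda batch: {k2: v for d in batch for k2, v in d.items()},
    k=2,
)
assert merged == {"a": 1, "b": 2, "c": 3}
```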

martindurant avatar Dec 14 '22 15:12 martindurant

After creating the filesystem for one year, I see 1.2GB in use. I'll look into it.

I am indeed working on the parquet backend, which should give better memory footprint per reference set; but strings are still strings, so all those paths add up once in memory unless templates are only applied at access time. Hm.

However, it may be possible, instead, to make the combine process not need to load all the reference sets up front.

martindurant avatar Dec 14 '22 16:12 martindurant

Hi Rich, I am travelling this week. Martin is ahead of me anyway, though. I am working on open sourcing the HRRR aggregation I built, but I was over-optimistic about doing it while travelling for work.

I am doing a four-step tree process:

  1. scan_grib to extract the metadata and write raw objects one to one with the original forecast hour grib files
  2. Daily multizarr aggregations from each individual forecast hour
  3. Monthly multizarr aggregations from each of the daily aggregations
  4. All time multizarr aggregations from the monthly aggregations

I do this on a per-forecast-horizon basis, so I end up with aggregations for the 0, 1, 2, ..., 17 and 18 hour horizons. Then I get 6 hour aggregations out to the 48 hour horizon, because HRRR only runs a full 48 hour model every six hours: 19-24, 25-30, 31-36, 37-42, 43-48.
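The per-horizon split above amounts to bucketing the source files by forecast hour before any combining happens. A sketch, assuming the standard HRRR filename pattern (`hrrr.tCCz.wrfsfcfHH.grib2`):

```python
import re

# Group grib files by forecast horizon so each horizon gets its own
# aggregation tree. Filenames here are illustrative.
files = [
    "hrrr.t00z.wrfsfcf00.grib2",
    "hrrr.t00z.wrfsfcf01.grib2",
    "hrrr.t06z.wrfsfcf00.grib2",
]
by_horizon = {}
for f in files:
    horizon = int(re.search(r"f(\d+)\.grib2$", f).group(1))
    by_horizon.setdefault(horizon, []).append(f)

assert sorted(by_horizon) == [0, 1]
```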

I am thinking I will drop the all time aggregation and use the multizarr tool on the fly to build the date range that I need from the monthly chunks. I am on the train right now with terrible wifi - I will try and grab memory use stats for you later.

Best

David


emfdavid avatar Dec 14 '22 19:12 emfdavid

Thanks for the update @emfdavid. Are those HRRR JSONs in a publicly accessible bucket? (perhaps requester-pays?) Have an example notebook?

rsignell-usgs avatar Dec 15 '22 14:12 rsignell-usgs

@martindurant I was able to create four 10 year combined JSONs from the 40 individual yearly JSON files.

The process to create each of these 10-year files took 16GB of the 32GB memory for the Xlarge instance at https://jupyer.qhub.esipfed.org.

I was unable to create the 40 year combined file from these four 10 year files though -- it blew past the 32GB memory.

rsignell-usgs avatar Dec 15 '22 14:12 rsignell-usgs

Try https://github.com/fsspec/kerchunk/pull/272

martindurant avatar Dec 15 '22 14:12 martindurant

With latest commits in 272, I could combine 13 years directly with peak memory around 13GB.

martindurant avatar Dec 15 '22 20:12 martindurant

Just to make sure I've got the right version, I have this. You?

09:53 $ conda list kerchunk
# packages in environment at /home/conda/users/envs/pangeo:
#
# Name                    Version                   Build  Channel
kerchunk                  0.0.1+420.gca577c4.dirty          pypi_0    pypi

rsignell-usgs avatar Dec 16 '22 16:12 rsignell-usgs

yes

martindurant avatar Dec 16 '22 16:12 martindurant

FYI the Pangeo ML augmentation with support for some of these tasks through the NASA ACCESS 2019 program is on FigShare.

maxrjones avatar Mar 13 '23 22:03 maxrjones