
cannot use rechunker starting from `0.4.0`

Open apatlpo opened this issue 3 years ago • 18 comments

I apologize in advance for posting an issue that may be incomplete.

After a recent library update I am no longer able to use rechunker for my use case. This is on an HPC platform.

Symptoms are that nothing happens on the dask dashboard when launching the actual rechunking with `execute`. Running `top` on the scheduler node shows 100% CPU usage and slowly increasing memory usage. By contrast, activity appears on the dask dashboard right away with an older version of rechunker (0.3.3).

git bisecting versions indicates:

6cc0f26720bfecb1eba00579a13d9b7c8004f652 is the first bad commit

I am not sure what I could do in order to investigate further and would welcome suggestions.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.06.0
distributed: 2021.06.0
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: 7.24.1
sphinx: None

apatlpo avatar Jun 08 '21 21:06 apatlpo

Thanks for reporting this. Indeed it looks like 6cc0f26720bfecb1eba00579a13d9b7c8004f652 (part of #77) caused this problem.

@TomAugspurger - this sounds very similar to the problems we are having in Pangeo Forge, which we are attempting to fix in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153.

The basic theory is that the way we are building dask graphs post #77 results in objects that are much too large.

Aurelien, if you wanted to help debug, you could run the same steps that Tom ran in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/116#issuecomment-839905902 and report the results back here.
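Roughly, those steps time how long each layer of the dask high-level graph takes to pickle and how many bytes result. A sketch of that kind of diagnostic is below; the exact script in the linked comment may differ, and how you get at the underlying dask collection from the `Rechunked` object depends on the rechunker version.

```python
import pickle
import time
from collections import defaultdict


def profile_graph_layers(dask_obj):
    """Group the layers of a dask collection's high-level graph by name prefix
    and report how long each group takes to pickle and how many bytes result."""
    hlg = dask_obj.__dask_graph__()  # any dask collection: Delayed, Array, ...
    times = defaultdict(float)
    sizes = defaultdict(int)
    for name, layer in hlg.layers.items():
        prefix = name.rsplit("-", 1)[0]  # strip the token, e.g. "copy_chunk-abc123" -> "copy_chunk"
        start = time.perf_counter()
        payload = pickle.dumps(dict(layer))  # materialize the layer and serialize it
        times[prefix] += time.perf_counter() - start
        sizes[prefix] += len(payload)
    print(f"{'hl_key':<20}{'time':<25}{'size'}")
    for key in sorted(times):
        print(f"{key:<20}{times[key]:<25.6g}{sizes[key]}")
```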

From my point of view, we could simply revert #77. The main point of that was to have a representation of pipelines we could share between rechunker and pangeo-forge-recipes. But since we are likely abandoning the rechunker dependency in pangeo-forge-recipes, this no longer seems necessary.

rabernat avatar Jun 09 '21 12:06 rabernat

Here is the result with version 0.3.3:

| hl_key | time | size |
| --- | --- | --- |
| array | 0.0001261560246348381 | 51329.0 |
| barrier | 4.3830834329128265e-05 | 594.0 |
| from-zarr | 0.00048235198482871056 | 25113.0 |
| from-zarr-store | 0.0008570537902414799 | 7104.0 |
| getitem | 0.09110138937830925 | 871471.0 |
| open_dataset | 0.00033695297315716743 | 1597.0 |
| rechunked | 7.380731403827667e-06 | 132.0 |
| store | 0.006245309021323919 | 224540.0 |

and with latest version (73ef80b):

| hl_key | time | size |
| --- | --- | --- |
| copy_chunk | 381.4043052890338 | 9747762599.0 |
| merge | 1.449882984161377e-05 | 304.0 |
| stage | 0.0010634679347276688 | 19195.0 |

Please let me know if I need to adjust anything; you can find the full code in the following notebook.

apatlpo avatar Jun 14 '21 14:06 apatlpo

Ok that definitely confirms my suspicion that the way we are generating dask graphs is not efficient. 🤦 Looks like we will have to revert quite a few changes.

@apatlpo - how urgent is this problem for you? Can you just keep using 0.3.0 for now?

rabernat avatar Jun 14 '21 14:06 rabernat

Absolutely no rush on my side; I am perfectly happy with 0.3.3 (sorry, corrected a typo in an earlier post). Thanks for your concern.

apatlpo avatar Jun 14 '21 18:06 apatlpo

I just stumbled across this, having a similar use case. For now, using 0.3.3 seems to be fine for me as well.

d70-t avatar Jun 16 '21 18:06 d70-t

I also wanted to second this - I was having problems rechunking a variable that had many small chunks. Dask would get stuck in the first few moments of trying to rechunk. Reverting to 0.3.3 solved this issue for me.

rybchuk avatar Jul 04 '21 12:07 rybchuk

+1 on this issue. Reverting to 0.3.3 seems to solve it, though unrelatedly I'm getting workers using 2-3x the max_mem setting (possibly related to the issues of #54). This might be fixed by v0.4; I just haven't been able to test it.

That being said, anecdotally I seem to see workflows using more memory than expected with recent updates to dask/distributed (starting around 2021.6, I believe), so it might be unrelated to rechunker. Going to do some more investigating.

bolliger32 avatar Jul 17 '21 10:07 bolliger32

I haven't had a chance to look at https://github.com/pangeo-data/rechunker/commit/6cc0f26720bfecb1eba00579a13d9b7c8004f652, but https://github.com/pangeo-forge/pangeo-forge-recipes/issues/116 was determined to be some kind of serialization problem, and was fixed by https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160.

Just to confirm, when people say "using more memory" is that on the workers or the scheduler?

TomAugspurger avatar Jul 17 '21 11:07 TomAugspurger

@TomAugspurger for me it's the workers, and it's usually just one or two workers that quickly balloon up to that size while the remainder seem to respect max_mem. Let me see if I can come up with an example (might put it in a different issue since it's separate from the main thread of this one).
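For anyone trying to pin down the same thing, one quick, rough way to compare per-worker memory against max_mem from the client is the snippet below. It assumes a dask.distributed cluster; the scheduler address is a placeholder and the metric key names may vary slightly between distributed versions.

```python
# Rough check of per-worker memory, to see whether only a couple of workers
# balloon past max_mem while the rest stay within limits.
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

for addr, info in client.scheduler_info()["workers"].items():
    used = info["metrics"]["memory"]   # resident memory reported by the worker
    limit = info.get("memory_limit")   # configured worker memory limit, if any
    print(f"{addr}: {used / 1e9:.2f} GB used, limit {limit}")
```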

bolliger32 avatar Jul 17 '21 12:07 bolliger32

Apologies for the slow pace here. I've been on vacation much of the past month.

@TomAugspurger - I'd like to chart a path towards resolving this. As discussed in today's Pangeo Forge meeting, this issue has implications for Pangeo Forge architecture. We have to decide whether to maintain the Pipeline abstraction or deprecate it. The performance problems described in this issue are closely tied to the switch to the Pipeline executor model. However, as you noted today, the Pipeline model is not inherently different from what is now implemented directly in BaseRecipe.to_dask() (https://github.com/pangeo-forge/pangeo-forge-recipes/blob/49997cb52cff466bd394c1348ef23981e782a4d9/pangeo_forge_recipes/recipes/base.py#L113-L154). So really the difference comes down to the way the various functions and objects are partialized / serialized.

I'd personally like to keep the Pipelines framework and even break it out into its own package. But that only makes sense if we can get it working properly within rechunker first.

I understand that you (Tom) probably don't have lots of time to dig into this right now. If that's the case, it would be great if you could at least help outline a work plan to get to the bottom of this issue. We have several other developers (e.g. me, @cisaacstern, @alxmrs, @TomNicholas) who could potentially be working on this. But your input would be really helpful to get started.

In https://github.com/pangeo-forge/pangeo-forge-recipes/issues/116#issuecomment-839905902, Tom made a nice diagnosis of the serializability issues in Pangeo Forge recipes. Perhaps we could do the same here. We would need a minimal reproducible example for this problem.
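As a starting point for such an example, here is a toy illustration (not rechunker's actual code path) of how partializing a large object into every task inflates per-task serialization; this is the general mechanism a minimal reproducer would need to isolate.

```python
import pickle
from functools import partial

import dask
import numpy as np

big = np.zeros(250_000)  # ~2 MB stand-in for a large object captured by every task


def work(i, data):
    return float(data[i % data.size]) + i


def per_task_bytes(delayed_obj):
    """Sum the pickled size of every task individually, which is roughly how a
    distributed scheduler ships them (so a shared object is not deduplicated)."""
    graph = dict(delayed_obj.__dask_graph__())
    return sum(len(pickle.dumps(task)) for task in graph.values())


# Variant 1: every task closes over `big` through a partial, so each task
# carries its own ~2 MB copy when serialized.
bloated = dask.delayed(sum)([dask.delayed(partial(work, data=big))(i) for i in range(20)])

# Variant 2: wrap `big` once so tasks only reference it by key.
shared = dask.delayed(big)
lean = dask.delayed(sum)([dask.delayed(work)(i, shared) for i in range(20)])

print(per_task_bytes(bloated))  # tens of MB
print(per_task_bytes(lean))     # a couple of MB
```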

rabernat avatar Aug 16 '21 21:08 rabernat

+1 & thanks! I'll note that reverting solved my issues. In case it helps anyone else, the symptom was roughly that time_to_solution ~ e^(chunk_size). For my desired chunk size in time, I could not complete the rechunking within my cluster's walltime (6 hours); after reverting, it finishes in about 90 seconds.

jmccreight avatar Sep 10 '21 17:09 jmccreight

@jmccreight - would you be able to provide any more details? Were you using Xarray or Zarr inputs? Which executor? Code would be even better. Thanks so much for your help.

rabernat avatar Sep 10 '21 17:09 rabernat

Hi @rabernat! I'm basically doing what @rsignell-usgs did here: https://nbviewer.jupyter.org/gist/rsignell-usgs/c0b87ed1fa5fc694e665fb789e8381bb

The quick overview (a rough, self-contained sketch of this loop follows below):

  • a bunch of output files, one for each timestep
  • loop over n_time_chunks to process, each time grabbing len(files) = n_files_in_time_chunk:
    • ds = xr.open_mfdataset(files)
    • rechunked = rechunk(ds, chunk_plan, max_mem, step_file, temp_file)
    • result = rechunked.execute(retries=10) <-------- problem here
    • ds = xr.open_zarr(file_step, consolidated=False)
    • if not file_chunked.exists(): ds.to_zarr(file_chunked, consolidated=True, mode='w')
    • else: ds.to_zarr(file_chunked, consolidated=True, append_dim='time')
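A rough, self-contained version of that loop might look like the sketch below; all paths, chunk sizes, variable names, and the batching logic here are placeholders, and the actual notebook linked above differs in the details.

```python
from pathlib import Path

import xarray as xr
from rechunker import rechunk

# Placeholder inputs -- adjust to the actual data and cluster.
files = sorted(Path("model_output").glob("*.nc"))  # one file per timestep
n_files_in_time_chunk = 24
chunk_plan = {"some_var": {"time": 24, "y": 350, "x": 350}}  # per-variable target chunks (placeholder)
max_mem = "2GB"
file_chunked = Path("rechunked.zarr")

for i in range(0, len(files), n_files_in_time_chunk):
    batch = files[i : i + n_files_in_time_chunk]
    step_file = Path(f"step_{i}.zarr")  # intermediate target for this batch
    temp_file = Path(f"temp_{i}.zarr")  # scratch space used by rechunker

    ds = xr.open_mfdataset(batch, combine="by_coords")

    # Rechunk this batch of timesteps into the intermediate store.
    rechunked = rechunk(ds, chunk_plan, max_mem, str(step_file), temp_store=str(temp_file))
    rechunked.execute(retries=10)  # <-- the step that hangs on 0.4.x

    # Append the rechunked batch to the final consolidated store.
    ds_step = xr.open_zarr(step_file, consolidated=False)
    if not file_chunked.exists():
        ds_step.to_zarr(file_chunked, consolidated=True, mode="w")
    else:
        ds_step.to_zarr(file_chunked, consolidated=True, append_dim="time")
```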

jmccreight avatar Sep 10 '21 19:09 jmccreight

I just hit this again -- one of my USGS Colleagues had rechunker=0.4.2 and it was hanging, and after reverting to rechunker=0.3.3, it worked fine.

rsignell-usgs avatar Jan 25 '22 17:01 rsignell-usgs

Sorry for all the friction everyone! This package needs some maintenance. In the meantime, should we just pull the 0.4.2 release from pypi?

rabernat avatar Jan 25 '22 17:01 rabernat

Ok so #112 has been merged, and the current master should have a solution to this issue.

@apatlpo - any chance you would be able to try your test with the latest rechunker master branch to confirm it fixes your original problem?

rabernat avatar Mar 23 '22 22:03 rabernat

I'm not the OP, but I had a similar issue. I tried the latest version on master and the rechunking went smoothly. I ran it on a LocalCluster on my laptop, if that makes any difference.

bzah avatar Mar 29 '22 13:03 bzah

Was this fixed in recent releases / can we close?

max-sixty avatar Nov 02 '23 20:11 max-sixty