Progress bar on open_mfdataset
Is your feature request related to a problem?
I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that something is happening, so a progress bar would be nice. I found an example of using a progress bar from dask here: https://github.com/pydata/xarray/issues/4000#issuecomment-619003228
However, my attempt to adapt this solution doesn't show a progress bar. Any other options?
Here is the code I tried:
import xarray as xr
from dask.diagnostics import ProgressBar
with ProgressBar():
    d = xr.open_mfdataset('proc/*.nc')
Describe the solution you'd like
I'd like to see a nice and fairly minimal progress bar, for example telling me how many files have been dealt with so far.
Describe alternatives you've considered
Something based on tqdm would be nice, but could also be something else.
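A per-file counter of the kind described above can be approximated by opening the files one at a time and reporting after each. The sketch below is only an illustration under stated assumptions: `open_files_with_progress`, `open_one`, and `report` are hypothetical names and not part of xarray, it prints plain text instead of using tqdm, and it assumes `xarray.open_dataset` is a suitable default opener.

```python
def open_files_with_progress(paths, open_one=None, report=print):
    """Open each path in turn and report how many files are done so far.

    Hypothetical helper, not part of xarray. `open_one` is any callable
    that takes a path and returns an opened object; `report` receives one
    status string per file.
    """
    paths = list(paths)
    if open_one is None:
        # Assumption: xarray.open_dataset is the opener you want.
        # Imported lazily so the helper also works with a stand-in opener.
        import xarray as xr
        open_one = xr.open_dataset

    opened = []
    total = len(paths)
    for count, path in enumerate(paths, start=1):
        opened.append(open_one(path))
        report(f"opened {count}/{total} files")
    return opened
```

With real files this might be called as `datasets = open_files_with_progress(sorted(glob.glob('proc/*.nc')))`, after which the list still has to be combined, for example with `xr.combine_by_coords(datasets)`. That is slower and less convenient than `open_mfdataset`, so treat it as a way to get feedback rather than a replacement.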
After discussion with a colleague, we ended up with this solution:
import xarray as xr
from dask.diagnostics import ProgressBar
with xr.open_mfdataset('proc/*.nc', chunks=dict(index=1)) as d, ProgressBar():
    d.load()
This works in the strict sense that it displays a progress bar, but unfortunately it does nothing (no progress bar visible) for a couple of minutes (for the set of files I tested), and then the progress bar shows up and runs through in a few seconds. In other words, not very useful for an impatient soul like me.
I should add that I'm testing this in a jupyter notebook.
Indeed, the progress bar shows nothing during opening unless you pass parallel=True to open_mfdataset. That option parallelizes access to the files by creating one dask task per open_dataset call. Without it, open_dataset is called on each file in sequence without going through dask, so dask has nothing to report.
The progress bar activity you do see is the loading of each chunk into memory, which happens when you call d.load(), that is, after open_mfdataset has already returned.
What I think you should try is:
import xarray as xr
from dask.diagnostics import ProgressBar
with xr.open_mfdataset('proc/*.nc', chunks=dict(index=1), parallel=True) as d, ProgressBar():
    d.load()
(You might not need the explicit chunks=dict(index=1); chunks={} could work as well.)
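To make the sequential-versus-parallel distinction concrete: once there is one task per file, progress can be reported as each task finishes. The sketch below illustrates that idea with the standard library's `concurrent.futures` instead of dask, so it is an analogy for what `parallel=True` does and not xarray's implementation. `open_in_parallel`, `open_one`, and `report` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def open_in_parallel(paths, open_one, report=print, max_workers=4):
    """Run one task per path in a thread pool and report each completion.

    Hypothetical helper for illustration only. Results are returned in the
    same order as `paths`, regardless of which task finishes first.
    """
    paths = list(paths)
    results = [None] * len(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per file, mapped back to the file's position in `paths`.
        futures = {pool.submit(open_one, p): i for i, p in enumerate(paths)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            report(f"{done}/{len(paths)} files opened")
    return results
```

With xarray, `open_one` could be `xr.open_dataset`; whether opening netCDF files from several threads is faster on a given system is something to measure, not assume.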
The proposed solution does not seem to work for me. The solution shows a progress bar for .load(), but not for open_mfdataset(), I think. I am running the code below. The call to open_mfdataset() takes about 10min, but no progress bar is being displayed. Maybe this issue is related to using progress instead of ProgressBar (see stackoverflow)?
import xarray as xr
from dask.diagnostics import ProgressBar
print('Start opening')
# filepaths: list of netCDF file paths, defined elsewhere
with xr.open_mfdataset(filepaths,
                       combine='nested',
                       concat_dim='Time',
                       parallel=True) as ds, ProgressBar():
    print('Opened files into xr.')
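One possible explanation, not confirmed in this thread: in `with A() as a, B():`, Python evaluates and enters `A()` before it creates `B()`. In the snippet above that would mean open_mfdataset has already finished by the time ProgressBar becomes active. The stdlib-only sketch below demonstrates the ordering with stand-ins; `fake_open` and `FakeProgressBar` are hypothetical names, not xarray or dask objects.

```python
from contextlib import nullcontext

events = []


def fake_open():
    # Stands in for open_mfdataset: the slow work happens during the call.
    events.append("open: work happens here")
    return nullcontext("dataset")


class FakeProgressBar:
    # Stands in for dask.diagnostics.ProgressBar: only "listens" while entered.
    def __enter__(self):
        events.append("bar: active")
        return self

    def __exit__(self, *exc):
        events.append("bar: inactive")
        return False


# Same shape as the snippet above: the open call comes first in the `with`.
with fake_open() as d, FakeProgressBar():
    events.append("body")

print(events)
```

The recorded events show the open call completing before the bar is active. If that is what is happening here, entering `ProgressBar()` first and calling `open_mfdataset(..., parallel=True)` inside that block would be the thing to try. Whether a bar then appears may also depend on the scheduler, since `dask.diagnostics.ProgressBar` reports only on dask's local schedulers.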