Segfault writing large netcdf files to s3fs
What happened?
It seems that writing NetCDF files does not currently work well on s3fs, the FUSE filesystem layer over S3-compatible storage, with either the default netcdf4 engine or the h5netcdf engine.
Here is an example:
import numpy as np
import xarray as xr
from datetime import datetime, timedelta
NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
On my system this code crashes with NTIMES=48, but completes without an error with NTIMES=24. The output with NTIMES=48 is:
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
I have tried the other engine that handles NETCDF4 in xarray, engine='h5netcdf', and also got a segfault.
A quick workaround seems to be to use the local filesystem to write the NetCDF file and then move the complete file to S3.
import shutil

ds.to_netcdf(path='/tmp/test_netcdf.nc', format='NETCDF4', mode='w')
shutil.move('/tmp/test_netcdf.nc', '/my_s3_fs/test_netcdf.nc')
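A slightly more general form of the same workaround, as a sketch (the helper name and the use of tempfile are my own choices, not from the original report): write to fast local storage first, then move the finished file onto the s3fs mount. shutil.move copies and deletes across filesystem boundaries, so only a complete file ever appears at the destination.

import os
import shutil
import tempfile

def to_netcdf_via_local(ds, dest, **kwargs):
    # Hypothetical helper: write the file on local disk, then move it onto
    # the s3fs mount once it is complete.
    fd, tmp_path = tempfile.mkstemp(suffix='.nc')
    os.close(fd)
    try:
        ds.to_netcdf(path=tmp_path, format='NETCDF4', mode='w', **kwargs)
        shutil.move(tmp_path, dest)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

to_netcdf_via_local(ds, '/my_s3_fs/test_netcdf.nc')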
There are several pieces of software involved here: the xarray package (0.16.1), netCDF4 (1.5.4), HDF5 (1.10.6), and s3fs (1.79). If this is a bug not in my code but in the underlying libraries, it is most likely not an xarray bug; but since it fails with both NETCDF4 engines, I decided to report it here.
What did you expect to happen?
With NTIMES=24 I am getting a file /my_s3_fs/test_netcdf.nc of about 7.8 MBytes. With NTIMES=36 I get an empty file. I would expect this code to run without a segfault and produce a nonempty file.
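For reference, a back-of-the-envelope check of my own (not part of the original report): the raw data volume of v1 scales linearly with NTIMES, which matches the ~7.8 MByte file observed at NTIMES=24.

# Each time step of v1 is a 201 x 201 array of float64 (8 bytes per value).
frame_bytes = 201 * 201 * 8
for ntimes in (24, 36, 48):
    print(ntimes, round(ntimes * frame_bytes / 1e6, 1), 'MB')
# 24 -> ~7.8 MB (matches the observed file), 36 -> ~11.6 MB, 48 -> ~15.5 MB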
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
from datetime import datetime, timedelta
NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:43:00) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: 22.9.0
pytest: 6.1.1
IPython: 7.18.1
sphinx: None
That's quite an old version of xarray! Could we confirm it has similar results on a more recent version?
Can confirm the issue with xarray 2022.6.0 and dask 2022.9.2, the latest versions available on conda-forge. The issue might be related to the netcdf4 and hdf5 libraries; I will try to update those as well.
Update: libnetcdf, netcdf4, and hdf5 are also at their latest versions available on conda-forge.
Thanks @d1mach. Could it be related to https://github.com/pydata/xarray/issues/7136?
Adding a gdb stack trace (stackrace.txt) from a core file obtained with
docker run -v /mnt/fs:/my_s3_fs -it --rm --ulimit core=-1 --privileged netcdf:latest /bin/bash
and
sudo sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t
python mcve.py
If this crashes with both netcdf4 and h5netcdf, this might be a bug in the libhdf5 library. If we can manage to reduce this to use just h5py (or netCDF4), it should be suitable for reporting on their issue tracker, and those libraries can then push it further to libhdf5 (otherwise, if you're up for investigating / debugging the C library, you could also report to libhdf5 directly).
As for the MCVE: I wonder if we can trim it a bit. Can you reproduce with:
import xarray as xr
import pandas as pd
N_TIMES = 48
time_vals = pd.date_range("2022-10-06", freq="20 min", periods=N_TIMES)
ds = xr.Dataset({"time": ("T", time_vals)})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")
or, if it is important to have bytes:
ds2 = ds.time.dt.strftime("%Y%m%d%H%M%S").str.encode("utf-8").to_dataset()
ds2.to_netcdf(...)
It would also be interesting to know whether this happens only for data variables, or whether coordinates have the same effect (use ds2 instead of ds if bytes are important):
ds3 = ds.set_coords("time")
ds3.to_netcdf(...)
Will try to reproduce this with h5py. For the bug to show up the file has to be large enough; that is why my example has a 2D array variable alongside the time dimension. With just the time dimension the script completes without an error. All three cases work without an error: ds.to_netcdf(), ds2.to_netcdf(), and ds3.to_netcdf().
Which ones fail if you add the 3D variable?
The first one results in a segfault:
import numpy as np
import xarray as xr
import pandas as pd
N_TIMES = 48
time_vals = pd.date_range("2022-10-06", freq="20 min", periods=N_TIMES)
ds = xr.Dataset({"time": ("T", time_vals), 'd1': (["T", "x", "y"], np.zeros((len(time_vals), 201,201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")
Not sure how to add the 3D var to the second dataset.
With this:
ds2 = ds.time.dt.strftime("%Y%m%d%H%M%S").str.encode("utf-8").to_dataset().assign(d1=ds.d1)
but we don't really need to check that if the first dataset already fails.
Now I'd probably check whether it's just the size that makes it fail (i.e. remove "time" from ds and keep just d1, maybe increasing its size if it does not fail as-is), or whether it depends on the dtype (i.e. set time_vals to np.arange(N_TIMES, dtype=int)).
It seems that we need the time variable to reproduce the problem. The following code does not fail:
import numpy as np
import xarray as xr
import pandas as pd
N_TIMES = 64
ds = xr.Dataset({"d1": (["T", "x", "y"], np.zeros((N_TIMES, 201, 201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")
Okay, then does changing the dtype do anything? I.e. does this only happen with datetime64 / bytes, or do int / float / str also fail?
The datatype does not seem to matter, but the two variables are required to get a segfault. The following, with just floats, produces a segfault:
import numpy as np
import xarray as xr
N_TIMES = 48
ds = xr.Dataset({"time": ("T", np.zeros(N_TIMES)), "d1": (["T", "x", "y"], np.zeros((N_TIMES, 201, 201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")
Great, good to know. Can you try this with h5py:
import h5py

N_TIMES = 48

with h5py.File("test.nc", mode="w") as f:
    time = f.create_dataset("time", (N_TIMES,), dtype="i")
    time[:] = 0
    d1 = f.create_dataset("d1", (N_TIMES, 201, 201), dtype="f")
    d1[:] = 0
I had to change ints and floats to doubles to reproduce the issue:
import h5py

N_TIMES = 48

with h5py.File("/my_s3_fs/test.nc", mode="w") as f:
    time = f.create_dataset("time", (N_TIMES,), dtype="d")
    time[:] = 0
    d1 = f.create_dataset("d1", (N_TIMES, 201, 201), dtype="d")
    d1[:] = 0
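To map out which sizes crash, a sketch of my own (not from the thread; the mount path and the list of sizes are assumptions): run each trial in a subprocess so that a segfault in libhdf5 only kills the child process, and report the exit status per size.

import subprocess
import sys
import textwrap

# The same minimal h5py reproducer, parameterised by N_TIMES via argv.
TRIAL = textwrap.dedent('''\
    import sys
    import h5py
    n = int(sys.argv[1])
    with h5py.File("/my_s3_fs/test.nc", mode="w") as f:
        f.create_dataset("time", (n,), dtype="d")[:] = 0
        f.create_dataset("d1", (n, 201, 201), dtype="d")[:] = 0
''')

for n in (24, 36, 48, 64):
    result = subprocess.run([sys.executable, "-c", TRIAL, str(n)])
    # A negative return code means the child died from a signal (-11 is SIGSEGV).
    status = "ok" if result.returncode == 0 else f"crashed ({result.returncode})"
    print(f"N_TIMES={n}: {status}")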
Since we have eliminated xarray with this, you should be able to submit an issue to the h5py issue tracker while mentioning that this is probably a bug in libhdf5, since netcdf4 also fails with the same error (and you can also link this issue for more information).
Closing as an upstream issue; please reopen if that's mistaken.