
Segfault writing large netcdf files to s3fs

Open d1mach opened this issue 1 year ago • 16 comments

What happened?

It seems netCDF4 does not currently work well with s3fs, the FUSE filesystem layer over S3-compatible storage: writing fails with both the default netcdf4 engine and with h5netcdf.

Here is an example

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')

On my system this code crashes with NTIMES=48, but completes without an error with NTIMES=24.

The output with NTIMES=48 is

There are 1 HDF5 objects open!

Report: open objects on 72057594037927936
Segmentation fault (core dumped)

I have tried the other engine that handles NETCDF4 in xarray, engine='h5netcdf', and also got a segfault.

A quick workaround is to write the NetCDF file to the local filesystem and then move the completed file to S3:

import shutil

ds.to_netcdf(path='/tmp/test_netcdf.nc', format='NETCDF4', mode='w')
shutil.move('/tmp/test_netcdf.nc', '/my_s3_fs/test_netcdf.nc')
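A slightly more defensive sketch of the same workaround, using tempfile so that concurrent writers cannot collide on a fixed /tmp name (write_then_move is an illustrative helper, not part of xarray):

```python
import os
import shutil
import tempfile

def write_then_move(write_fn, dest_path):
    """Write a file locally first, then move the finished file to dest_path.

    write_fn is any callable that writes a file at the path it is given,
    e.g. lambda p: ds.to_netcdf(path=p, format="NETCDF4", mode="w").
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".nc")
    os.close(fd)
    try:
        write_fn(tmp_path)
        # shutil.move copies across filesystems, so this works onto a FUSE mount
        shutil.move(tmp_path, dest_path)
    finally:
        # clean up the temp file if the write or the move failed partway
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```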

There are several pieces of software involved here: the xarray package (0.16.1), netcdf4 (1.5.4), HDF5 (1.10.6), and s3fs (1.79). If this is not a bug in my code but in the underlying libraries, most likely it is not an xarray bug, but since it fails with both Netcdf4 engines, I decided to report it here.

What did you expect to happen?

With NTIMES=24 I get a file /my_s3_fs/test_netcdf.nc of about 7.8 MB. With NTIMES=36 I get an empty file. I would expect this code to run without a segfault and produce a nonempty file.
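As a back-of-the-envelope check (my arithmetic, not from the thread), the 7.8 MB is almost exactly the raw float64 payload of v1, so the NTIMES=24 file appears to be complete:

```python
# Estimate the uncompressed size of the NTIMES=24 dataset.
NTIMES = 24
v1_bytes = NTIMES * 201 * 201 * 8   # float64 array with shape (Time, 201, 201)
times_bytes = NTIMES * 14           # 14-byte strings like b'20221006000000'
total_mb = (v1_bytes + times_bytes) / 1e6
print(round(total_mb, 1))  # -> 7.8, matching the observed file size before HDF5 overhead
```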

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

There are 1 HDF5 objects open!

Report: open objects on 72057594037927936
Segmentation fault (core dumped)
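An aside on that log line (my observation, not from the thread): the object id 72057594037927936 is exactly 2**56. HDF5 1.10 encodes the identifier type in the top bits of the id, and a type code of 1 corresponds to a file identifier, which suggests the one leaked object is the file handle itself:

```python
# The "open objects" id from the log decodes to a file-type HDF5 identifier.
obj_id = 72057594037927936
H5I_FILE = 1    # HDF5's type code for file identifiers
TYPE_SHIFT = 56 # assumption: HDF5 1.10 keeps the id type in bits 56 and up
print(obj_id == H5I_FILE << TYPE_SHIFT)  # -> True
```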

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:43:00) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: 22.9.0
pytest: 6.1.1
IPython: 7.18.1
sphinx: None

d1mach avatar Oct 08 '22 16:10 d1mach

That's quite an old version of xarray! Could we confirm it has similar results on a more recent version?

max-sixty avatar Oct 08 '22 17:10 max-sixty

Can confirm the issue with xarray 2022.6.0 and dask 2022.9.2, the latest versions available on conda-forge. The issue might be related to the netcdf4 and hdf5 libraries; I will try to update those as well.

d1mach avatar Oct 08 '22 17:10 d1mach

libnetcdf, netcdf4, and hdf5 are at their latest versions available on conda-forge.

d1mach avatar Oct 08 '22 17:10 d1mach
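For reference, the "INSTALLED VERSIONS" report in the issue body comes from xarray's built-in helper, which is the easiest way to confirm what versions an environment actually resolved to:

```python
import xarray as xr

# prints the same "INSTALLED VERSIONS" block as shown in the issue body,
# including libhdf5 and libnetcdf versions
xr.show_versions()
```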

Thanks @d1mach . Could it be related to https://github.com/pydata/xarray/issues/7136 ?

max-sixty avatar Oct 08 '22 17:10 max-sixty

Adding a gdb backtrace (stackrace.txt) from a core file obtained with

docker run -v /mnt/fs:/my_s3_fs -it --rm --ulimit core=-1 --privileged netcdf:latest /bin/bash

and

sudo sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t
python mcve.py

d1mach avatar Oct 09 '22 10:10 d1mach

if this crashes with both netcdf4 and h5netcdf, this might be a bug in the libhdf5 library. If we can manage to reduce this to use just h5py (or netCDF4), it should be suitable for reporting on their issue tracker, and those libraries can then push it further to libhdf5 (otherwise, if you're up for investigating / debugging the C library, you could also report to libhdf5 directly).

As for the MCVE: I wonder if we can trim it a bit. Can you reproduce with

import xarray as xr
import pandas as pd

N_TIMES = 48
time_vals = pd.date_range("2022-10-06", freq="20 min", periods=N_TIMES)
ds = xr.Dataset({"time": ("T", time_vals)})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")

or, if it is important to have bytes:

ds2 = ds.time.dt.strftime("%Y%m%d%H%M%S").str.encode("utf-8").to_dataset()
ds2.to_netcdf(...)

also it would be interesting to know if this happens only for data variables, or if coordinates have the same effect (use ds2 instead of ds if bytes are important):

ds3 = ds.set_coords("time")
ds3.to_netcdf(...)

keewis avatar Oct 09 '22 12:10 keewis

Will try to reproduce this with h5py. For the bug to show up the file has to be large enough; that is why my example has a 3D array variable alongside the time dimension. With just the time dimension the script completes without an error. All three cases run without an error: ds.to_netcdf(), ds2.to_netcdf(), and ds3.to_netcdf().

d1mach avatar Oct 09 '22 13:10 d1mach

which ones fail if you add the 3D variable?

keewis avatar Oct 09 '22 13:10 keewis

The first one results in a segfault:

import numpy as np
import xarray as xr
import pandas as pd

N_TIMES = 48
time_vals = pd.date_range("2022-10-06", freq="20 min", periods=N_TIMES)
ds = xr.Dataset({"time": ("T", time_vals), "d1": (["T", "x", "y"], np.zeros((len(time_vals), 201, 201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")

Not sure how to add the 3D var to the second dataset.

d1mach avatar Oct 09 '22 13:10 d1mach

with this:

ds2 = ds.time.dt.strftime("%Y%m%d%H%M%S").str.encode("utf-8").to_dataset().assign(d1=ds.d1)

but we don't really need to check if the first dataset already fails.

Now I'd probably check if it's just the size that makes it fail (i.e. remove "time" from ds and keep just d1, maybe increasing its size if it does not fail as-is), or if it depends on the dtype (i.e. set time_vals to np.arange(N_TIMES, dtype=int)).

keewis avatar Oct 09 '22 13:10 keewis

It seems that we need the time variable to reproduce the problem. The following code does not fail:

import numpy as np
import xarray as xr
import pandas as pd

N_TIMES = 64
ds = xr.Dataset({"d1": (["T", "x", "y"], np.zeros((N_TIMES, 201, 201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")

d1mach avatar Oct 09 '22 13:10 d1mach

okay, then does changing the dtype do anything? I.e. does this only happen with datetime64 / bytes, or do int / float / str also fail?

keewis avatar Oct 09 '22 14:10 keewis

The datatype does not seem to matter, but both variables are required to trigger the segfault. The following, with just floats, produces a segfault:

import numpy as np
import xarray as xr

N_TIMES = 48
ds = xr.Dataset({"time": ("T", np.zeros(N_TIMES)), "d1": (["T", "x", "y"], np.zeros((N_TIMES, 201, 201)))})
ds.to_netcdf(path="/my_s3_fs/test_netcdf.nc", format="NETCDF4", mode="w")

d1mach avatar Oct 09 '22 14:10 d1mach

great, good to know. Can you try this with h5py:

import h5py

N_TIMES = 48
with h5py.File("test.nc", mode="w") as f:
    time = f.create_dataset("time", (N_TIMES,), dtype="i")
    time[:] = 0

    d1 = f.create_dataset("d1", (N_TIMES, 201, 201), dtype="f")
    d1[:] = 0

keewis avatar Oct 09 '22 14:10 keewis

I had to change the ints and floats to doubles to reproduce the issue:

import h5py

N_TIMES = 48
with h5py.File("/my_s3_fs/test.nc", mode="w") as f:
    time = f.create_dataset("time", (N_TIMES,), dtype="d")
    time[:] = 0

    d1 = f.create_dataset("d1", (N_TIMES, 201, 201), dtype="d")
    d1[:] = 0

d1mach avatar Oct 09 '22 14:10 d1mach

Since we have eliminated xarray with this, you should be able to submit an issue to the h5py issue tracker, mentioning that this is probably a bug in libhdf5 since netCDF4 also fails with the same error (you can also link this issue for more information).

keewis avatar Oct 09 '22 14:10 keewis