
Very frequent segfaults with the new `netCDF4=1.6.1`

Open valeriupredoi opened this issue 2 years ago • 32 comments

Heads up guys, we are seeing some very frequent segfaults in our CI when we have the new, hours-old, netCDF4=1.6.1 in our environment. It's most probably due to it, since HDF5 has been at 1.12.2 for a while now (more than a month), and with netCDF4=1.6.0 all works fine (with all other packages staying at the same version and build hash). Apologies if this turns out to be due to a different package, but better safe than sorry in terms of a forewarning. Cheers muchly :beer:

valeriupredoi avatar Sep 15 '22 15:09 valeriupredoi

Looks like you are using conda-forge's netcdf4. Maybe open an issue at https://github.com/conda-forge/netcdf4-feedstock instead.

PS: could you also test the wheels just to be sure they are OK?

ocefpaf avatar Sep 15 '22 15:09 ocefpaf

@ocefpaf good call, mate! Will do so, cheers :beer:

valeriupredoi avatar Sep 15 '22 15:09 valeriupredoi

We're also having issues on yt with the windows wheels for version 1.6.1. Namely, h5py is raising a warning at import. See https://github.com/yt-project/yt/issues/4128

neutrinoceros avatar Sep 15 '22 17:09 neutrinoceros

Same with us - see SciTools/iris#4968. Sometimes manifests as segfaults, sometimes as crashed GHA workers (maybe segfault underneath).

@ocefpaf I've confirmed the same problems appear when installing from PyPI OR from conda-forge.

trexfeathers avatar Sep 16 '22 12:09 trexfeathers

> @ocefpaf I've confirmed the same problems appear when installing from PyPI OR from conda-forge.

Many thanks, I was about to test the PyPI version - cheers for testing, that saves me some lunch time :grin:

valeriupredoi avatar Sep 16 '22 12:09 valeriupredoi

@trexfeathers what platforms are failing when you tested the PyPI wheels? I'm particularly interested in the Windows wheels for 1.6.1 b/c those are built in a different way now.

ocefpaf avatar Sep 16 '22 13:09 ocefpaf

> @trexfeathers what platforms are failing when you tested the PyPI wheels? I'm particularly interested in the Windows wheels for 1.6.1 b/c those are built in a different way now.

GHA's ubuntu-latest. Despite several attempts we are yet to get Iris' test suite working on Windows.

trexfeathers avatar Sep 16 '22 13:09 trexfeathers

@ocefpaf to clarify, on yt we're testing with PyPI wheels for all three major platforms, and we're only seeing issues on windows.

neutrinoceros avatar Sep 16 '22 14:09 neutrinoceros

Ah, I realized I've not clarified it myself: we get segfaults from a conda-forge install on both ubuntu-latest and OSX-latest on GHA, and on ubuntu on CircleCI (off conda-forge as well). No Windows testing for us, since we've not been able to get a working install of our packages there either, snif but not snif :grin:

valeriupredoi avatar Sep 16 '22 14:09 valeriupredoi

So here's a pretty interesting case study that may lead to fixing this current issue, and also a rarely recurring issue that @agstephens and I have noticed in the past with older (and stable, bullet-proof) versions of netCDF4:

  • our tests fail with a multitude of HDF-related SegFaults, complaints about closing datasets, etc.; anyway:
  • I tried recreating the netcdf sample data that the tests fail on, but I couldn't, since the same segfaults cropped up while trying to create them, so...
  • I simply moved them out and back in the location where they should be, and...
  • no more Segfaults (yes I ran quite a few iterations of the test so I am 100% sure they don't fail)

My colleague Ag noticed the same behaviour, way back, on very very few occasions: an HDF Segfault on a certain file would automagically disappear if we moved the file out and back into its location. We blamed it on the FS back then, but thinking in retrospect, it could be the same issue here?

valeriupredoi avatar Sep 16 '22 15:09 valeriupredoi

> and we're only seeing issues on windows.

~~@jswhit it may be prudent to yank those wheels until we figure out what is going on. They do pass the tests in the repo but are not holding up well in the "production test" :-/~~

However, the other platforms are failing in other CIs so this is quite confusing and we'll need the reports here to help us sort this out.


Edit: @neutrinoceros your report upstream is about h5py and not netcdf4, right? xref: https://github.com/yt-project/yt/issues/4128

ocefpaf avatar Sep 16 '22 17:09 ocefpaf

We're only seeing a warning and yes, it's triggered from h5py. I'm assuming it's the same underlying issue, but that's a wild guess.

neutrinoceros avatar Sep 16 '22 17:09 neutrinoceros

> We're only seeing a warning and yes, it's triggered from h5py. I'm assuming it's the same underlying issue, but that's a wild guess.

Most likely not. Folks here are experiencing segfaults with the latest netcdf4-python. The h5py warning in your CI is just b/c one version of hdf5 was used to build but another one is used to run. In my experience that is OK 99.99% of the cases.
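For anyone wanting to check which HDF5 each package reports, here is a minimal sketch. The helper name `hdf5_versions` is made up for illustration; it only reads version attributes that netCDF4 and h5py expose, and it skips any package that is not installed.

```python
def hdf5_versions():
    """Collect the HDF5 version reported by netCDF4 and h5py, if installed.

    Hypothetical helper for triage; it only reads version attributes.
    """
    report = {}

    try:
        import netCDF4
        report["netCDF4"] = {
            "package": netCDF4.__version__,
            "hdf5": netCDF4.__hdf5libversion__,
            "netcdf-c": netCDF4.__netcdf4libversion__,
        }
    except ImportError:
        # Package not installed in this environment.
        report["netCDF4"] = None

    try:
        import h5py
        report["h5py"] = {
            "package": h5py.__version__,
            "hdf5": h5py.version.hdf5_version,
        }
    except ImportError:
        report["h5py"] = None

    return report


if __name__ == "__main__":
    for name, info in hdf5_versions().items():
        print(name, info)
```

If the two `hdf5` entries differ, that would be consistent with the build/run mismatch described above, though it does not by itself say anything about the segfaults.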

ocefpaf avatar Sep 16 '22 18:09 ocefpaf

Should I file another issue ?

neutrinoceros avatar Sep 16 '22 20:09 neutrinoceros

> Should I file another issue ?

Probably not. It'll be closed b/c it is a known warning that is mostly harmless.

ocefpaf avatar Sep 16 '22 22:09 ocefpaf

It's not clear to me whether this is an issue with all the wheels for 1.6.1, or just the windows wheels? It's hard to see why the linux and macosx wheels would be a problem, since they are built exactly the same way as they were for 1.6.0. The most significant code change in netcdf4-python in 1.6.1 is PR #1181, but I don't see how this could cause segfaults.

jswhit avatar Sep 17 '22 01:09 jswhit

I want to share here a workaround I've been using to deal with the netcdf4 python package issue for my projects. After installing all other dependencies, I reinstall netcdf4-python from source with the following (this has solved my issues):

python -m pip install --upgrade --force-reinstall --no-deps --no-cache-dir netcdf4 --no-binary netcdf4

Echoing @jswhit, I mentioned in another issue that I don't think the problem is the code, but the wheel-building process, since installing from sources works perfectly fine.

In any case, this is a really mysterious problem!

Zeitsperre avatar Sep 19 '22 20:09 Zeitsperre

@Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?

jswhit avatar Sep 20 '22 02:09 jswhit

@jswhit would https://github.com/Unidata/netcdf4-python/issues/1192#issuecomment-1249475852 give you some sort of a clue what might trigger those intermittent but rather frequent SegFaults? It's a bit black magic to me at the moment :grin:

valeriupredoi avatar Sep 20 '22 12:09 valeriupredoi

Hey guys, it appears @Zeitsperre is correct and this whole segfaulting issue is caused by some installation problem: I went the conda-forge way and did a couple of black-box tests, see below

  • I installed netcdf4=1.6.1 as part of our environment, as per usual, and as in the CI case where we noticed the SegFaults, and ran the test that has the tendency to segfault, with 0 and 2 processes: in both cases the test failed (either S: segfault or H: HDF error, see below) 4 out of 12 times, so 8/24 in total; note that this is done on a stable single machine with no other load or shared access;
  • then I downgraded to 1.6.0, and, as expected, nothing poops the bed; note that apart from netcdf4 no other lib has changed version, or build hash;
  • then I re-upgraded (again, just netcdf4 changed, no other dependency) and ran the same experiment, this time around noticing a visibly reduced frequency of fails at 4/24;

Could it be that the conda compilers are not preserving the right flags or compilation order for you, specifically for 1.6.1? I know (extreme) cases where people need to compile numpy themselves since the conda-forge supplied version is giving them headaches due to numerical precision deltas from version to version, but that's normal(ish). Anyways, here are my test results:

conda-forge install via mamba

netcdf4=1.6.1

  • pytest -n 0 result: SSS000000H00 4/12 fails
  • pytest -n 2 result: SS00H000H000 4/12 fails

downgrade to 1.6.0:

  - netcdf4    1.6.1  nompi_py310h55e1e36_100  conda-forge                    
  + netcdf4    1.6.0  nompi_py310h55e1e36_102  conda-forge/linux-64

(nothing else changed in the conda env)

netcdf4=1.6.0

  • pytest -n 0 result: 0000... no fails
  • pytest -n 2 result: 0000... no fails

reupgrade:

  - netcdf4    1.6.0  nompi_py310h55e1e36_102  conda-forge                    
  + netcdf4    1.6.1  nompi_py310h55e1e36_100  conda-forge/linux-64

netcdf4=1.6.1

  • pytest -n 0 result: 00000000H000 1/12 fails
  • pytest -n 2 result: S0000000SS00 3/12 fails

Legend

  • 0: pass OK
  • S: segfault
  • H -> HDF error:
tests/sample_data/multimodel_statistics/test_multimodel.py:237: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/sample_data/multimodel_statistics/test_multimodel.py:197: in multimodel_regression_test
    result = multimodel_test(cubes, statistic=statistic, span=span)
tests/sample_data/multimodel_statistics/test_multimodel.py:178: in multimodel_test
    result = multi_model_statistics(products=cubes,
esmvalcore/preprocessor/_multimodel.py:493: in multi_model_statistics
    return _multicube_statistics(
esmvalcore/preprocessor/_multimodel.py:388: in _multicube_statistics
    result_cube = _compute_eager(aligned_cubes,
esmvalcore/preprocessor/_multimodel.py:319: in _compute_eager
    _ = [cube.data for cube in cubes]  # make sure the cubes' data are realized
esmvalcore/preprocessor/_multimodel.py:319: in <listcomp>
    _ = [cube.data for cube in cubes]  # make sure the cubes' data are realized
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/cube.py:2315: in data
    return self._data_manager.data
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_data_manager.py:206: in data
    result = as_concrete_data(self._lazy_array)
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_lazy_data.py:252: in as_concrete_data
    (data,) = _co_realise_lazy_arrays([data])
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/_lazy_data.py:215: in _co_realise_lazy_arrays
    computed_arrays = da.compute(*arrays)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/base.py:600: in compute
    results = schedule(dsk, keys, **kwargs)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/threaded.py:89: in get
    results = get_async(
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:511: in get_async
    raise_exception(exc, tb)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:319: in reraise
    raise exc
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/local.py:224: in execute_task
    result = _execute_task(task, data)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/optimization.py:990: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/utils.py:71: in apply
    return func(*args, **kwargs)
../miniconda3/envs/flake8/lib/python3.10/site-packages/dask/array/core.py:122: in getter
    c = a[b]
../miniconda3/envs/flake8/lib/python3.10/site-packages/iris/fileformats/netcdf.py:418: in __getitem__
    dataset.close()
src/netCDF4/_netCDF4.pyx:2624: in netCDF4._netCDF4.Dataset.close
    ???
src/netCDF4/_netCDF4.pyx:2587: in netCDF4._netCDF4.Dataset._close
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: NetCDF: HDF error

src/netCDF4/_netCDF4.pyx:2028: RuntimeError
------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------
HDF5-DIAG: Error detected in HDF5 (1.12.2) MPI-process 0:
  #000: H5D.c line 320 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
============================================
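
The tallies above can be recomputed from the result strings with a small sketch like the one below. The function name `summarize` and the dictionary keys are hypothetical; the codes follow the legend (0 = pass, S = segfault, H = HDF error).

```python
from collections import Counter

# Failure codes from the legend above; "0" means the run passed.
FAILURE_CODES = {"S": "segfault", "H": "HDF error"}


def summarize(runs):
    """Count passes and failures in a result string such as 'SSS000000H00'."""
    counts = Counter(runs)
    unknown = set(counts) - set(FAILURE_CODES) - {"0"}
    if unknown:
        raise ValueError(f"unexpected result codes: {sorted(unknown)}")
    return {
        "runs": len(runs),
        "failures": sum(counts[code] for code in FAILURE_CODES),
        "segfault": counts["S"],
        "hdf_error": counts["H"],
    }


if __name__ == "__main__":
    # Result strings copied from the netcdf4=1.6.1 runs above.
    for label, runs in [
        ("first install, -n 0", "SSS000000H00"),
        ("first install, -n 2", "SS00H000H000"),
        ("re-upgrade, -n 0", "00000000H000"),
        ("re-upgrade, -n 2", "S0000000SS00"),
    ]:
        print(label, summarize(runs))
```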

valeriupredoi avatar Sep 20 '22 13:09 valeriupredoi

> @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?

I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.

Zeitsperre avatar Sep 20 '22 14:09 Zeitsperre

> > @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?
>
> I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.

Are you using wheels from pypi or conda to install?

jswhit avatar Sep 20 '22 14:09 jswhit

> Are you using wheels from pypi or conda to install?

Folks, please, everyone that is using the package from conda-forge post your issues and comments in https://github.com/conda-forge/netcdf4-feedstock/issues/141 and not here. Let's help out with the triage so we can solve this!

ocefpaf avatar Sep 20 '22 14:09 ocefpaf

cheers @ocefpaf - I'll link my comment above with the test results to the feedstock issue, good point! I am still not 100% sure whether it's the conda or PyPI installations, or the code itself, that's causing this; that's why I was primarily posting guff here, so the experts may be able to get some clues :beer:

valeriupredoi avatar Sep 20 '22 14:09 valeriupredoi

> > > @Zeitsperre are you having a problem with windows wheels only, or also the linux and macosx wheels?
>
> > I'm only testing on Linux systems, so nothing for me to report on Windows or macOS.
>
> Are you using wheels from pypi or conda to install?

The PyPI wheels have not been working for me, but the conda binaries have been fine for me on Linux.

Zeitsperre avatar Sep 20 '22 14:09 Zeitsperre

@valeriupredoi reported at https://github.com/conda-forge/netcdf4-feedstock/issues/141 that his segfaults were all related to the use of file caching, and if the file is read directly from disk the segfaults go away. Are others experiencing segfaults also using some sort of caching of netCDF4.Dataset objects?

jswhit avatar Sep 21 '22 16:09 jswhit

From the discussion at https://github.com/conda-forge/netcdf4-feedstock/issues/141, it looks like at least some of the segfaults are related to using netcdf4-python within threads. netcdf-c is not thread-safe, and releasing the GIL on all netcdf-c calls (introduced in 1.6.1) has increased the probability of segfaults when threads are used.
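
One way to avoid overlapping netcdf-c calls from threaded code is to put every netCDF access behind a single process-wide lock. This is a sketch of that general pattern, not something netcdf4-python provides: `NETCDF_LOCK`, `serialized`, and `read_variable` are hypothetical names, and it only helps if all netCDF access in the process goes through the same lock.

```python
import threading
from functools import wraps

# One process-wide lock: only one thread at a time may run netCDF code.
# An RLock lets a serialized function call another serialized function.
NETCDF_LOCK = threading.RLock()


def serialized(func):
    """Decorator that runs func while holding NETCDF_LOCK."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with NETCDF_LOCK:
            return func(*args, **kwargs)
    return wrapper


@serialized
def read_variable(path, name):
    """Open a dataset, read one variable fully, and close the dataset."""
    import netCDF4  # imported lazily so this sketch loads without netCDF4

    with netCDF4.Dataset(path) as ds:
        return ds.variables[name][:]
```

Holding the lock across open, read, and close keeps the whole sequence in one thread at a time, at the cost of losing any parallelism in I/O.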

jswhit avatar Sep 22 '22 17:09 jswhit

Folks using iris and hitting this issue: you can work around it by setting dask to single-threaded with:

import dask
dask.config.set(scheduler="single-threaded")

instead of pinning to netcdf4!=1.6.1.
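
If changing the scheduler globally is too broad, the same setting can be scoped to a single computation, since `dask.config.set` also works as a context manager. A minimal sketch follows; the helper name `compute_in_calling_thread` is made up for illustration.

```python
import threading


def compute_in_calling_thread(*lazy_objects):
    """Compute dask collections with the single-threaded scheduler, for this call only."""
    import dask  # imported lazily so this sketch loads without dask installed

    # Used as a context manager, the override is undone when the block exits.
    with dask.config.set(scheduler="single-threaded"):
        return dask.compute(*lazy_objects)


if __name__ == "__main__":
    import dask

    # Each task reports the identifier of the thread it ran on.
    tasks = [dask.delayed(threading.get_ident)() for _ in range(4)]
    idents = compute_in_calling_thread(*tasks)
    print("all tasks ran in the calling thread:",
          set(idents) == {threading.get_ident()})
```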

ocefpaf avatar Oct 04 '22 18:10 ocefpaf

There is an experimental PR in netcdf-c that makes the C library threadsafe. This should fix many (all?) of the problems reported here, but won't be available in a released version for some time.

jswhit avatar Oct 07 '22 21:10 jswhit

> There is an experimental PR in netcdf-c that makes the C library threadsafe. This should fix many (all?) of the problems reported here, but won't be available in a released version for some time.

@jswhit would you expect more releases of NetCDF4 before this feature is released?

trexfeathers avatar Oct 10 '22 08:10 trexfeathers