cf-python

Non-deterministic segmentation faults throughout test suite

Open sadielbartholomew opened this issue 4 years ago • 2 comments

Some months ago seg faults appeared en masse, both locally (for multiple testers) and on GitHub Actions. They hit various modules in the test suite, but any individual module is affected only sporadically. These faults have persisted. Though I hoped, and tried somewhat, to investigate and fix the source discreetly (we don't have any external contributors ATM so it didn't seem urgent to broadcast), it is proving quite difficult to pinpoint, so an Issue to register this is overdue.

Details about the observed seg faults are provided below. I intend this to become an evidence log of sorts that will hopefully guide us to the root of the problem.

ESMValGroup/ESMValCore#644 is possibly relevant because it indicates similar symptoms in ESMValCore. I had a chat with some of the ESMValGroup devs today to see if we can help each other in these potentially-linked investigations, so this Issue is also to assist them with comparisons.

General details

  1. We have not seen, or heard of anyone else seeing, any seg faults during actual cf-python usage, so they only seem to occur when running some or all of the tests;
  2. The seg faulting occurs both on Actions and locally for both developers who have tried it on their machines.

Affected test modules and methods

These are the test modules which we have observed to seg fault at least once, though in most cases they do not always seg fault for a given environment (OS, conda and pip libraries etc.) and Python version. (I've been running for filename in test_*.py; do python $filename; done so that each module runs in its own process and a single seg fault does not stop the whole experiment.) The affected modules, with the specific methods where known, are:

  • test_Field: test_Field_close
  • test_pp: ? [specific method(s) unknown]
  • test_gathering: ?
  • test_CoordinateReference: ?
  • test_dsg: ?
  • test_groups: ?
  • test_read_write: test_read_write_format
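For anyone wanting to repeat the per-module runs and get a list like the one above automatically, here is a minimal Python sketch of the same idea as the shell loop. It is not part of the cf-python test suite: the function names are hypothetical, and it assumes a POSIX system, where `subprocess` reports a process killed by signal N as a return code of -N.

```python
import signal
import subprocess
import sys
from pathlib import Path


def run_module(path):
    """Run one test module in a fresh interpreter and return its exit status."""
    # -X faulthandler makes the child dump a Python traceback on a fatal signal.
    # The module is run from its own directory, as the shell loop does.
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", path.name],
        cwd=path.parent,
    )
    return result.returncode


def died_from_segfault(returncode):
    """Assumes POSIX: death by SIGSEGV shows up as -signal.SIGSEGV."""
    return returncode == -signal.SIGSEGV


def find_segfaulting_modules(directory=".", pattern="test_*.py"):
    """Return the names of the modules whose process was killed by SIGSEGV."""
    crashed = []
    for path in sorted(Path(directory).glob(pattern)):
        if died_from_segfault(run_module(path)):
            crashed.append(path.name)
    return crashed
```

Because the faults are sporadic, calling `find_segfaulting_modules()` several times and tallying the results per module is likely to be more informative than a single pass.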

Example seg fault traceback

(Captured using faulthandler which I recently enabled for all of the test modules.)

$ python test_groups.py 
Run date: 2021-01-06 18:21:22.812031
Platform: Linux-4.15.0-54-generic-x86_64-with-glibc2.10 
HDF5 library: 1.10.6 
netcdf library: 4.7.4 
Python: 3.8.5 /home/sadie/anaconda3/envs/cf-env/bin/python
netCDF4: 1.5.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netCDF4/__init__.py
numpy: 1.19.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/numpy/__init__.py
cfdm.core: 1.8.8.0 /home/sadie/cfdm/cfdm/core/__init__.py
cftime: 1.3.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/cftime/__init__.py
netcdf_flattener: 1.2.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netcdf_flattener/__init__.py
cfdm: 1.8.8.0 /home/sadie/cfdm/cfdm/__init__.py

test_groups (__main__.GroupsTest) ... Fatal Python error: Segmentation fault

Current thread 0x00007f1a3fc89740 (most recent call first):
  File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 484 in open
  File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 133 in __getitem__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 264 in __getitem__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 542 in _item
  File "/home/sadie/cfdm/cfdm/data/data.py", line 2491 in last_element
  File "/home/sadie/cfdm/cfdm/data/data.py", line 455 in __str__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 212 in __repr__
  File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 2949 in _create_field
  File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 1355 in read
  File "/home/sadie/cfdm/cfdm/decorators.py", line 189 in verbose_override_wrapper
  File "/home/sadie/cfdm/cfdm/read_write/read.py", line 295 in read
  File "test_groups.py", line 81 in test_groups
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 676 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/runner.py", line 176 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 271 in runTests
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 101 in __init__
  File "test_groups.py", line 418 in <module>
Segmentation fault (core dumped)
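For reference, a trace like the one above comes from the standard-library faulthandler module. A minimal sketch of enabling it at the top of a test module follows; the test class is a hypothetical placeholder and this is not necessarily how the cf-python test modules do it.

```python
import faulthandler
import unittest

# Install handlers so that SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL
# dump the Python traceback to stderr before the process dies.
faulthandler.enable()


class ExampleTest(unittest.TestCase):  # hypothetical placeholder test
    def test_faulthandler_is_on(self):
        self.assertTrue(faulthandler.is_enabled())
```

The same effect is available without editing any file by running `python -X faulthandler test_groups.py` or by setting the `PYTHONFAULTHANDLER` environment variable.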

sadielbartholomew Jan 22 '21 21:01

@sadielbartholomew you may want to think twice before running those types of tests with pytest and xdist (if you guys are planning on switching to that testing infrastructure), see here

valeriupredoi Mar 30 '21 14:03

Thanks @valeriupredoi, I'll take a look at your findings. Sorry I haven't posted here since we all discussed this. I've not had much time to look into it, and indeed have no findings to report myself other than more speculation with some wishy-washy evidence. I'm waiting until I have something more concrete to report. Sigh...

sadielbartholomew Mar 30 '21 14:03

Closing, since we sorted these out quite a while back and forgot to close this.

sadielbartholomew Apr 13 '23 17:04