`test_blockwise_dataframe_io[True-False-hdf]` is flaky on osx CI
I noticed this test periodically failing today due to a PyTables/HDF5 file-locking issue (it may have started earlier).
Traceback
_________________ test_blockwise_dataframe_io[True-False-hdf] __________________
[gw0] darwin -- Python 3.8.12 /Users/runner/miniconda3/envs/test-environment/bin/python
c = <Client: 'tcp://127.0.0.1:50922' processes=2 threads=2, memory=28.00 GiB>
tmpdir = local('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/popen-gw0/test_blockwise_dataframe_io_Tr3')
io = 'hdf', fuse = False, from_futures = True
@pytest.mark.filterwarnings(
"ignore:Running on a single-machine scheduler when a distributed client "
"is active might lead to unexpected results."
)
@pytest.mark.parametrize(
"io",
["parquet-pyarrow", "parquet-fastparquet", "csv", "hdf"],
)
@pytest.mark.parametrize("fuse", [True, False, None])
@pytest.mark.parametrize("from_futures", [True, False])
def test_blockwise_dataframe_io(c, tmpdir, io, fuse, from_futures):
pd = pytest.importorskip("pandas")
dd = pytest.importorskip("dask.dataframe")
df = pd.DataFrame({"x": [1, 2, 3] * 5, "y": range(15)})
if from_futures:
parts = [df.iloc[:5], df.iloc[5:10], df.iloc[10:15]]
futs = c.scatter(parts)
ddf0 = dd.from_delayed(futs, meta=parts[0])
else:
ddf0 = dd.from_pandas(df, npartitions=3)
if io.startswith("parquet"):
if io == "parquet-pyarrow":
pytest.importorskip("pyarrow.parquet")
engine = "pyarrow"
else:
pytest.importorskip("fastparquet")
engine = "fastparquet"
ddf0.to_parquet(str(tmpdir), engine=engine)
ddf = dd.read_parquet(str(tmpdir), engine=engine)
elif io == "csv":
ddf0.to_csv(str(tmpdir), index=False)
ddf = dd.read_csv(os.path.join(str(tmpdir), "*"))
elif io == "hdf":
pytest.importorskip("tables")
fn = str(tmpdir.join("h5"))
> ddf0.to_hdf(fn, "/data*")
dask/tests/test_distributed.py:370:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/dataframe/core.py:1620: in to_hdf
return to_hdf(self, path_or_buf, key, mode, append, **kwargs)
dask/dataframe/io/hdf.py:251: in to_hdf
compute_as_if_collection(
dask/base.py:319: in compute_as_if_collection
return schedule(dsk2, keys, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:2167: in gather
return self.sync(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:311: in sync
return sync(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:378: in sync
raise exc.with_traceback(tb)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:351: in f
result = yield future
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tornado/gen.py:762: in run
value = future.result()
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
dask/dataframe/io/hdf.py:27: in _pd_to_hdf
pd_to_hdf(*args, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/core/generic.py:2606: in to_hdf
pytables.to_hdf(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:277: in to_hdf
with HDFStore(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:561: in __init__
self.open(mode=mode, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:710: in open
self._handle = tables.open_file(self._path, self._mode, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tables/file.py:300: in open_file
return File(filename, mode, title, root_uep, filters, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tables/file.py:750: in __init__
self._g_new(filename, mode, **params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E tables.exceptions.HDF5ExtError: HDF5 error back trace
E
E File "H5F.c", line 620, in H5Fopen
E unable to open file
E File "H5VLcallback.c", line 3502, in H5VL_file_open
E failed to iterate over available VOL connector plugins
E File "H5PLpath.c", line 579, in H5PL__path_table_iterate
E can't iterate over plugins in plugin path '(null)'
E File "H5PLpath.c", line 620, in H5PL__path_table_iterate_process_path
E can't open directory: /Users/runner/miniconda3/envs/test-environment/lib/hdf5/plugin
E File "H5VLcallback.c", line 3351, in H5VL__file_open
E open failed
E File "H5VLnative_file.c", line 97, in H5VL__native_file_open
E unable to open file
E File "H5Fint.c", line 1898, in H5F_open
E unable to lock the file
E File "H5FD.c", line 1625, in H5FD_lock
E driver lock request failed
E File "H5FDsec2.c", line 1002, in H5FD__sec2_lock
E unable to lock file, errno = 35, error message = 'Resource temporarily unavailable'
E
E End of HDF5 error back trace
E
E Unable to open/create file '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/popen-gw0/test_blockwise_dataframe_io_Tr3/h5'
tables/hdf5extension.pyx:486: HDF5ExtError
Log from failing test: https://github.com/dask/dask/runs/5558289646?check_suite_focus=true#step:6:21912
Just to note, a next step here would be to try to reproduce this on an Intel Mac. :)
@mrocklin ran into a similar failure for `test_blockwise_dataframe_io[True-None-hdf]` (note the pytest parametrize values are slightly different)
We might be getting close to the point where we can just drop support for PyTables?
Maybe relevant ( https://github.com/PyTables/PyTables/issues/888#issuecomment-790198987 )?
`test_blockwise_dataframe_io[True-None-hdf]` (same test as @mrocklin) is now failing on ubuntu, python-3.10:
- #9127, this build
- #9128, this build
As @pavithraes notes, this has been happening with regularity on python3.10+ubuntu for the last week+. I see there was a new release of h5py that happened around the same time.
Interestingly, we are using a Lock in the test, but it seems that the filesystem lock doesn't always agree.
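To illustrate how the two mechanisms can disagree, here is a stdlib-only sketch (not dask's or HDF5's actual code path): an application-level lock like dask's `Lock` serializes *tasks*, but HDF5 additionally takes an OS advisory lock on the file itself, and that lock can still be held by a lingering handle when the next writer arrives, surfacing the same "Resource temporarily unavailable" (EAGAIN) error seen in the traceback:

```python
import fcntl
import os
import tempfile

# Writer 1 opens the file and takes an OS advisory lock, analogous to
# what HDF5's sec2 driver does when opening an .h5 file for writing.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
first = open(path, "w")
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)

# Writer 2 believes the file is free (our application-level lock said
# so), but the OS lock from the still-open first handle refuses it
# with EAGAIN -- the same errno HDF5 reports in the CI failure.
second = open(path, "w")
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
    contended = False
except OSError:  # BlockingIOError / EAGAIN ("Resource temporarily unavailable")
    contended = True

print(contended)  # expected: True (the second lock attempt is refused)
first.close()
second.close()
```

The point of the sketch is only that the filesystem lock has its own lifetime, tied to open file handles, so a task-level lock cannot fully rule out contention at the HDF5 layer.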
As noted in the comment above ( https://github.com/dask/dask/issues/8816#issuecomment-1083850276 ), this may come down to defining an environment variable. Copying to be explicit:
`HDF5_USE_FILE_LOCKING=FALSE`
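For reference, HDF5 checks this variable when the library is first loaded, so it has to be set before `tables`/`h5py` is imported anywhere in the process. A minimal sketch of how a test session might apply it (the exact placement in conftest/CI config is a choice, not something prescribed by this issue):

```python
import os

# HDF5 >= 1.10 reads this variable at library-load time; "FALSE"
# disables the OS-level file lock entirely, so only our own
# application-level locking (e.g. dask's Lock) serializes writers.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

# Only now is it safe to import the HDF5 bindings:
# import tables
```

Setting it after `tables` is already imported has no effect, which is an easy way for this fix to silently not apply.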
Yes, I was about to open a PR testing that. I'm concerned that it came up at all, though, in the context of using a dask lock. It seems that the filesystem locking has a different perspective on things.
We can add that environment variable and say "we trust our lock more", of course, but I was hoping to understand what's going on a bit more.
Yep that makes sense
That said, CI tends to run on slower systems where things like locking might take a while. Issues around file locking (particularly on Windows CI) have been an annoyance in conda/conda-build for years and have spurred many changes to work around them.
Add in the fact that we have two mechanisms doing the locking here, and we are bound to have issues. Dropping one in CI seems pretty defensible.
If we are seeing this outside CI on a reasonably performant system, that might be worth more exploration
Fair enough -- I have opened #9154; if that seems to improve things, I'm happy to move on.
PR ( https://github.com/dask/dask/pull/9154 ) has been merged. Do we want to close or do we want to wait for a bit to see how things go (note we can always reopen if new problems arise)?
I'll keep an eye on it tomorrow, and if things seem better by EOD, I can close it then
Hmm...looks like the issue persists ( https://github.com/dask/dask/pull/9161#issuecomment-1146213734 )
I took a look at the last month of CI runs on main, and this has happened twice in the last month:
https://github.com/dask/dask/runs/7291695922?check_suite_focus=true
https://github.com/dask/dask/runs/7043653348?check_suite_focus=true
There was also an hdf-related segfault a few days ago which may be related: https://github.com/dask/dask/runs/7364095757?check_suite_focus=true
This is mostly a note to myself that this is still happening (though perhaps at reduced frequency?)