
`test_blockwise_dataframe_io[True-False-hdf]` is flaky on osx CI

Open jcrist opened this issue 4 years ago • 15 comments

I noticed today that this test periodically fails due to a PyTables file-locking issue (it may have started earlier).

Traceback
_________________ test_blockwise_dataframe_io[True-False-hdf] __________________
[gw0] darwin -- Python 3.8.12 /Users/runner/miniconda3/envs/test-environment/bin/python

c = <Client: 'tcp://127.0.0.1:50922' processes=2 threads=2, memory=28.00 GiB>
tmpdir = local('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/popen-gw0/test_blockwise_dataframe_io_Tr3')
io = 'hdf', fuse = False, from_futures = True

    @pytest.mark.filterwarnings(
        "ignore:Running on a single-machine scheduler when a distributed client "
        "is active might lead to unexpected results."
    )
    @pytest.mark.parametrize(
        "io",
        ["parquet-pyarrow", "parquet-fastparquet", "csv", "hdf"],
    )
    @pytest.mark.parametrize("fuse", [True, False, None])
    @pytest.mark.parametrize("from_futures", [True, False])
    def test_blockwise_dataframe_io(c, tmpdir, io, fuse, from_futures):
        pd = pytest.importorskip("pandas")
        dd = pytest.importorskip("dask.dataframe")
    
        df = pd.DataFrame({"x": [1, 2, 3] * 5, "y": range(15)})
    
        if from_futures:
            parts = [df.iloc[:5], df.iloc[5:10], df.iloc[10:15]]
            futs = c.scatter(parts)
            ddf0 = dd.from_delayed(futs, meta=parts[0])
        else:
            ddf0 = dd.from_pandas(df, npartitions=3)
    
        if io.startswith("parquet"):
            if io == "parquet-pyarrow":
                pytest.importorskip("pyarrow.parquet")
                engine = "pyarrow"
            else:
                pytest.importorskip("fastparquet")
                engine = "fastparquet"
            ddf0.to_parquet(str(tmpdir), engine=engine)
            ddf = dd.read_parquet(str(tmpdir), engine=engine)
        elif io == "csv":
            ddf0.to_csv(str(tmpdir), index=False)
            ddf = dd.read_csv(os.path.join(str(tmpdir), "*"))
        elif io == "hdf":
            pytest.importorskip("tables")
            fn = str(tmpdir.join("h5"))
>           ddf0.to_hdf(fn, "/data*")

dask/tests/test_distributed.py:370: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/dataframe/core.py:1620: in to_hdf
    return to_hdf(self, path_or_buf, key, mode, append, **kwargs)
dask/dataframe/io/hdf.py:251: in to_hdf
    compute_as_if_collection(
dask/base.py:319: in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:2167: in gather
    return self.sync(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:311: in sync
    return sync(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:378: in sync
    raise exc.with_traceback(tb)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/utils.py:351: in f
    result = yield future
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tornado/gen.py:762: in run
    value = future.result()
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
dask/dataframe/io/hdf.py:27: in _pd_to_hdf
    pd_to_hdf(*args, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/core/generic.py:2606: in to_hdf
    pytables.to_hdf(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:277: in to_hdf
    with HDFStore(
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:561: in __init__
    self.open(mode=mode, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/pandas/io/pytables.py:710: in open
    self._handle = tables.open_file(self._path, self._mode, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tables/file.py:300: in open_file
    return File(filename, mode, title, root_uep, filters, **kwargs)
../../../miniconda3/envs/test-environment/lib/python3.8/site-packages/tables/file.py:750: in __init__
    self._g_new(filename, mode, **params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   tables.exceptions.HDF5ExtError: HDF5 error back trace
E   
E     File "H5F.c", line 620, in H5Fopen
E       unable to open file
E     File "H5VLcallback.c", line 3502, in H5VL_file_open
E       failed to iterate over available VOL connector plugins
E     File "H5PLpath.c", line 579, in H5PL__path_table_iterate
E       can't iterate over plugins in plugin path '(null)'
E     File "H5PLpath.c", line 620, in H5PL__path_table_iterate_process_path
E       can't open directory: /Users/runner/miniconda3/envs/test-environment/lib/hdf5/plugin
E     File "H5VLcallback.c", line 3351, in H5VL__file_open
E       open failed
E     File "H5VLnative_file.c", line 97, in H5VL__native_file_open
E       unable to open file
E     File "H5Fint.c", line 1898, in H5F_open
E       unable to lock the file
E     File "H5FD.c", line 1625, in H5FD_lock
E       driver lock request failed
E     File "H5FDsec2.c", line 1002, in H5FD__sec2_lock
E       unable to lock file, errno = 35, error message = 'Resource temporarily unavailable'
E   
E   End of HDF5 error back trace
E   
E   Unable to open/create file '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/popen-gw0/test_blockwise_dataframe_io_Tr3/h5'

tables/hdf5extension.pyx:486: HDF5ExtError

Log from failing test: https://github.com/dask/dask/runs/5558289646?check_suite_focus=true#step:6:21912

jcrist avatar Mar 15 '22 20:03 jcrist
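The `errno = 35` ("Resource temporarily unavailable") at the bottom of the trace is EAGAIN from a non-blocking advisory lock request. A minimal stdlib sketch of that failure mode, purely illustrative and not HDF5's actual code (HDF5 takes the lock inside `H5FDsec2.c`):

```python
import errno
import fcntl
import os
import tempfile

# Two independent open() calls on the same file get independent flock()
# locks, so a second exclusive, non-blocking request fails with EAGAIN
# (errno 35 on macOS), just like the HDF5 error in the traceback.
fd, path = tempfile.mkstemp()
fd2 = os.open(path, os.O_RDWR)
try:
    fcntl.flock(fd, fcntl.LOCK_EX)  # first handle holds an exclusive lock
    try:
        fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
        conflicted = False
    except OSError as e:
        conflicted = e.errno == errno.EAGAIN
finally:
    os.close(fd)
    os.close(fd2)
    os.remove(path)

print(conflicted)
```

Whether the second open comes from another worker process or another handle in the same process, the OS-level lock conflicts either way.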

Just to note, a next step here would be to try to reproduce this on an intel-mac. :)

pavithraes avatar Mar 30 '22 16:03 pavithraes

@mrocklin ran into a similar failure for `test_blockwise_dataframe_io[True-None-hdf]` (note the pytest parametrize values are slightly different)

jrbourbeau avatar Mar 30 '22 19:03 jrbourbeau

We might be getting close to the point where we can just drop support for PyTables?

mrocklin avatar Mar 30 '22 19:03 mrocklin

Maybe relevant ( https://github.com/PyTables/PyTables/issues/888#issuecomment-790198987 )?

jakirkham avatar Mar 31 '22 00:03 jakirkham

`test_blockwise_dataframe_io[True-None-hdf]` (the same test @mrocklin hit) is now also failing on Ubuntu with Python 3.10.

pavithraes avatar May 25 '22 21:05 pavithraes

As @pavithraes notes, this has been happening regularly on Python 3.10 + Ubuntu for the last week or more. I see there was a new release of h5py around the same time.

Interestingly, we are using a Lock in the test, but the filesystem-level lock doesn't always seem to agree.

ian-r-rose avatar Jun 01 '22 19:06 ian-r-rose

As noted in the comment above ( https://github.com/dask/dask/issues/8816#issuecomment-1083850276 ), this may come down to defining an environment variable. Copying to be explicit:

HDF5_USE_FILE_LOCKING=FALSE

jakirkham avatar Jun 01 '22 20:06 jakirkham
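For reference, a sketch of applying that variable from Python (the exact mechanism used in the CI fix may differ). HDF5 reads the variable when the shared library initializes, so it has to be set before the first import of `h5py` or `tables`:

```python
import os

# HDF5 consults HDF5_USE_FILE_LOCKING when the library loads, so set it
# before tables (pytables) or h5py is imported anywhere in the process.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

# ... only import tables / h5py / pandas HDF code after this point ...
```

Exporting it in the CI job's environment (e.g. `HDF5_USE_FILE_LOCKING=FALSE` in the workflow's `env:` block) achieves the same thing without ordering concerns.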

Yes, I was about to open a PR testing that. I'm concerned that it came up at all, though, in the context of using a dask lock. It seems that the filesystem locking has a different perspective on things.

We can add that environment variable and say "we trust our lock more", of course, but I was hoping to understand what's going on a bit more.

ian-r-rose avatar Jun 01 '22 20:06 ian-r-rose

Yep that makes sense

That said, CI tends to run on slower systems where things like locking might take a while. Issues around file locking (particularly on Windows CI) have been an annoyance in conda/conda-build for years and have spurred many changes to work around them.

Add in the fact that we have two mechanisms doing the locking here, and we are bound to have issues. Dropping one in CI seems pretty defensible.

If we are seeing this outside CI on a reasonably performant system, that might be worth more exploration.

jakirkham avatar Jun 01 '22 20:06 jakirkham

Fair enough -- I have opened #9154. If that seems to improve things, I'm happy to move on.

ian-r-rose avatar Jun 01 '22 20:06 ian-r-rose

PR ( https://github.com/dask/dask/pull/9154 ) has been merged. Do we want to close or do we want to wait for a bit to see how things go (note we can always reopen if new problems arise)?

jakirkham avatar Jun 03 '22 00:06 jakirkham

I'll keep an eye on it tomorrow, and if things seem better by EOD, I can close it then

ian-r-rose avatar Jun 03 '22 00:06 ian-r-rose

Hmm...looks like the issue persists ( https://github.com/dask/dask/pull/9161#issuecomment-1146213734 )

jakirkham avatar Jun 03 '22 17:06 jakirkham

Hmm...looks like the issue persists ( #9161 (comment) )

Shucks

ian-r-rose avatar Jun 03 '22 18:06 ian-r-rose

I took a look at the last month of CI runs on main, and this has happened twice in the last month:
https://github.com/dask/dask/runs/7291695922?check_suite_focus=true
https://github.com/dask/dask/runs/7043653348?check_suite_focus=true

There was also an hdf-related segfault a few days ago which may be related: https://github.com/dask/dask/runs/7364095757?check_suite_focus=true

This is mostly a note to myself that this is still happening (though with some reduced frequency?)

ian-r-rose avatar Jul 20 '22 15:07 ian-r-rose