dask-histogram
KeyError when calling 'dask_histogram.boost.Histogram().fill()' with a dask dataframe
Dear experts,
I am starting to use dask and dask_histogram, but I get an error when I try to fill a dask_histogram.boost.Histogram from a dask dataframe, as shown below:
import numpy as np
import dask.dataframe as dd
import dask_histogram.boost as dhb

# this is reproducible
d = {
    'A': np.random.normal(0., 1., 100000),
    'W': np.random.uniform(0.2, 0.8, 100000),
}
ddf = dd.from_dict(d, npartitions=10)

h = dhb.Histogram(
    dhb.axis.Regular(10, -3, 3),
    storage=dhb.storage.Weight()
).fill(ddf['A'], weight=ddf['W']).compute()
print(h)
This example gives me:
Traceback (most recent call last):
File "/gpfs/home/belle2/rlebouch/darkphotontodimuons/background_rejection/testdask.py", line 15, in <module>
).fill(ddf['A'], weight=ddf['W']).compute()
^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 372, in compute
(result,) = compute(self, traverse=False, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 653, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 422, in collections_to_dsk
dsk = opt(dsk, keys, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask_histogram/core.py", line 514, in optimize
dsk = fuse_roots(dsk, keys=keys) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 1564, in fuse_roots
new = toolz.merge(layer, *[layers[dep] for dep in deps])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/toolz/dicttoolz.py", line 39, in merge
rv.update(d)
File "<frozen _collections_abc>", line 836, in __iter__
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 641, in __iter__
return iter(self._dict)
^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 607, in _dict
dsk = _make_blockwise_graph(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 958, in _make_blockwise_graph
itertools.product(*[range(dims[i]) for i in out_indices])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 958, in <listcomp>
itertools.product(*[range(dims[i]) for i in out_indices])
~~~~^^^
KeyError: '.0'
Is it really possible to fill a histogram from a dataframe?
I currently use:
Name: dask-histogram Version: 2024.12.1
Name: dask Version: 2024.12.1
Name: boost_histogram Version: 1.4.1
This problem stems from the new dask.dataframe backend that is based on dask-expr; dask-histogram isn't compatible at this time. More info here: https://github.com/dask-contrib/dask-histogram/pull/130
The code will work with the Dask config environment variable DASK_DATAFRAME__QUERY_PLANNING=False or with dask.config.set("dataframe.query-planning", False) in Python code.
I added your suggestion to my code, but it solved nothing, and I still have the same error message.
Can you share more details? Did you export the environment variable or use the dask.config API?
I tried with the dask.config API.
Hmm yeah I can only make it work with the env variable but not with the config; maybe it's an artifact of mixing dask-histogram & dask.dataframe, I'm not sure. That's probably another independent issue. But anyway, this is the workaround for now:
~/software/repos/dask-histogram main ❯ DASK_DATAFRAME__QUERY_PLANNING=False ipython
Python 3.12.8 (main, Dec 3 2024, 18:42:41) [Clang 16.0.0 (clang-1600.0.26.4)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
   ...: import dask.dataframe as dd
   ...: import dask_histogram.boost as dhb
   ...:
   ...: # this is reproducible
   ...: d = {
   ...:     'A': np.random.normal(0., 1., 100000),
   ...:     'W': np.random.uniform(0.2, 0.8, 100000),
   ...: }
   ...: ddf = dd.from_dict(d, npartitions=10)
   ...:
   ...: h = dhb.Histogram(
   ...:     dhb.axis.Regular(10, -3, 3),
   ...:     storage=dhb.storage.Weight()
   ...: ).fill(ddf['A'], weight=ddf['W']).compute()
   ...: print(h)
/Users/ddavis/software/repos/dask-histogram/.venv/lib/python3.12/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
warnings.warn(
┌─────────────────────────────────────────────────────┐
[-inf, -3) 66.92 │▎ │
[ -3, -2.4) 357.9 │█▋ │
[-2.4, -1.8) 1391 │██████▍ │
[-1.8, -1.2) 3997 │██████████████████▎ │
[-1.2, -0.6) 7929 │████████████████████████████████████▎ │
[-0.6, 0) 1.139e+04 │████████████████████████████████████████████████████ │
[ 0, 0.6) 1.111e+04 │██████████████████████████████████████████████████▊ │
[ 0.6, 1.2) 8052 │████████████████████████████████████▊ │
[ 1.2, 1.8) 3914 │█████████████████▉ │
[ 1.8, 2.4) 1368 │██████▎ │
[ 2.4, 3) 324.1 │█▌ │
[ 3, inf) 63.99 │▎ │
└─────────────────────────────────────────────────────┘
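For a script rather than an interactive shell, the same environment variable can probably be set from Python, provided this happens before dask is imported for the first time in the process. The following is only a minimal sketch under that assumption and was not tested in this thread. The try/except guard and the `query_planning` variable are illustrative additions, not part of dask-histogram.

```python
import os

# Set the variable before dask is imported anywhere in this process;
# dask reads DASK_* environment variables when its config module loads.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

try:
    import dask
except ImportError:  # keeps the sketch runnable where dask is absent
    dask = None

# Confirm what dask actually picked up (None if dask is not installed).
query_planning = (
    dask.config.get("dataframe.query-planning", None) if dask is not None else None
)
print("dataframe.query-planning =", query_planning)

# Only after this point: import dask.dataframe and dask_histogram.boost,
# then build and fill the histogram exactly as in the original example.
```

If the printed value is not False, dask was most likely imported earlier in the process, and exporting the variable in the shell as shown above is the safer route. As the FutureWarning in the session above notes, the legacy dataframe implementation is deprecated, so this is a stopgap.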