datashader
datashader copied to clipboard
Add Python 3.12 support
With numba now supporting Python 3.12, we should also add support to it.
We need conda-build, to be released on conda-forge before we can solve on conda.
Builds on top of https://github.com/holoviz/datashader/pull/1314
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 85.60%. Comparing base (
09da5a0
) to head (8b5d3d5
).
:exclamation: Current head 8b5d3d5 differs from pull request most recent head 67d73f5. Consider uploading reports for the commit 67d73f5 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #1317 +/- ##
==========================================
- Coverage 85.63% 85.60% -0.04%
==========================================
Files 51 51
Lines 11286 11287 +1
==========================================
- Hits 9665 9662 -3
- Misses 1621 1625 +4
Tests passed, but get this warning in Python 3.12
/home/runner/work/datashader/datashader/datashader/datashape/lexer.py:20: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
return ast.parse('u' + s).body[0].value.s
Things I have done in this PR, other than upgrading to Python 3.12.
- I have updated the test CI to align more with our other repos.
- Don't use the new dask query-planner /
dask-expr
as it does not currently work with Datashader. I don't know what is needed to get it to work. There are at least two problems:- This line is no longer valid with
dask-expr
(likely related to this issue: https://github.com/dask/dask-expr/issues/952) https://github.com/holoviz/datashader/blob/09da5a05d3e49071ac3ce841e7ceccd4d95759fb/datashader/data_libraries/dask.py#L147 - It does not have
pack_partitions
in the8_Polygons.ipynb
notebook (this seem easier to fix).
- This line is no longer valid with
Things I have done in this PR, other than upgrading to Python 3.12.
I have updated the test CI to align more with our other repos.
Don't use the new dask query-planner /
dask-expr
as it does not currently work with Datashader. I don't know what is needed to get it to work. There are at least two problems:
- This line is no longer valid with
dask-expr
(likely related to this issue: dask_keys and future identifiers are different dask/dask-expr#952) https://github.com/holoviz/datashader/blob/09da5a05d3e49071ac3ce841e7ceccd4d95759fb/datashader/data_libraries/dask.py#L147- It does not have
pack_partitions
in the8_Polygons.ipynb
notebook (this seem easier to fix).
Do you have a reproducer for the invalid line?
Searching that file for the text key
brings up a couple places where they're doing explicit Dask things, calling schedule(dsk, keys)
and constructing low level task graphs.
One option would be to convert to a legacy dask dataframe, where all of those things would continue working, at least until we entirely remove that system.
A MRE for the dask keys could be this:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import datashader as ds
df = pd.DataFrame(
{
"x": np.array(([0.0] * 10 + [1] * 10)),
"y": np.array(([0.0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)),
"i32": np.arange(20, dtype="i4"),
}
)
ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.repartition(npartitions=2) # If I comment out this lines it does not error.
c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1), y_range=(0, 1))
c.points(ddf, "x", "y", ds.count("i32"))
thx @Hoxbro
The issue at play here is as follows:
df.__dask_graph__()
df.__dask_keys__()
both change the expression (a optimization step), which makes the keys in the graph no longer match df._name
that you pass into da.Array, which causes the exception.
There are different ways of addressing this:
- you can optimise the DataFrame before extracting the graph, e.g.
df = df.optimize()
before you access the keys and the graph. This only works when it's an expression based DataFrame, not a legacy frame. This would need to be added in line 103 - You might be able to convert the DataFrame directly to a Dask Array, e.g.
df.values
ordf.to_dask_array()
, this takes care of everything under the hood, but I guess there is probably a reason that you are doing things differently at the moment?
you can optimise the DataFrame before extracting the graph, e.g. df = df.optimize() before you access the keys and the graph. This only works when it's an expression based DataFrame, not a legacy frame. This would need to be added in line 103
df = df.optimize()
line makes a lot of the red test green again :+1:
I will push it and see how the CI likes it.
However, I can see at least one problem still exists, where dask_expr
now takes "precedence" over the custom implementation of the dask dataframe. Do you also have a one-liner for that?
import dask
dask.config.set({"dataframe.query-planning": True})
import geopandas as gpd
import dask.dataframe as dd
import spatialpandas as sp
from geodatasets import get_path
world = sp.GeoDataFrame(gpd.read_file(get_path("nybb")))
type(dd.from_pandas(world, npartitions=2)) # dask_expr._collection.DataFrame
Edit: I ran one cell, restarted the kernel and then ran the second.
There seem to be one other type of failing tests:
import dask.dataframe as dd
import pandas as pd
import datashader as ds
df = pd.DataFrame({"x": [0] * 5, "y": [0] * 5})
ddf = dd.from_pandas(df, npartitions=2)
cvs = ds.Canvas(plot_width=10, plot_height=10)
agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 9
6 ddf = dd.from_pandas(df, npartitions=2)
8 cvs = ds.Canvas(plot_width=10, plot_height=10)
----> 9 agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
File [~/projects/holoviz/repos/datashader/datashader/core.py:472](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/core.py#line=471), in Canvas.line(self, source, x, y, agg, axis, geometry, line_width, antialias)
464 if not isinstance(non_cat_agg, (
465 rd.any, rd.count, rd.max, rd.min, rd.sum, rd.summary, rd._sum_zero,
466 rd._first_or_last, rd.mean, rd.max_n, rd.min_n, rd._first_n_or_last_n,
467 rd._max_or_min_row_index, rd._max_n_or_min_n_row_index, rd.where,
468 )):
469 raise NotImplementedError(
470 f"{type(non_cat_agg)} reduction not implemented for antialiased lines")
--> 472 return bypixel(source, self, glyph, agg, antialias=glyph.antialiased)
File [~/projects/holoviz/repos/datashader/datashader/core.py:1335](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/core.py#line=1334), in bypixel(source, canvas, glyph, agg, antialias)
1333 with warnings.catch_warnings():
1334 warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
-> 1335 return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)
File [~/projects/holoviz/repos/datashader/datashader/utils.py:114](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/utils.py#line=113), in Dispatcher.__call__(self, head, *rest, **kwargs)
112 typ = type(head)
113 if typ in lk:
--> 114 return lk[typ](head, *rest, **kwargs)
115 for cls in getmro(typ)[1:]:
116 if cls in lk:
File [~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:51](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/data_libraries/dask.py#line=50), in dask_pipeline(df, schema, canvas, glyph, summary, antialias, cuda)
48 graph = df.__dask_graph__()
50 dsk.update(optimize(graph, keys))
---> 51 return scheduler(dsk, name)
File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py:90](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py#line=89), in get(dsk, keys, cache, num_workers, pool, **kwargs)
87 elif isinstance(pool, multiprocessing.pool.Pool):
88 pool = MultiprocessingPoolExecutor(pool)
---> 90 results = get_async(
91 pool.submit,
92 pool._max_workers,
93 dsk,
94 keys,
95 cache=cache,
96 get_id=_thread_get_id,
97 pack_exception=pack_exception,
98 **kwargs,
99 )
101 # Cleanup pools associated to dead threads
102 with pools_lock:
File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:512](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=511), in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
510 _execute_task(task, data) # Re-execute locally
511 else:
--> 512 raise_exception(exc, tb)
513 res, worker_id = loads(res_info)
514 state["cache"][key] = res
File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:320](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=319), in reraise(exc, tb)
318 if exc.__traceback__ is not tb:
319 raise exc.with_traceback(tb)
--> 320 raise exc
File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:225](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=224), in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
223 try:
224 task, data = loads(task_info)
--> 225 result = _execute_task(task, data)
226 id = get_id()
227 result = dumps((result, id))
File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py:127](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py#line=126), in _execute_task(arg, cache, dsk)
123 func, args = arg[0], arg[1:]
124 # Note: Don't assign the subtask results to a variable. numpy detects
125 # temporaries by their reference count and can execute certain
126 # operations in-place.
--> 127 return func(*(_execute_task(a, cache) for a in args))
128 elif not ishashable(arg):
129 return arg
File [~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:245](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/data_libraries/dask.py#line=244), in line.<locals>.chunk(df, df2)
243 plot_start = True
244 if df2 is not None:
--> 245 df = concat([df.iloc[-1:], df2])
246 plot_start = False
247 aggs = create(shape)
AttributeError: 'tuple' object has no attribute 'iloc'
That's not as trivial as this one. There are 2 options:
-
The ideal solution would be to make your
DaskGeoDataFrame
compatible with the expression logic. geo pandas did this a few weeks ago but I think it isn't merged yet. -
You can hack around this to always use legacy frame when someone calls the constructor of
DaskGeoDataFrame
, but this is only a short term solution.
What's the priority to make this passing?
The ideal solution would be to make your DaskGeoDataFrame compatible with the expression logic. geo pandas did this a few weeks ago but I think it isn't merged yet.
I'll have a look at the geopandas PR and see if I can crib from them. In the long run we definitely want full compatibility but will make a call whether to release with a short term patch based on the effort required.
I think the only test failing now is related to SpatialPandas Dask, which likely needs to be updated in spatialpandas itself. Because of this, the environment variable is needed for now.
The last failing test outside of that was related to attribute assignment being ignored with dask-expr, whereas it would work with classic dask.dataframe; see an MRE below. It seems to be related to https://github.com/dask/dask-expr/issues/785, and can open an issue upstream if needed.
import dask
import importlib
import pandas as pd
import dask.dataframe as dd
for b in [True, False]:
print("\ndataframe.query-planning =", b)
dask.config.set({"dataframe.query-planning": b})
dd = importlib.reload(dd)
df = pd.DataFrame(data=dict(cat1=["a", "b", "c", "a", "b"], cat2=["a", "b", "c", "a", "b"]))
ddf = dd.from_pandas(df, npartitions=2)
ddf["cat1"] = ddf["cat1"].astype("category")
ddf.cat2 = ddf.cat2.astype("category")
print(ddf.dtypes)
this was fixed on Friday fortunately
Wow, thanks everybody! It takes a village!