datashader icon indicating copy to clipboard operation
datashader copied to clipboard

Add Python 3.12 support

Open hoxbro opened this issue 4 months ago • 2 comments

With numba now supporting Python 3.12, we should also add support to it.

We need conda-build, to be released on conda-forge before we can solve on conda.

Builds on top of https://github.com/holoviz/datashader/pull/1314

hoxbro avatar Feb 08 '24 07:02 hoxbro

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 85.60%. Comparing base (09da5a0) to head (8b5d3d5).

:exclamation: Current head 8b5d3d5 differs from pull request most recent head 67d73f5. Consider uploading reports for the commit 67d73f5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1317      +/-   ##
==========================================
- Coverage   85.63%   85.60%   -0.04%     
==========================================
  Files          51       51              
  Lines       11286    11287       +1     
==========================================
- Hits         9665     9662       -3     
- Misses       1621     1625       +4     

codecov[bot] avatar Feb 08 '24 08:02 codecov[bot]

Tests passed, but get this warning in Python 3.12

  /home/runner/work/datashader/datashader/datashader/datashape/lexer.py:20: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
    return ast.parse('u' + s).body[0].value.s

hoxbro avatar Feb 08 '24 09:02 hoxbro

Things I have done in this PR, other than upgrading to Python 3.12.

  • I have updated the test CI to align more with our other repos.
  • Don't use the new dask query-planner / dask-expr as it does not currently work with Datashader. I don't know what is needed to get it to work. There are at least two problems:
    • This line is no longer valid with dask-expr (likely related to this issue: https://github.com/dask/dask-expr/issues/952) https://github.com/holoviz/datashader/blob/09da5a05d3e49071ac3ce841e7ceccd4d95759fb/datashader/data_libraries/dask.py#L147
    • It does not have pack_partitions in the 8_Polygons.ipynb notebook (this seem easier to fix).

hoxbro avatar Apr 04 '24 16:04 hoxbro

Things I have done in this PR, other than upgrading to Python 3.12.

  • I have updated the test CI to align more with our other repos.

  • Don't use the new dask query-planner / dask-expr as it does not currently work with Datashader. I don't know what is needed to get it to work. There are at least two problems:

    • This line is no longer valid with dask-expr (likely related to this issue: dask_keys and future identifiers are different dask/dask-expr#952) https://github.com/holoviz/datashader/blob/09da5a05d3e49071ac3ce841e7ceccd4d95759fb/datashader/data_libraries/dask.py#L147
    • It does not have pack_partitions in the 8_Polygons.ipynb notebook (this seem easier to fix).

Do you have a reproducer for the invalid line?

phofl avatar Apr 05 '24 14:04 phofl

Searching that file for the text key brings up a couple places where they're doing explicit Dask things, calling schedule(dsk, keys) and constructing low level task graphs.

One option would be to convert to a legacy dask dataframe, where all of those things would continue working, at least until we entirely remove that system.

mrocklin avatar Apr 05 '24 14:04 mrocklin

A MRE for the dask keys could be this:

import dask.dataframe as dd
import pandas as pd
import numpy as np
import datashader as ds

df = pd.DataFrame(
    {
        "x": np.array(([0.0] * 10 + [1] * 10)),
        "y": np.array(([0.0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)),
        "i32": np.arange(20, dtype="i4"),
    }
)

ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.repartition(npartitions=2)  # If I comment out this lines it does not error. 
c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1), y_range=(0, 1))
c.points(ddf, "x", "y", ds.count("i32"))

hoxbro avatar Apr 05 '24 14:04 hoxbro

thx @Hoxbro

The issue at play here is as follows:

df.__dask_graph__()
df.__dask_keys__()

both change the expression (a optimization step), which makes the keys in the graph no longer match df._name that you pass into da.Array, which causes the exception.

There are different ways of addressing this:

  • you can optimise the DataFrame before extracting the graph, e.g. df = df.optimize() before you access the keys and the graph. This only works when it's an expression based DataFrame, not a legacy frame. This would need to be added in line 103
  • You might be able to convert the DataFrame directly to a Dask Array, e.g. df.values or df.to_dask_array(), this takes care of everything under the hood, but I guess there is probably a reason that you are doing things differently at the moment?

phofl avatar Apr 05 '24 14:04 phofl

you can optimise the DataFrame before extracting the graph, e.g. df = df.optimize() before you access the keys and the graph. This only works when it's an expression based DataFrame, not a legacy frame. This would need to be added in line 103

df = df.optimize() line makes a lot of the red test green again :+1:

I will push it and see how the CI likes it.

However, I can see at least one problem still exists, where dask_expr now takes "precedence" over the custom implementation of the dask dataframe. Do you also have a one-liner for that?

image

import dask

dask.config.set({"dataframe.query-planning": True})

import geopandas as gpd
import dask.dataframe as dd
import spatialpandas as sp
from geodatasets import get_path

world = sp.GeoDataFrame(gpd.read_file(get_path("nybb")))
type(dd.from_pandas(world, npartitions=2))  # dask_expr._collection.DataFrame

Edit: I ran one cell, restarted the kernel and then ran the second.

hoxbro avatar Apr 05 '24 15:04 hoxbro

There seem to be one other type of failing tests:

import dask.dataframe as dd
import pandas as pd
import datashader as ds

df = pd.DataFrame({"x": [0] * 5, "y": [0] * 5})
ddf = dd.from_pandas(df, npartitions=2)

cvs = ds.Canvas(plot_width=10, plot_height=10)
agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 9
      6 ddf = dd.from_pandas(df, npartitions=2)
      8 cvs = ds.Canvas(plot_width=10, plot_height=10)
----> 9 agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())

File [~/projects/holoviz/repos/datashader/datashader/core.py:472](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/core.py#line=471), in Canvas.line(self, source, x, y, agg, axis, geometry, line_width, antialias)
    464     if not isinstance(non_cat_agg, (
    465         rd.any, rd.count, rd.max, rd.min, rd.sum, rd.summary, rd._sum_zero,
    466         rd._first_or_last, rd.mean, rd.max_n, rd.min_n, rd._first_n_or_last_n,
    467         rd._max_or_min_row_index, rd._max_n_or_min_n_row_index, rd.where,
    468     )):
    469         raise NotImplementedError(
    470             f"{type(non_cat_agg)} reduction not implemented for antialiased lines")
--> 472 return bypixel(source, self, glyph, agg, antialias=glyph.antialiased)

File [~/projects/holoviz/repos/datashader/datashader/core.py:1335](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/core.py#line=1334), in bypixel(source, canvas, glyph, agg, antialias)
   1333 with warnings.catch_warnings():
   1334     warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
-> 1335     return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)

File [~/projects/holoviz/repos/datashader/datashader/utils.py:114](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/utils.py#line=113), in Dispatcher.__call__(self, head, *rest, **kwargs)
    112 typ = type(head)
    113 if typ in lk:
--> 114     return lk[typ](head, *rest, **kwargs)
    115 for cls in getmro(typ)[1:]:
    116     if cls in lk:

File [~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:51](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/data_libraries/dask.py#line=50), in dask_pipeline(df, schema, canvas, glyph, summary, antialias, cuda)
     48 graph = df.__dask_graph__()
     50 dsk.update(optimize(graph, keys))
---> 51 return scheduler(dsk, name)

File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py:90](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py#line=89), in get(dsk, keys, cache, num_workers, pool, **kwargs)
     87     elif isinstance(pool, multiprocessing.pool.Pool):
     88         pool = MultiprocessingPoolExecutor(pool)
---> 90 results = get_async(
     91     pool.submit,
     92     pool._max_workers,
     93     dsk,
     94     keys,
     95     cache=cache,
     96     get_id=_thread_get_id,
     97     pack_exception=pack_exception,
     98     **kwargs,
     99 )
    101 # Cleanup pools associated to dead threads
    102 with pools_lock:

File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:512](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=511), in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    510         _execute_task(task, data)  # Re-execute locally
    511     else:
--> 512         raise_exception(exc, tb)
    513 res, worker_id = loads(res_info)
    514 state["cache"][key] = res

File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:320](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=319), in reraise(exc, tb)
    318 if exc.__traceback__ is not tb:
    319     raise exc.with_traceback(tb)
--> 320 raise exc

File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:225](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py#line=224), in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    223 try:
    224     task, data = loads(task_info)
--> 225     result = _execute_task(task, data)
    226     id = get_id()
    227     result = dumps((result, id))

File [~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py:127](http://localhost:8888/home/shh/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py#line=126), in _execute_task(arg, cache, dsk)
    123     func, args = arg[0], arg[1:]
    124     # Note: Don't assign the subtask results to a variable. numpy detects
    125     # temporaries by their reference count and can execute certain
    126     # operations in-place.
--> 127     return func(*(_execute_task(a, cache) for a in args))
    128 elif not ishashable(arg):
    129     return arg

File [~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:245](http://localhost:8888/lab/workspaces/auto-u/tree/repos/datashader/datashader/data_libraries/dask.py#line=244), in line.<locals>.chunk(df, df2)
    243 plot_start = True
    244 if df2 is not None:
--> 245     df = concat([df.iloc[-1:], df2])
    246     plot_start = False
    247 aggs = create(shape)

AttributeError: 'tuple' object has no attribute 'iloc'

hoxbro avatar Apr 05 '24 16:04 hoxbro

That's not as trivial as this one. There are 2 options:

  • The ideal solution would be to make your DaskGeoDataFrame compatible with the expression logic. geo pandas did this a few weeks ago but I think it isn't merged yet.

  • You can hack around this to always use legacy frame when someone calls the constructor of DaskGeoDataFrame, but this is only a short term solution.

What's the priority to make this passing?

phofl avatar Apr 05 '24 17:04 phofl

The ideal solution would be to make your DaskGeoDataFrame compatible with the expression logic. geo pandas did this a few weeks ago but I think it isn't merged yet.

I'll have a look at the geopandas PR and see if I can crib from them. In the long run we definitely want full compatibility but will make a call whether to release with a short term patch based on the effort required.

philippjfr avatar Apr 05 '24 17:04 philippjfr

I think the only test failing now is related to SpatialPandas Dask, which likely needs to be updated in spatialpandas itself. Because of this, the environment variable is needed for now.

The last failing test outside of that was related to attribute assignment being ignored with dask-expr, whereas it would work with classic dask.dataframe; see an MRE below. It seems to be related to https://github.com/dask/dask-expr/issues/785, and can open an issue upstream if needed.

import dask
import importlib
import pandas as pd
import dask.dataframe as dd

for b in [True, False]:
    print("\ndataframe.query-planning =", b)
    dask.config.set({"dataframe.query-planning": b})
    dd = importlib.reload(dd)

    df = pd.DataFrame(data=dict(cat1=["a", "b", "c", "a", "b"], cat2=["a", "b", "c", "a", "b"]))
    ddf = dd.from_pandas(df, npartitions=2)

    ddf["cat1"] = ddf["cat1"].astype("category")
    ddf.cat2 = ddf.cat2.astype("category")
    print(ddf.dtypes)

image

hoxbro avatar Apr 08 '24 15:04 hoxbro

this was fixed on Friday fortunately

phofl avatar Apr 08 '24 15:04 phofl

Wow, thanks everybody! It takes a village!

jbednar avatar Apr 09 '24 21:04 jbednar