`.str.split` coerces `dtype` to `object`

Open noahblakesmith opened this issue 9 months ago • 2 comments

Describe the issue:

The `.str.split` method coerces the reported dtype from `string[pyarrow]` to `object`. This is also inconsistent with pandas, which keeps `string[pyarrow]` (if that matters).

Minimal complete verifiable example:

Code

import dask.dataframe as dd
import pandas as pd

data = {"c": ["a,b,c", "d,e,f", "g,h,i"]}

# `pandas`
df = pd.DataFrame(data, dtype="string[pyarrow]")
print(df.dtypes)
print(df["c"].str.split(",", n=1, expand=True).dtypes)

# `dask`
ddf = dd.from_pandas(df)
print(ddf.dtypes)
print(ddf["c"].str.split(",", n=1, expand=True).dtypes)

Output

c    string[pyarrow]
dtype: object
0    string[pyarrow]
1    string[pyarrow]
dtype: object
c    string[pyarrow]
dtype: object
0    object
1    object
dtype: object

Environment:

  • Dask version: 2025.3.0
  • Python version: 3.10.16
  • Operating System: Ubuntu 24.04.2 LTS
  • Install method (conda, pip, source): pip

noahblakesmith, Apr 11 '25 07:04

Note that after computing, the dtypes are correct:

In [6]: ddf["c"].str.split(",", n=1, expand=True).compute().dtypes
Out[6]: 
0    string[pyarrow]
1    string[pyarrow]
dtype: object

so it's just the `_meta` after `.str.split` that is off. https://github.com/dask/dask/blob/0fa5e18d511c49f1a9cd5f98c675a9f6cd2fc02f/dask/dataframe/dask_expr/_str_accessor.py#L186-L189 looks a bit suspicious. I haven't stepped through it, but it seems like we're doing `pd.Series(['list', 'of', 'strings'])`, which pandas will infer as `object`. Maybe we need to pass a dtype there? Mind taking a look?

TomAugspurger, Apr 11 '25 20:04
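
For readers following along, the hypothesis in the comment above can be reproduced with pandas alone. This is only a sketch of the suspected mechanism, not the actual dask code: `build_meta` is a hypothetical helper, and `"string"` (default storage) stands in for `string[pyarrow]` so that the snippet does not require pyarrow. A Series built from a plain list of strings gets whatever dtype pandas infers (`object` under pandas 2.x defaults), while passing the dtype explicitly carries the string dtype through `.str.split`.

```python
import pandas as pd


def build_meta(sample, dtype=None):
    """Hypothetical helper: build a tiny Series used only to infer output dtypes."""
    return pd.Series(sample, dtype=dtype)


sample = ["a b"]

# No dtype passed: pandas infers one (object under pandas 2.x defaults).
inferred = build_meta(sample)

# Dtype passed explicitly: the string extension dtype is preserved.
explicit = build_meta(sample, dtype="string")

split_inferred = inferred.str.split(" ", n=1, expand=True)
split_explicit = explicit.str.split(" ", n=1, expand=True)

print(split_inferred.dtypes)
print(split_explicit.dtypes)
```

If the metadata in dask is built the first way, the reported dtypes would differ from the computed ones, which matches what the issue shows. Whether the linked lines actually behave like this still needs to be confirmed by stepping through them.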

Hey! I’ve prepared a fix for this issue and confirmed that it passes the tests (including one that checks the `string[pyarrow]` dtype is preserved). I’ll open a PR shortly.

Tunahanyrd, May 19 '25 10:05