`.str.split` coerces `dtype` to `object`
Describe the issue:
The `.str.split` method coerces the dtype from `string[pyarrow]` to `object`. This is also inconsistent with pandas, which preserves the string dtype (if that matters).
Minimal complete verifiable example:
Code
import dask.dataframe as dd
import pandas as pd
data = {"c": ["a,b,c", "d,e,f", "g,h,i"]}
# `pandas`
df = pd.DataFrame(data, dtype="string[pyarrow]")
print(df.dtypes)
print(df["c"].str.split(",", n=1, expand=True).dtypes)
# `dask`
ddf = dd.from_pandas(df)
print(ddf.dtypes)
print(ddf["c"].str.split(",", n=1, expand=True).dtypes)
Output
c string[pyarrow]
dtype: object
0 string[pyarrow]
1 string[pyarrow]
dtype: object
c string[pyarrow]
dtype: object
0 object
1 object
dtype: object
Environment:
- Dask version: 2025.3.0
- Python version: 3.10.16
- Operating System: Ubuntu 24.04.2 LTS
- Install method (conda, pip, source): pip
Note that after computing, the dtypes are correct:
In [6]: ddf["c"].str.split(",", n=1, expand=True).compute().dtypes
Out[6]:
0 string[pyarrow]
1 string[pyarrow]
dtype: object
So it's just the `_meta` after `.str.split` that is wrong. https://github.com/dask/dask/blob/0fa5e18d511c49f1a9cd5f98c675a9f6cd2fc02f/dask/dataframe/dask_expr/_str_accessor.py#L186-L189 looks a bit suspicious. I haven't stepped through it, but it seems like we're doing `pd.Series(['list', 'of', 'strings'])`, which pandas will infer as `object`. Maybe we need to pass a `dtype` there? Mind taking a look?
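To illustrate the suspicion above, here is a minimal pandas-only sketch. It is not the dask code at that link; the names `source`, `placeholder`, `inferred`, and `explicit` are made up for illustration, and `dtype="string"` stands in for `string[pyarrow]` so the snippet runs without pyarrow installed. It assumes the meta is built by splitting a small dummy Series, and compares building that dummy with and without the source dtype.

```python
import pandas as pd

# Stand-in for the real column (assumption: "string" instead of "string[pyarrow]").
source = pd.Series(["a,b,c", "d,e,f"], dtype="string")

pat, n = ",", 1
# Hypothetical placeholder value with enough delimiters to produce n + 1 columns.
placeholder = pat.join(["a"] * (n + 1))

# No dtype given: pandas infers one from the Python strings
# (object on pandas 2.x), so the source dtype is lost.
inferred = pd.Series([placeholder])

# Source dtype passed explicitly: the dummy matches the real column.
explicit = pd.Series([placeholder], dtype=source.dtype)

meta_inferred = inferred.str.split(pat, n=n, expand=True)
meta_explicit = explicit.str.split(pat, n=n, expand=True)

print(meta_inferred.dtypes)
print(meta_explicit.dtypes)
```

If the guess is right, the second form is the shape a fix would take: carry the dtype of the frame's existing meta into the dummy Series rather than letting pandas infer it.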
Hey! I’ve prepared a fix for this issue and confirmed that it passes the tests (including preservation of the `string[pyarrow]` dtype). I’ll open a PR shortly.