df.to_pandas() function converts string to object format instead of string dtype
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import pandas as pd
import polars as pl
# Creating a Polars DataFrame
data = {'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
df = pl.DataFrame(data)
# Converting the "Name" column to Utf8 (string) data type
df = df.with_columns(df['Name'].cast(pl.String))
# Dtypes for polars
print(f"Dtypes for polars: {df.dtypes}")
df=df.to_pandas()
# Dtypes for pandas
print(f"Dtypes for pandas: {df.dtypes}")
Log output
No response
Issue description
The title pretty much says it all. When converting to pandas, string columns come back with the object dtype instead of a string dtype, which is not ideal for larger DataFrames.
Expected behavior
String columns should be converted to a pandas string dtype instead of object.
Installed versions
--------Version info---------
Polars: 0.20.4
Index type: UInt32
Platform: Windows-10-10.0.19045-SP0
Python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 07:53:56) [MSC v.1937 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 2.2.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.2
numpy: 1.26.3
openpyxl: 3.1.2
pandas: 2.1.4
pyarrow: 14.0.2
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
That is pyarrow's default behaviour for pandas conversion (which we use for this). If you want a faster and more accurate conversion (including a much-improved string type) you need to opt in to pandas' newer arrow-backed dtypes, like so:
pd_df = df.to_pandas(use_pyarrow_extension_array=True)
pd_df.dtypes
# ID int64[pyarrow]
# Name large_string[pyarrow]
# Age int64[pyarrow]
Once the pandas ecosystem has more fully adopted the arrow dtypes (which will take a bit of time) this will likely become the default conversion path. For now it requires opt-in.
This issue should be closed because it's not a bug - pandas doesn't have a dedicated string dtype when using NumPy arrays as the backing, even if you literally pass it a NumPy string array:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame({'a': np.array(['a', 'b'], dtype='U1')}).dtypes
a object
dtype: object