
df.to_pandas() function converts string to object format instead of string dtype

Open Chuck321123 opened this issue 1 year ago • 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example


import pandas as pd
import polars as pl

# Creating a Polars DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}

df = pl.DataFrame(data)

# Converting the "Name" column to Utf8 (string) data type
df = df.with_columns(df['Name'].cast(pl.String))

# Dtypes for polars
print(f"Dtypes for polars: {df.dtypes}")

df = df.to_pandas()

# Dtypes for pandas
print(f"Dtypes for pandas: {df.dtypes}")

Log output

No response

Issue description

So the title pretty much says it all. Instead of getting a string dtype when converting to pandas, we get the object dtype, which is not ideal for larger DataFrames.

Expected behavior

That we get a string dtype instead of object.

Installed versions

--------Version info---------
Polars:               0.20.4
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 07:53:56) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Chuck321123 commented on Jan 13 '24, 11:01

That is the default pyarrow behaviour for pandas conversion (which we use for this). If you want a faster and more accurate conversion (including a much-improved string type), you need to opt in to pandas' newer Arrow-backed dtypes, like so:

pd_df = df.to_pandas(use_pyarrow_extension_array=True)
pd_df.dtypes
# ID             int64[pyarrow]
# Name    large_string[pyarrow]
# Age            int64[pyarrow]

Once the pandas ecosystem has more fully adopted the Arrow dtypes (which will take a bit of time), this will likely become the default conversion path. For now it requires explicit opt-in.
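If you already have a NumPy-backed frame from the default conversion, you can also upgrade it after the fact; a minimal sketch, assuming pandas >= 2.0 (where convert_dtypes gained the dtype_backend parameter):

import polars as pl

df = pl.DataFrame({"Name": ["Alice", "Bob", "Charlie"]})

# default conversion: strings land in a NumPy-backed object column
pd_default = df.to_pandas()
print(pd_default.dtypes)   # Name    object

# re-infer onto Arrow-backed dtypes after the fact (assumes pandas >= 2.0)
pd_arrow = pd_default.convert_dtypes(dtype_backend="pyarrow")
print(pd_arrow.dtypes)     # Name    string[pyarrow]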

alexander-beedie commented on Jan 13 '24, 12:01

This issue should be closed because it's not a bug: pandas doesn't have a dedicated string dtype when backed by NumPy arrays, even if you pass it a NumPy string array directly:

>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame({'a': np.array(['a', 'b'], dtype='U1')}).dtypes
a    object
dtype: object
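
For contrast, pandas' dedicated StringDtype exists but is strictly opt-in; a quick sketch (the exact dtype repr is an assumption and varies a little across pandas versions):

>>> pd.DataFrame({'a': ['a', 'b']}, dtype='string').dtypes
a    string[python]
dtype: object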

Wainberg commented on Jan 13 '24, 19:01