polars
Raise an error when constructing a Series or DataFrame with mixed types (e.g. string + number)
Description
I recently found a bug in my own code where I constructed a DataFrame with a mix of integers and strings, and the integers got set to null. Here's a simple illustration:
>>> pl.Series([1, '2'])
shape: (2,)
Series: '' [str]
[
null
"2"
]
>>> pl.DataFrame([1, '2'])
shape: (2, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ null     │
│ 2        │
└──────────┘
Three other options here are to 1) convert everything to dtype=object (pandas's solution, but highly inefficient), 2) automatically upcast everything to a string, and 3) raise an error. I'm a big fan of raising an error here and letting the user decide whether they want to convert the integers to strings, set them to null, or take some other action.
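To make option 3 concrete, here is a minimal sketch of a caller-side guard that can be run before construction. The helper name `ensure_single_type` is hypothetical and not part of polars; it only illustrates the kind of check I'd like the constructor to do itself.

```python
def ensure_single_type(values):
    """Raise TypeError if the non-None values do not all share one exact type.

    Hypothetical helper for illustration only; not a polars API.
    """
    # Collect the exact Python type of every non-missing value.
    seen = {type(v) for v in values if v is not None}
    if len(seen) > 1:
        names = sorted(t.__name__ for t in seen)
        raise TypeError(f"mixed types in input: {names}")
    return values


# Homogeneous input passes through unchanged and could then be handed
# to pl.Series(...); mixed input such as [1, '2'] raises TypeError.
checked = ensure_single_type([1, 2, None])
```

With a check like this, the decision to convert the integers to strings or to null them is made explicitly by the caller rather than silently during construction.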
One of the beautiful things about polars is that it makes it much harder to accidentally introduce missing values than pandas, where pretty much every operation does an implicit outer join! Avoiding implicit conversions to null during Series/DataFrame construction would further reduce the potential for bugs caused by missing values.
Edit: this also happens here:
>>> pl.Series([1, 2, 3], dtype=pl.String)
shape: (3,)
Series: '' [str]
[
null
null
null
]
pandas converts to string in this situation:
>>> pd.Series([1, 2, 3], dtype=str)[0]
'1'
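If the string conversion is what you want, it can be done explicitly on the Python side before construction. This is only a sketch: `stringify` is a hypothetical helper name, and it assumes that Python's `str()` gives an acceptable text form for every value.

```python
def stringify(values):
    """Convert every non-None value to str, keeping None as a missing value.

    Hypothetical helper for illustration only; not a polars API.
    """
    return [None if v is None else str(v) for v in values]


# The result is a homogeneous list of strings, so passing it on to
# pl.Series(..., dtype=pl.String) no longer depends on implicit conversion.
as_text = stringify([1, 2, 3])
```

Casting after construction, e.g. `pl.Series([1, 2, 3]).cast(pl.String)`, is another way to get strings without relying on the constructor's `dtype` argument.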
I think this is a very similar issue to this: https://github.com/pola-rs/polars/issues/11009.
We really should do a pass on the Python -> Polars parsing to make it more restrictive by default, instead of silently casting/nulling/truncating values.
@stinodego thoughts on polars's behavior of auto-converting pl.Series([1, '2']) to pl.Series([None, '2'])? I'd argue this should be an error.
It should either raise or cast to string; I'm not sure which.
@stinodego I would be in favour of raising an error.
I'm also in favor of raising an error. If the developers are in agreement, could you accept this issue?
Closing in favor of https://github.com/pola-rs/polars/issues/14427