[feature request] Handle several null values in null_values CSV parser argument
Since https://github.com/pola-rs/polars/issues/848, polars CSV parser can detect a string corresponding to a null value and replace it during parsing.
It is quite common to have several strings that should be interpreted as nulls (in the same column), which currently polars cannot handle. It would be great if the following behaviour was implemented:
null_values
Values to interpret as null values. You can provide a:
- str -> all values encountered equal to this string will be nulls
- List[str] -> all values encountered equal to these strings will be nulls
- Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.
This would change the behaviour when a List[str] argument is provided; the current behaviour could be achieved with a List[List[str]] argument, if desired. But I am not sure when that path would be used in practice: the Dict[str, List[str]] path seems a lot more practical to me.
This would be very useful to replace pandas.read_csv with a lazy scan_csv and still get the same results.
Note that we support most of these already.
For anyone reading this issue and looking for a way to reproduce pandas' behaviour, you can use the following:
import polars as pl

# pandas' default NA sentinel strings
na_values_keep_null = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
                       '-NaN', '-nan', '1.#IND', '1.#QNAN',
                       '<NA>', 'N/A', 'NA', 'NaN', 'n/a', 'nan']

csv = pl.scan_csv(filepath)
csv_with_null = csv.with_columns(
    # for every Utf8 column, replace sentinel strings with null
    pl.when(pl.col(pl.Utf8).is_in(na_values_keep_null))
    .then(None)
    .otherwise(pl.col(pl.Utf8))
    .name.keep()
)
na_values_keep_null is the same list as pandas' default value.
@ritchie46 is it possible this feature is already implemented even though the documentation is outdated?
pl.read_csv(r"nan_test.csv", null_values=['n/a', "#NA"]) is giving me the same results as the code I posted above. I think the docs might be out of date: https://github.com/pola-rs/polars/blob/33b03c931dd4eee4568ca95e4617e95515479401/py-polars/polars/io.py#L140
I'll post a PR fixing the docs if you agree with the above.
Yes, that's outdated
Not sure if it is useful to cover all the cases, but multiple NA values for a specific column are not supported:
Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.
In [104]: ! cat /tmp/test.csv
a,b,c
a,NA,c
g,h,FOO
FOO,KK,NA
In [105]: pl.read_csv("/tmp/test.csv", null_values={"b": "NA", "c": "FOO", "a": "KK"})
Out[105]:
shape: (3, 3)
┌─────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪══════╪══════╡
│ a ┆ null ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g ┆ h ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ FOO ┆ KK ┆ NA │
└─────┴──────┴──────┘
In [106]: pl.read_csv("/tmp/test.csv", null_values=["NA", "FOO", "KK"])
Out[106]:
shape: (3, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪══════╪══════╡
│ a ┆ null ┆ c │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g ┆ h ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
In [103]: pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FOO"], "a": ["KK"]})
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-103-e872a61f70fe> in <module>
----> 1 pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FOO"], "a": ["KK"]})
~/software/polars/py-polars/polars/utils.py in wrapper(*args, **kwargs)
293 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
294 _rename_kwargs(fn.__name__, kwargs, aliases)
--> 295 return fn(*args, **kwargs)
296
297 return wrapper
~/software/polars/py-polars/polars/io.py in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
384 file, encoding=encoding, use_pyarrow=False, **storage_options
385 ) as data:
--> 386 df = DataFrame._read_csv(
387 file=data,
388 has_header=has_header,
~/software/polars/py-polars/polars/internals/dataframe/frame.py in _read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
609 projection, columns = handle_projection_columns(columns)
610
--> 611 self._df = PyDataFrame.read_csv(
612 file,
613 infer_schema_length,
RuntimeError: BindingsError: "could not extract value from null_values argument"
Is there a simple way to opt out of null parsing? Sometimes people do want empty fields parsed as empty strings.
@dclong: as of https://github.com/pola-rs/polars/pull/6203, you can pass missing_utf8_is_empty_string = True when reading a CSV 👍
Closing, as we do support multiple nulls - and fixed the docs for it ;) (Different nulls for different cols could be a new feature request, if requested).