[feature request] Handle several null values in null_values CSV parser argument
Since https://github.com/pola-rs/polars/issues/848, polars CSV parser can detect a string corresponding to a null value and replace it during parsing.
It is quite common to have several strings that should be interpreted as nulls (in the same column), which currently polars cannot handle. It would be great if the following behaviour was implemented:
null_values
Values to interpret as null values. You can provide a:
- str -> all values encountered equal to this string will be nulls
- List[str] -> all values encountered equal to these strings will be nulls
- Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.
This would change the behaviour when a List[str] argument is provided; the current behaviour could be achieved with a List[List[str]] argument, if desired. But I am not sure when that path would be used in practice: the Dict[str, List[str]] path seems a lot more practical to me.
This would be very useful to replace pandas.read_csv with a lazy scan_csv and still get the same results.
Note that we support most of these already.
For anyone reading this issue and looking for a way to reproduce pandas' behaviour, you can use the following:
import polars as pl

# pandas' default NA sentinel strings
na_values_keep_null = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
                       '-NaN', '-nan', '1.#IND', '1.#QNAN',
                       '<NA>', 'N/A', 'NA', 'NaN', 'n/a', 'nan']

csv = pl.scan_csv(filepath)
csv_with_null = csv.with_columns(
    # for every Utf8 column, replace sentinel strings with null
    pl.when(pl.col(pl.Utf8).is_in(na_values_keep_null))
    .then(None)
    .otherwise(pl.col(pl.Utf8))
    .name.keep()
)
na_values_keep_null is the same list as pandas' default value.
@ritchie46 is it possible this feature is already implemented even though the documentation is outdated?
pl.read_csv(r"nan_test.csv", null_values=['n/a', "#NA"]) is giving me the same results as the code I posted above. I think the docs might be out of date: https://github.com/pola-rs/polars/blob/33b03c931dd4eee4568ca95e4617e95515479401/py-polars/polars/io.py#L140
I'll post a PR fixing the docs if you agree with the above.
Yes, that's outdated
Not sure if it is useful to cover all the cases, but multiple NA values for a specific column are not supported:
Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.
In [104]: ! cat /tmp/test.csv
a,b,c
a,NA,c
g,h,FOO
FOO,KK,NA
In [105]: pl.read_csv("/tmp/test.csv", null_values={"b": "NA", "c": "FOO", "a": "KK"})
Out[105]:
shape: (3, 3)
┌─────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪══════╪══════╡
│ a ┆ null ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g ┆ h ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ FOO ┆ KK ┆ NA │
└─────┴──────┴──────┘
In [106]: pl.read_csv("/tmp/test.csv", null_values=["NA", "FOO", "KK"])
Out[106]:
shape: (3, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪══════╪══════╡
│ a ┆ null ┆ c │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g ┆ h ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
In [103]: pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FOO"], "a": ["KK"]})
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-103-e872a61f70fe> in <module>
----> 1 pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FOO"], "a": ["KK"]})
~/software/polars/py-polars/polars/utils.py in wrapper(*args, **kwargs)
293 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
294 _rename_kwargs(fn.__name__, kwargs, aliases)
--> 295 return fn(*args, **kwargs)
296
297 return wrapper
~/software/polars/py-polars/polars/io.py in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
384 file, encoding=encoding, use_pyarrow=False, **storage_options
385 ) as data:
--> 386 df = DataFrame._read_csv(
387 file=data,
388 has_header=has_header,
~/software/polars/py-polars/polars/internals/dataframe/frame.py in _read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
609 projection, columns = handle_projection_columns(columns)
610
--> 611 self._df = PyDataFrame.read_csv(
612 file,
613 infer_schema_length,
RuntimeError: BindingsError: "could not extract value from null_values argument"
Is there a simple way to opt out of null parsing? Sometimes people do want empty fields parsed as empty strings.
@dclong: as of https://github.com/pola-rs/polars/pull/6203, you can pass missing_utf8_is_empty_string = True when reading a CSV 👍
Closing, as we do support multiple nulls - and fixed the docs for it ;) (Different nulls for different cols could be a new feature request, if requested).