polars icon indicating copy to clipboard operation
polars copied to clipboard

[feature request] Handle several null values in null_values CSV parser argument

Open olivier-lacroix opened this issue 4 years ago • 7 comments

Since https://github.com/pola-rs/polars/issues/848, polars CSV parser can detect a string corresponding to a null value and replace it during parsing.

It is quite common to have several strings that should be interpreted as nulls (in the same column), which currently polars cannot handle. It would be great if the following behaviour was implemented:

null_values
        Values to interpret as null values. You can provide a:
        - str -> all values encountered equal to this string will be nulls
        - List[str] -> all values encountered equal to these strings will be nulls
        - Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.

This would change the behaviour when a List[str] argument is provided; current behaviour could be achieved with a List[List[str]] argument, if desired. But I am not sure when that path would be used in practise: The Dict[str, List[str]] path seems a lot more practical to me.

olivier-lacroix avatar Jul 12 '21 00:07 olivier-lacroix

this would be very useful to replace pandas.read_csv with a lazy scan_csv and still get the same results.

luk-f-a avatar Aug 24 '22 15:08 luk-f-a

this would be very useful to replace pandas.read_csv with a lazy scan_csv and still get the same results.

Note that we support most of these already.

ritchie46 avatar Aug 24 '22 15:08 ritchie46

for anyone reading this issue and looking for a way to reproduce pandas behavior, you can use the following:

na_values_keep_null = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN',
                            '<NA>', 'N/A', 'NA', 'NaN', 'n/a', 'nan']
csv =  pl.scan_csv(filepath)
csv_with_null = csv.with_column(
    pl.when(pl.col(pl.Utf8).is_in(na_values_keep_null))
                 .then(None)
                 .otherwise(pl.col(pl.Utf8))
                 .keep_name()
             )

na_values_keep_null is the same list as pandas' default value.

luk-f-a avatar Aug 25 '22 08:08 luk-f-a

@ritchie46 is it possible this feature is already implemented even if the documentation is outdate?

pl.read_csv(r"nan_test.csv", null_values=['n/a', "#NA"]) is giving me the same results as the code I posted above. I think the docs might be out of date: https://github.com/pola-rs/polars/blob/33b03c931dd4eee4568ca95e4617e95515479401/py-polars/polars/io.py#L140

I'll post a PR fixing the docs if you agree with the above.

luk-f-a avatar Aug 25 '22 09:08 luk-f-a

Yes, that's outdated

ritchie46 avatar Aug 25 '22 09:08 ritchie46

Not sure if it is useful to have to cover all cases, but multiple NA values for a specific column is not supported.

Dict[str, List[str]] -> A dictionary that maps column name to a list of null value strings.
In [104]: ! cat /tmp/test.csv
a,b,c
a,NA,c
g,h,FOO
FOO,KK,NA

In [105]: pl.read_csv("/tmp/test.csv", null_values={"b": "NA", "c": "FOO", "a":
     ...: "KK"})
Out[105]: 
shape: (3, 3)
┌─────┬──────┬──────┐
│ a   ┆ b    ┆ c    │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ a   ┆ null ┆ c    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g   ┆ h    ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ FOO ┆ KK   ┆ NA   │
└─────┴──────┴──────┘

In [106]: pl.read_csv("/tmp/test.csv", null_values=["NA", "FOO", "KK"])
Out[106]: 
shape: (3, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ a    ┆ null ┆ c    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ g    ┆ h    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘


In [103]: pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FO
     ...: O"], "a": ["KK"]})
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-103-e872a61f70fe> in <module>
----> 1 pl.read_csv("/tmp/test.csv", null_values={"b": ["NA", "KK"], "c": ["FOO"], "a": ["KK"]})

~/software/polars/py-polars/polars/utils.py in wrapper(*args, **kwargs)
    293         def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    294             _rename_kwargs(fn.__name__, kwargs, aliases)
--> 295             return fn(*args, **kwargs)
    296 
    297         return wrapper

~/software/polars/py-polars/polars/io.py in read_csv(file, has_header, columns, new_columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
    384         file, encoding=encoding, use_pyarrow=False, **storage_options
    385     ) as data:
--> 386         df = DataFrame._read_csv(
    387             file=data,
    388             has_header=has_header,

~/software/polars/py-polars/polars/internals/dataframe/frame.py in _read_csv(cls, file, has_header, columns, sep, comment_char, quote_char, skip_rows, dtypes, null_values, ignore_errors, parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char)
    609         projection, columns = handle_projection_columns(columns)
    610 
--> 611         self._df = PyDataFrame.read_csv(
    612             file,
    613             infer_schema_length,

RuntimeError: BindingsError: "could not extract value from null_values argument"

ghuls avatar Oct 21 '22 23:10 ghuls

Is there a simple way to exclude null values? Sometimes, people do want to parse empty fields as empty strings.

dclong avatar Jan 02 '23 01:01 dclong

@dclong: as of https://github.com/pola-rs/polars/pull/6203, you can pass missing_utf8_is_empty_string = True when reading a CSV 👍

alexander-beedie avatar Mar 27 '23 06:03 alexander-beedie

Closing, as we do support multiple nulls - and fixed the docs for it ;) (Different nulls for different cols could be a new feature request, if requested).

alexander-beedie avatar Mar 27 '23 06:03 alexander-beedie