polars icon indicating copy to clipboard operation
polars copied to clipboard

Smarter fields inferencing for CSV input

Open aborruso opened this issue 2 years ago • 2 comments
trafficstars

Problem description

Hi, I do not know if it's right to write there. If I'm wrong, I apologize.

I'm using nushell and its polars dataframe features. When nushell import a CSV as a dataframe, it uses the polars crate.

Daily I work with data where some columns have only numerical values in the cells, but which are actually string fields.

A typical case is that of administrative codes, of regions, provinces and cities (in the table below codice_regione, codice_provincia and codice_comune fields). In Italy, the city of Losine has the code 0870, and obviously it's a string, and it must remain a string, otherwise I would lose the chance to JOIN this data with other data.

In nushell I run the command below to create a dataframe, starting from the below CSV

let df = open-df --infer-schema 200000 input.csv

and the codice_regione, codice_provincia and codice_comune fields are no longer strings, they become all numbers and the cell values have changed (in example the code of Losine becomes 870). Four fields are modified and no longer usable.

I know, octal numbers exist, but in 99% of cases (I'm talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes. And I know I can force all to strings, but then I must correct all numeric fields.

If it were possible, it would be nice to add an inferencing subrule for all fields mapped as numbers: if there are cells that start with a zero, not followed by a , or ., set that inferenced field as string and not as number.

Unfortunately I don't know how to write the code to make a PR.

Thank you

image

The input file

comune,codice_regione,codice_provincia,codice_comune,denominazione_comune,sigla_provincia,data_entrata_in_carica
030151360,03,015,1360,POLPENAZZE DEL GARDA,BS,13/10/2021
030120530,03,012,0530,CAROBBIO DEGLI ANGELI,BG,04/10/2021
020040580,02,004,0580,SAINT-DENIS,AO,23/09/2020
030150870,03,015,0870,LOSINE,BS,04/10/2021
190480090,19,048,0090,CAPO D'ORLANDO,ME,27/10/2021

aborruso avatar Dec 28 '22 18:12 aborruso

It looks like it's controlled by the regex here: https://github.com/pola-rs/polars/blob/master/polars/polars-io/src/csv/utils.rs#L89

It could be changed to

static INTEGER_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^\s*-?([^0]\d+|0)$").unwrap());

before:

>>> pl.read_csv("leading-zero.csv")
shape: (5, 7)
┌───────────┬────────────────┬──────────────────┬───────────────┬───────────────────────┬─────────────────┬────────────────────────┐
│ comune    | codice_regione | codice_provincia | codice_comune | denominazione_comune  | sigla_provincia | data_entrata_in_carica │
│ ---       | ---            | ---              | ---           | ---                   | ---             | ---                    │
│ i64       | i64            | i64              | i64           | str                   | str             | str                    │
╞═══════════╪════════════════╪══════════════════╪═══════════════╪═══════════════════════╪═════════════════╪════════════════════════╡
│ 30151360  | 3              | 15               | 1360          | POLPENAZZE DEL GARDA  | BS              | 13/10/2021             │

after:

>>> pl.read_csv("leading-zero.csv")
shape: (5, 7)
┌───────────┬────────────────┬──────────────────┬───────────────┬───────────────────────┬─────────────────┬────────────────────────┐
│ comune    | codice_regione | codice_provincia | codice_comune | denominazione_comune  | sigla_provincia | data_entrata_in_carica │
│ ---       | ---            | ---              | ---           | ---                   | ---             | ---                    │
│ str       | str            | str              | str           | str                   | str             | str                    │
╞═══════════╪════════════════╪══════════════════╪═══════════════╪═══════════════════════╪═════════════════╪════════════════════════╡
│ 030151360 | 03             | 015              | 1360          | POLPENAZZE DEL GARDA  | BS              | 13/10/2021             │

Perhaps the behaviour could be controlled with .read_csv() parameter.

cmdlineluser avatar Dec 28 '22 20:12 cmdlineluser

I'm not able to contribute to code.

But it would be great to have it in polars.

Thank you

aborruso avatar Dec 28 '22 20:12 aborruso