polars
polars copied to clipboard
Smarter fields inferencing for CSV input
Problem description
Hi, I do not know if it's right to write there. If I'm wrong, I apologize.
I'm using nushell and its polars dataframe features. When nushell import a CSV as a dataframe, it uses the polars crate.
Daily I work with data where some columns have only numerical values in the cells, but which are actually string fields.
A typical case is that of administrative codes, of regions, provinces and cities (in the table below codice_regione, codice_provincia and codice_comune fields). In Italy, the city of Losine has the code 0870, and obviously it's a string, and it must remain a string, otherwise I would lose the chance to JOIN this data with other data.
In nushell I run the command below to create a dataframe, starting from the below CSV
let df = open-df --infer-schema 200000 input.csv
and the codice_regione, codice_provincia and codice_comune fields are no longer strings, they become all numbers and the cell values have changed (in example the code of Losine becomes 870).
Four fields are modified and no longer usable.
I know, octal numbers exist, but in 99% of cases (I'm talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes. And I know I can force all to strings, but then I must correct all numeric fields.
If it were possible, it would be nice to add an inferencing subrule for all fields mapped as numbers: if there are cells that start with a zero, not followed by a , or ., set that inferenced field as string and not as number.
Unfortunately I don't know how to write the code to make a PR.
Thank you

The input file
comune,codice_regione,codice_provincia,codice_comune,denominazione_comune,sigla_provincia,data_entrata_in_carica
030151360,03,015,1360,POLPENAZZE DEL GARDA,BS,13/10/2021
030120530,03,012,0530,CAROBBIO DEGLI ANGELI,BG,04/10/2021
020040580,02,004,0580,SAINT-DENIS,AO,23/09/2020
030150870,03,015,0870,LOSINE,BS,04/10/2021
190480090,19,048,0090,CAPO D'ORLANDO,ME,27/10/2021
It looks like it's controlled by the regex here: https://github.com/pola-rs/polars/blob/master/polars/polars-io/src/csv/utils.rs#L89
It could be changed to
static INTEGER_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^\s*-?([^0]\d+|0)$").unwrap());
before:
>>> pl.read_csv("leading-zero.csv")
shape: (5, 7)
┌───────────┬────────────────┬──────────────────┬───────────────┬───────────────────────┬─────────────────┬────────────────────────┐
│ comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica │
│ --- | --- | --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | i64 | str | str | str │
╞═══════════╪════════════════╪══════════════════╪═══════════════╪═══════════════════════╪═════════════════╪════════════════════════╡
│ 30151360 | 3 | 15 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 │
after:
>>> pl.read_csv("leading-zero.csv")
shape: (5, 7)
┌───────────┬────────────────┬──────────────────┬───────────────┬───────────────────────┬─────────────────┬────────────────────────┐
│ comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | str │
╞═══════════╪════════════════╪══════════════════╪═══════════════╪═══════════════════════╪═════════════════╪════════════════════════╡
│ 030151360 | 03 | 015 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 │
Perhaps the behaviour could be controlled with .read_csv() parameter.
I'm not able to contribute to code.
But it would be great to have it in polars.
Thank you