polars icon indicating copy to clipboard operation
polars copied to clipboard

More robust parsing of CSV files with mixed line endings

Open cdhf opened this issue 2 years ago • 4 comments

This has been tested with the current polars version.

I have a measurement instrument that outputs malformed CSV files. The header line is ending with a carriage return, while all other lines end with a newline. CsvReader parses this to a dataframe that takes the column names from the first two lines.

Pandas is parsing this file correctly, also for example the csv crate is parsing these file correctly. It would be great if polars could handle this case in a more robust way.

cdhf avatar Dec 06 '22 20:12 cdhf

Hey! Have you got a minimal example of what is not working in your case? See the issue form when filing a new issue.

ritchie46 avatar Dec 07 '22 08:12 ritchie46

Yes of course, sorry for forgetting example.csv

cdhf avatar Dec 08 '22 10:12 cdhf

I would recommend that you fix your broken CSV files. Broken CSV files can silently corrupt your data.

Pandas is also not working correctly with them (columns are shifted as there is an additional ";" on the values lines compared with the header).

In [95]: pd.read_csv("example.csv", engine="python", sep=";")
Out[95]: 
   Nr  StrLMin  StrLMean  StrLMax   AccMin  AccMean  AccMax   FocusMin  FocusMean  FocusMax  MagnMin  MagnMean  MagnMax   30barMin  30barMean  30barMax  DruckKMin  DruckKMean  DruckKMax
1   0        0         5       14       19       22       0        861      65535         0     2938      3004        0         13         17      1017       1017        1017        NaN
2   0        0         5       14       19       22       0        861      65535         0     2938      3004        0         13         17      1017       1017        1017        NaN

In [96]: pd.read_csv("example.csv", engine="c", sep=";")
Out[96]: 
   Nr  StrLMin  StrLMean  StrLMax   AccMin  AccMean  AccMax   FocusMin  FocusMean  FocusMax  MagnMin  MagnMean  MagnMax   30barMin  30barMean  30barMax  DruckKMin  DruckKMean  DruckKMax
1   0        0         5       14       19       22       0        861      65535         0     2938      3004        0         13         17      1017       1017        1017        NaN
2   0        0         5       14       19       22       0        861      65535         0     2938      3004        0         13         17      1017       1017        1017        NaN

Fix your malformed CSV file with awk.

# Split records (lines) on "\r\n" or "\r": -v 'RS=\r\n|\r'
# Split fields (columns) on ";" followed with zero or more spaces:  -F ';[ ]*'
# Glue all columns together again with just one ";" as separator: -v 'OFS=;'
# Remove last column (if last character on a line is ";"): if ($NF == "") { NF -= 1 };
# Reassign column 1 to column one ($1 = $1), so whole line ($0) is build up again with one ";" as column separator. 
awk -F ';[ ]*' -v 'RS=\r\n|\r' -v 'OFS=;' '{ if ($NF == "") { NF -= 1 }; $1 = $1;  print $0; }' example.csv > example.polars.csv

# Or more readable:
awk \
    -F ';[ ]*' \
    -v 'RS=\r\n|\r' \
    -v 'OFS=;' \
    '
    {
        if ($NF == "") {
            # Remove last column if it was empty (line ended with ";"). (NF =Number of fields).
            NF -= 1;
        };

        # Reassign column 1 to itself, so full line ($0) is rebuild with output field separator (OFS) ";". 
        $1 = $1;

        # Print whole line with output field separator (OFS) ";" between all columns.
        print $0;
    }' \
    example.csv \
  > example.polars.csv
$ cat example.polars.csv
Nr;StrLMin;StrLMean;StrLMax;AccMin;AccMean;AccMax;FocusMin;FocusMean;FocusMax;MagnMin;MagnMean;MagnMax;30barMin;30barMean;30barMax;DruckKMin;DruckKMean;DruckKMax
1;0;0;5;14;19;22;0;861;65535;0;2938;3004;0;13;17;1017;1017;1017
2;0;0;5;14;19;22;0;861;65535;0;2938;3004;0;13;17;1017;1017;1017
In [93]: pd.read_csv("example.polars.csv", engine="python", sep=";")
Out[93]: 
   Nr  StrLMin  StrLMean  StrLMax  AccMin  AccMean  AccMax  FocusMin  FocusMean  FocusMax  MagnMin  MagnMean  MagnMax  30barMin  30barMean  30barMax  DruckKMin  DruckKMean  DruckKMax
0   1        0         0        5      14       19      22         0        861     65535        0      2938     3004         0         13        17       1017        1017       1017
1   2        0         0        5      14       19      22         0        861     65535        0      2938     3004         0         13        17       1017        1017       1017

In [94]: pl.read_csv("example.polars.csv", sep=";")
Out[94]: 
shape: (2, 19)
┌─────┬─────────┬──────────┬─────────┬────────┬─────────┬────────┬──────────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬───────────┬──────────┬───────────┬────────────┬───────────┐
│ Nr  ┆ StrLMin ┆ StrLMean ┆ StrLMax ┆ AccMin ┆ AccMean ┆ AccMax ┆ FocusMin ┆ FocusMean ┆ FocusMax ┆ MagnMin ┆ MagnMean ┆ MagnMax ┆ 30barMin ┆ 30barMean ┆ 30barMax ┆ DruckKMin ┆ DruckKMean ┆ DruckKMax │
│ --- ┆ ---     ┆ ---      ┆ ---     ┆ ---    ┆ ---     ┆ ---    ┆ ---      ┆ ---       ┆ ---      ┆ ---     ┆ ---      ┆ ---     ┆ ---      ┆ ---       ┆ ---      ┆ ---       ┆ ---        ┆ ---       │
│ i64 ┆ i64     ┆ i64      ┆ i64     ┆ i64    ┆ i64     ┆ i64    ┆ i64      ┆ i64       ┆ i64      ┆ i64     ┆ i64      ┆ i64     ┆ i64      ┆ i64       ┆ i64      ┆ i64       ┆ i64        ┆ i64       │
╞═════╪═════════╪══════════╪═════════╪════════╪═════════╪════════╪══════════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪═══════════╪══════════╪═══════════╪════════════╪═══════════╡
│ 1   ┆ 0       ┆ 0        ┆ 5       ┆ 14     ┆ 19      ┆ 22     ┆ 0        ┆ 861       ┆ 65535    ┆ 0       ┆ 2938     ┆ 3004    ┆ 0        ┆ 13        ┆ 17       ┆ 1017      ┆ 1017       ┆ 1017      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 0       ┆ 0        ┆ 5       ┆ 14     ┆ 19      ┆ 22     ┆ 0        ┆ 861       ┆ 65535    ┆ 0       ┆ 2938     ┆ 3004    ┆ 0        ┆ 13        ┆ 17       ┆ 1017      ┆ 1017       ┆ 1017      │
└─────┴─────────┴──────────┴─────────┴────────┴─────────┴────────┴──────────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴───────────┴──────────┴───────────┴────────────┴───────────┘

ghuls avatar Dec 08 '22 13:12 ghuls

I can understand if correctly parsing such malformed files is out of scope for the library. Would it be possible, however, that the library returns an error instead of silently parsing incorrectly?

cdhf avatar Dec 13 '22 09:12 cdhf

I had an issue parsing CSV files saved with polars when trying to read the saved file using tensorflow's CsvDataset. If I write the file using Python's CSV model with the excel (default) or excel-tab dialects, there is no issue.

For reference, CsvDataset conforms to RFC 4180

sirfz avatar Aug 27 '23 14:08 sirfz