polars
polars copied to clipboard
More robust parsing of CSV files with mixed line endings
This has been tested with the current polars version.
I have a measurement instrument that outputs malformed CSV files. The header line is ending with a carriage return, while all other lines end with a newline. CsvReader parses this to a dataframe that takes the column names from the first two lines.
Pandas is parsing this file correctly, also for example the csv crate is parsing these file correctly. It would be great if polars could handle this case in a more robust way.
Hey! Have you got a minimal example of what is not working in your case? See the issue form when filing a new issue.
Yes of course, sorry for forgetting example.csv
I would recommend that you fix your broken CSV files. Broken CSV files can silently corrupt your data.
Pandas is also not working correctly with them (columns are shifted as there is an additional ";" on the values lines compared with the header).
In [95]: pd.read_csv("example.csv", engine="python", sep=";")
Out[95]:
Nr StrLMin StrLMean StrLMax AccMin AccMean AccMax FocusMin FocusMean FocusMax MagnMin MagnMean MagnMax 30barMin 30barMean 30barMax DruckKMin DruckKMean DruckKMax
1 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017 NaN
2 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017 NaN
In [96]: pd.read_csv("example.csv", engine="c", sep=";")
Out[96]:
Nr StrLMin StrLMean StrLMax AccMin AccMean AccMax FocusMin FocusMean FocusMax MagnMin MagnMean MagnMax 30barMin 30barMean 30barMax DruckKMin DruckKMean DruckKMax
1 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017 NaN
2 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017 NaN
Fix your malformed CSV file with awk.
# Split records (lines) on "\r\n" or "\r": -v 'RS=\r\n|\r'
# Split fields (columns) on ";" followed with zero or more spaces: -F ';[ ]*'
# Glue all columns together again with just one ";" as separator: -v 'OFS=;'
# Remove last column (if last character on a line is ";"): if ($NF == "") { NF -= 1 };
# Reassign column 1 to column one ($1 = $1), so whole line ($0) is build up again with one ";" as column separator.
awk -F ';[ ]*' -v 'RS=\r\n|\r' -v 'OFS=;' '{ if ($NF == "") { NF -= 1 }; $1 = $1; print $0; }' example.csv > example.polars.csv
# Or more readable:
awk \
-F ';[ ]*' \
-v 'RS=\r\n|\r' \
-v 'OFS=;' \
'
{
if ($NF == "") {
# Remove last column if it was empty (line ended with ";"). (NF =Number of fields).
NF -= 1;
};
# Reassign column 1 to itself, so full line ($0) is rebuild with output field separator (OFS) ";".
$1 = $1;
# Print whole line with output field separator (OFS) ";" between all columns.
print $0;
}' \
example.csv \
> example.polars.csv
$ cat example.polars.csv
Nr;StrLMin;StrLMean;StrLMax;AccMin;AccMean;AccMax;FocusMin;FocusMean;FocusMax;MagnMin;MagnMean;MagnMax;30barMin;30barMean;30barMax;DruckKMin;DruckKMean;DruckKMax
1;0;0;5;14;19;22;0;861;65535;0;2938;3004;0;13;17;1017;1017;1017
2;0;0;5;14;19;22;0;861;65535;0;2938;3004;0;13;17;1017;1017;1017
In [93]: pd.read_csv("example.polars.csv", engine="python", sep=";")
Out[93]:
Nr StrLMin StrLMean StrLMax AccMin AccMean AccMax FocusMin FocusMean FocusMax MagnMin MagnMean MagnMax 30barMin 30barMean 30barMax DruckKMin DruckKMean DruckKMax
0 1 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017
1 2 0 0 5 14 19 22 0 861 65535 0 2938 3004 0 13 17 1017 1017 1017
In [94]: pl.read_csv("example.polars.csv", sep=";")
Out[94]:
shape: (2, 19)
┌─────┬─────────┬──────────┬─────────┬────────┬─────────┬────────┬──────────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬───────────┬──────────┬───────────┬────────────┬───────────┐
│ Nr ┆ StrLMin ┆ StrLMean ┆ StrLMax ┆ AccMin ┆ AccMean ┆ AccMax ┆ FocusMin ┆ FocusMean ┆ FocusMax ┆ MagnMin ┆ MagnMean ┆ MagnMax ┆ 30barMin ┆ 30barMean ┆ 30barMax ┆ DruckKMin ┆ DruckKMean ┆ DruckKMax │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪══════════╪═════════╪════════╪═════════╪════════╪══════════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪═══════════╪══════════╪═══════════╪════════════╪═══════════╡
│ 1 ┆ 0 ┆ 0 ┆ 5 ┆ 14 ┆ 19 ┆ 22 ┆ 0 ┆ 861 ┆ 65535 ┆ 0 ┆ 2938 ┆ 3004 ┆ 0 ┆ 13 ┆ 17 ┆ 1017 ┆ 1017 ┆ 1017 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ 0 ┆ 5 ┆ 14 ┆ 19 ┆ 22 ┆ 0 ┆ 861 ┆ 65535 ┆ 0 ┆ 2938 ┆ 3004 ┆ 0 ┆ 13 ┆ 17 ┆ 1017 ┆ 1017 ┆ 1017 │
└─────┴─────────┴──────────┴─────────┴────────┴─────────┴────────┴──────────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴───────────┴──────────┴───────────┴────────────┴───────────┘
I can understand if correctly parsing such malformed files is out of scope for the library. Would it be possible, however, that the library returns an error instead of silently parsing incorrectly?
I had an issue parsing CSV files saved with polars when trying to read the saved file using tensorflow's CsvDataset. If I write the file using Python's CSV model with the excel (default) or excel-tab dialects, there is no issue.
For reference, CsvDataset conforms to RFC 4180