polars
Reading CSV files with variable number of columns not supported
Are you using Python or Rust?
Python
Which feature gates did you use?
This can be ignored by Python users.
What version of polars are you using?
0.9.12
What operating system are you using polars on?
macOS
Describe your bug.
When reading a CSV file with a variable number of columns, polars assumes all rows have the number of columns inferred from the first row (?) and skips parsing any subsequent columns. Providing the columns to be parsed explicitly via the columns parameter results in an error:
RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))
What are the steps to reproduce the behavior?
Dataset (test.csv):
a,b,c
a,b,c,d,e,f
g,h,i,j,k
Example 1 (no error but reads only 3 columns instead of 6)
import polars as pl
df = pl.read_csv("/tmp/test.csv", has_headers=False)
Example 2 (results in error)
import polars as pl
df = pl.read_csv("/tmp/test.csv", has_headers=False, infer_schema_length=0,
columns=["column_1", "column_2", "column_3", "column_4", "column_5", "column_6"])
What is the actual behavior?
Columns beyond the ones inferred from the first data row are not parsed.
What is the expected behavior?
All columns are parsed but are set to NaN/None for rows that don't have data for these columns.
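The expected behavior can be sketched in plain Python using only the standard library. Note this is an illustration of the desired semantics, not polars code; read_ragged_csv is a hypothetical helper:

```python
import csv
import io

def read_ragged_csv(text):
    """Parse CSV text whose rows may have different lengths,
    filling missing trailing fields with None."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    # Build one list per column; short rows contribute None
    # for the fields they are missing.
    return {
        f"column_{i + 1}": [row[i] if i < len(row) else None for row in rows]
        for i in range(width)
    }

data = read_ragged_csv("a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n")
print(data["column_4"])  # → [None, 'd', 'j']
```

All six columns are produced, with None wherever a row had no value for that column.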
You can use xsv fixlengths to fix those kinds of broken CSV files:
$ cat test.csv
a,b,c
a,b,c,d,e,f
g,h,i,j,k
$ xsv fixlengths test.csv
a,b,c,,,
a,b,c,d,e,f
g,h,i,j,k,
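If xsv is not available, the same preprocessing step can be approximated with Python's standard csv module. This is a sketch; fixlengths below is a hypothetical helper named after the xsv command, not part of xsv or polars:

```python
import csv
import io

def fixlengths(text):
    """Pad every record with empty fields up to the length of the
    longest record, mirroring the effect of `xsv fixlengths`."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Append empty strings so every row reaches the max width.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()

print(fixlengths("a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"))
```

The output matches the xsv fixlengths result shown above, and the fixed text can then be read with pl.read_csv.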
Many thanks! That's a very useful tool I wasn't aware of.
When there is no header present in the CSV file, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum line length of the lines we scan for dtype inference.
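The suggested change can be sketched in Python, though the actual inference code lives in Rust (utils.rs); infer_column_names below is illustrative only, and the naive comma split stands in for the real parser's quote-aware field counting:

```python
def infer_column_names(lines, infer_schema_length=100):
    """Generate column_1 .. column_n from the widest row among the
    lines scanned for dtype inference, not just the first line."""
    # Naive split on commas; the real parser handles quoting/escaping.
    width = max(len(line.split(",")) for line in lines[:infer_schema_length])
    return [f"column_{i}" for i in range(1, width + 1)]

print(infer_column_names(["a,b,c", "a,b,c,d,e,f", "g,h,i,j,k"]))
# → ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
```

With this approach the test file above would yield six columns instead of three.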
@ritchie46 Where is that code located? For https://github.com/pola-rs/polars/issues/1492 it would probably also be better if the column names could be retrieved, as the code I currently have to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).
Here it is: https://github.com/pola-rs/polars/blob/3d99b45a997c981c36e1c14673491eb2b5f2a8ba/polars/polars-io/src/csv_core/utils.rs#L141
I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth with regard to the number of fields.
For #1492 it would probably also be better if the column names could be retrieved, as the code I currently have to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).
Only when the column names are overwritten and there is no header should we modify it, I think. In the other cases the dtypes dict should be correct, right? So I believe we have all the information needed to overwrite the new_names with the auto-generated ones.
Not when the user provides new_columns.
I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan for inferring the number of columns. Is it about exposing that argument in the Python API?
I think I already fixed this issue.
Edit: not entirely certain anymore
Okay, I'd be happy to see the commit that you think might have fixed the issue.
I believe I fixed it here: https://github.com/pola-rs/polars/commit/ee26601f880c9367565303859f4ed41aa2c42339
Any updates on this? It still doesn't work using infer_schema_length=0 or =None.
Is this issue resolved? Can I take this and open a PR?
@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it'd be closed, so I don't see why not.
One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python.
Thanks, I'll take this issue and open a PR.