polars
Reading CSV files with variable number of columns not supported
Are you using Python or Rust?
Python
Which feature gates did you use?
This can be ignored by Python users.
What version of polars are you using?
0.9.12
What operating system are you using polars on?
macOS
Describe your bug.
When reading a CSV file with a variable number of columns, polars assumes all rows have the number of columns inferred from the first row (?) and skips parsing any subsequent columns. Providing the columns to be parsed explicitly via the columns parameter results in an error:
RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))
What are the steps to reproduce the behavior?
Dataset (test.csv):
a,b,c
a,b,c,d,e,f
g,h,i,j,k
Example 1 (no error but reads only 3 columns instead of 6)
import polars as pl
df = pl.read_csv("/tmp/test.csv", has_headers=False)
Example 2 (results in error)
import polars as pl
df = pl.read_csv("/tmp/test.csv", has_headers=False, infer_schema_length=0,
columns=["column_1", "column_2", "column_3", "column_4", "column_5", "column_6"])
What is the actual behavior?
Columns beyond the ones inferred from the first data row are not parsed.
What is the expected behavior?
All columns are parsed but are set to NaN/None for rows that don't have data for these columns.
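The expected behavior can be sketched in plain Python using only the standard library. Note this is an illustration of the desired semantics, not polars code; read_ragged_csv is a hypothetical helper:

```python
import csv
import io

def read_ragged_csv(text):
    """Parse CSV text whose rows may have different lengths,
    filling missing trailing fields with None."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    # Build one list per column; short rows contribute None
    # for the fields they are missing.
    return {
        f"column_{i + 1}": [row[i] if i < len(row) else None for row in rows]
        for i in range(width)
    }

data = read_ragged_csv("a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n")
print(data["column_4"])  # → [None, 'd', 'j']
```

All six columns are produced, with None wherever a row had no value for that column.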
You can use xsv fixlengths to fix those kinds of broken CSV files:
$ cat test.csv
a,b,c
a,b,c,d,e,f
g,h,i,j,k
$ xsv fixlengths test.csv
a,b,c,,,
a,b,c,d,e,f
g,h,i,j,k,
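If xsv is not available, the same preprocessing step can be approximated with Python's standard csv module. This is a sketch; fixlengths below is a hypothetical helper named after the xsv command, not part of xsv or polars:

```python
import csv
import io

def fixlengths(text):
    """Pad every record with empty fields up to the length of the
    longest record, mirroring the effect of `xsv fixlengths`."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Append empty strings so every row reaches the max width.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()

print(fixlengths("a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"))
```

The output matches the xsv fixlengths result shown above, and the fixed text can then be read with pl.read_csv.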
Many thanks! That's a very useful tool I wasn't aware of.
When there is no header present in the CSV file, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum line length of the lines we scan for dtype inference.
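The suggested change can be sketched in Python, though the actual inference code lives in Rust (utils.rs); infer_column_names below is illustrative only, and the naive comma split stands in for the real parser's quote-aware field counting:

```python
def infer_column_names(lines, infer_schema_length=100):
    """Generate column_1 .. column_n from the widest row among the
    lines scanned for dtype inference, not just the first line."""
    # Naive split on commas; the real parser handles quoting/escaping.
    width = max(len(line.split(",")) for line in lines[:infer_schema_length])
    return [f"column_{i}" for i in range(1, width + 1)]

print(infer_column_names(["a,b,c", "a,b,c,d,e,f", "g,h,i,j,k"]))
# → ['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6']
```

With this approach the test file above would yield six columns instead of three.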
@ritchie46 Where is that code located? For https://github.com/pola-rs/polars/issues/1492 it would probably also be better if the column names could be retrieved, as the code I currently have to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).
Here it is: https://github.com/pola-rs/polars/blob/3d99b45a997c981c36e1c14673491eb2b5f2a8ba/polars/polars-io/src/csv_core/utils.rs#L141
I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth with regard to the number of fields.
For #1492 it would probably also be better if the column names could be retrieved, as the code I currently have to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).
Only when the column names are overwritten and there is no header should we modify it, I think. In the other cases the dtypes dict should be correct, right? So I believe we have all the information needed to overwrite the new_names with the auto-generated ones.
Not when the user provides new_columns.
I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan for inferring the number of columns. Is it about exposing that argument in the Python API?
I think I already fixed this issue.
Edit: not entirely certain anymore
Okay, I'd be happy to see the commit that you think might have fixed the issue.
I believe I fixed it here: https://github.com/pola-rs/polars/commit/ee26601f880c9367565303859f4ed41aa2c42339
Any updates on this? It still doesn't work using infer_schema_length=0 or =None.
Is this issue resolved? Can I take this and open a PR?
@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it'd be closed, so I don't see why not.
One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python.
Thanks, I'll take this issue and open a PR.