
Reading CSV files with variable number of columns not supported

Open allspatial opened this issue 3 years ago • 14 comments

Are you using Python or Rust?

Python

Which feature gates did you use?

This can be ignored by Python users.

What version of polars are you using?

0.9.12

What operating system are you using polars on?

macOS

Describe your bug.

When reading a CSV file with a variable number of columns, polars assumes all rows have the number of columns inferred from the first row (?) and skips parsing any subsequent columns. Providing the columns to be parsed explicitly via the columns parameter results in an error:

RuntimeError: Any(NotFound("Unable to get field named "column_4". Valid fields: ["column_1", "column_2", "column_3"]"))

What are the steps to reproduce the behavior?

Dataset (test.csv):

a,b,c
a,b,c,d,e,f
g,h,i,j,k

Example 1 (no error but reads only 3 columns instead of 6)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_headers=False)

Example 2 (results in an error)

import polars as pl

df = pl.read_csv("/tmp/test.csv", has_header=False, infer_schema_length=0,
                 columns=["column_1", "column_2", "column_3", "column_4", "column_5", "column_6"])

What is the actual behavior?

Columns beyond the ones inferred from the first data row are not parsed.

What is the expected behavior?

All columns are parsed but are set to NaN/None for rows that don't have data for these columns.

allspatial avatar Oct 08 '21 14:10 allspatial
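
The expected behavior described above can be illustrated without polars. The following is a minimal sketch using only Python's standard csv module. read_ragged_csv is a hypothetical helper, not part of polars, and it simply pads short rows with None up to the width of the widest row.

```python
import csv
import io


def read_ragged_csv(text):
    """Parse CSV text whose rows differ in length (hypothetical helper).

    Rows shorter than the widest row are padded with None.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Width of the widest row; 0 if the input is empty.
    width = max((len(row) for row in rows), default=0)
    return [row + [None] * (width - len(row)) for row in rows]


# Same content as test.csv from the report.
data = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
padded = read_ragged_csv(data)
for row in padded:
    print(row)
```

Every row in padded has six fields, so the result could be handed to a DataFrame constructor with six columns.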

You can use xsv fixlengths to fix that kind of broken CSV file:

$ cat test.csv
a,b,c
a,b,c,d,e,f
g,h,i,j,k

$ xsv fixlengths test.csv 
a,b,c,,,
a,b,c,d,e,f
g,h,i,j,k,

ghuls avatar Oct 08 '21 14:10 ghuls

Many thanks! That's a very useful tool I wasn't aware of.

allspatial avatar Oct 08 '21 19:10 allspatial

When there is no header in the CSV file, we use the first line to determine the new column names (column_1, column_2, ..., column_n). We should probably use the maximum field count of the lines we scan for dtype inference instead.

ritchie46 avatar Oct 11 '21 14:10 ritchie46
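
The idea in the comment above (take the maximum field count over the lines scanned for inference, rather than the first line only) could look roughly like this. It is a plain-Python sketch, not the polars implementation. infer_column_names and its infer_schema_length argument are hypothetical names, chosen only to mirror the read_csv parameter.

```python
import csv
import io
from itertools import islice


def infer_column_names(text, infer_schema_length=100):
    """Generate column_1..column_n names for a header-less CSV (hypothetical helper).

    n is the largest field count among the first `infer_schema_length`
    rows; passing None scans every row.
    """
    reader = csv.reader(io.StringIO(text))
    scanned = islice(reader, infer_schema_length)
    width = max((len(row) for row in scanned), default=0)
    return [f"column_{i}" for i in range(1, width + 1)]


data = "a,b,c\na,b,c,d,e,f\ng,h,i,j,k\n"
names = infer_column_names(data)
print(names)
```

Scanning only the first row (infer_schema_length=1) yields three names, which matches the behavior reported in this issue. Scanning more rows picks up the six-field line.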

@ritchie46 Where is that code located? For https://github.com/pola-rs/polars/issues/1492 it would probably also be better if the column names could be retrieved, as the code I have now to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).

ghuls avatar Oct 12 '21 14:10 ghuls

> @ritchie46 Where is that code located?

Here it is: https://github.com/pola-rs/polars/blob/3d99b45a997c981c36e1c14673491eb2b5f2a8ba/polars/polars-io/src/csv_core/utils.rs#L141

I think only the else (no-header) branch matters in this case. If there is a header, I think that should be the source of truth for the number of fields.

> For #1492 it would probably also be better if the column names could be retrieved, as the code I have now to fix it in Python only works under specific conditions (when we get the column names as input or when they are autogenerated, but not in other cases).

I think we should only modify it when the column names are overwritten and there is no header. In the other cases the dtypes dict should be correct, right? So I believe we have all the information needed to overwrite the new_names with the auto-generated ones.

ritchie46 avatar Oct 12 '21 15:10 ritchie46

Not when the user provides new_columns.

ghuls avatar Oct 12 '21 15:10 ghuls

I am not sure I understand the issue here. I see that CsvReader has an argument max_records, which can be used to do a full table scan for inferring the number of columns. Is this about exposing that argument in the Python API?

pradkrish avatar Nov 15 '21 20:11 pradkrish

I think I already fixed this issue.

Edit: not entirely certain anymore

ritchie46 avatar Nov 15 '21 20:11 ritchie46

Okay, I will be happy to get the commit that you think might have fixed the issue.

pradkrish avatar Nov 15 '21 21:11 pradkrish

I believe I fixed it here: https://github.com/pola-rs/polars/commit/ee26601f880c9367565303859f4ed41aa2c42339

ghuls avatar Nov 16 '21 07:11 ghuls

Any updates on this? It still doesn't work with infer_schema_length=0 or infer_schema_length=None.

jmakov avatar Sep 12 '23 19:09 jmakov

Is this issue resolved? Can I take this and open a PR?

Nagaprasadvr avatar Jan 25 '24 05:01 Nagaprasadvr

@Nagaprasadvr I'm not a code reviewer, so I can't give an absolutely definitive answer, but @stinodego marked it as accepted, and if it were fixed it would be closed, so I don't see why not.

One caveat is that it needs to be a Rust fix, not a Python fix, as the maintainers don't want feature divergence between Rust and Python.

deanm0000 avatar Jan 25 '24 23:01 deanm0000

Thanks, I will take this issue and open a PR.

Nagaprasadvr avatar Jan 26 '24 02:01 Nagaprasadvr