
Reading in CSV file gives an error (but through pandas it doesn't)

Open svaningelgem opened this issue 1 year ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
pl.read_csv("dummy.csv")

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 pl.read_csv("dummy.csv", )

File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    128 @wraps(function)
    129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    130     _rename_keyword_argument(
    131         old_name, new_name, kwargs, function.__name__, version
    132     )
--> 133     return function(*args, **kwargs)

File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    128 @wraps(function)
    129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    130     _rename_keyword_argument(
    131         old_name, new_name, kwargs, function.__name__, version
    132     )
--> 133     return function(*args, **kwargs)

File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    128 @wraps(function)
    129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    130     _rename_keyword_argument(
    131         old_name, new_name, kwargs, function.__name__, version
    132     )
--> 133     return function(*args, **kwargs)

File e:\miniforge3\envs\kaggle\lib\site-packages\polars\io\csv\functions.py:397, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
    385         dtypes = {
    386             new_to_current.get(column_name, column_name): column_dtype
    387             for column_name, column_dtype in dtypes.items()
    388         }
    390 with _prepare_file_arg(
    391     source,
    392     encoding=encoding,
   (...)
    395     storage_options=storage_options,
    396 ) as data:
--> 397     df = pl.DataFrame._read_csv(
    398         data,
    399         has_header=has_header,
    400         columns=columns if columns else projection,
    401         separator=separator,
    402         comment_prefix=comment_prefix,
    403         quote_char=quote_char,
    404         skip_rows=skip_rows,
    405         dtypes=dtypes,
    406         schema=schema,
    407         null_values=null_values,
    408         missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    409         ignore_errors=ignore_errors,
    410         try_parse_dates=try_parse_dates,
    411         n_threads=n_threads,
    412         infer_schema_length=infer_schema_length,
    413         batch_size=batch_size,
    414         n_rows=n_rows,
    415         encoding=encoding if encoding == "utf8-lossy" else "utf8",
    416         low_memory=low_memory,
    417         rechunk=rechunk,
    418         skip_rows_after_header=skip_rows_after_header,
    419         row_index_name=row_index_name,
    420         row_index_offset=row_index_offset,
    421         sample_size=sample_size,
    422         eol_char=eol_char,
    423         raise_if_empty=raise_if_empty,
    424         truncate_ragged_lines=truncate_ragged_lines,
    425     )
    427 if new_columns:
    428     return _update_columns(df, new_columns)

File e:\miniforge3\envs\kaggle\lib\site-packages\polars\dataframe\frame.py:789, in DataFrame._read_csv(cls, source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
    785         raise ValueError(msg)
    787 projection, columns = handle_projection_columns(columns)
--> 789 self._df = PyDataFrame.read_csv(
    790     source,
    791     infer_schema_length,
    792     batch_size,
    793     has_header,
    794     ignore_errors,
    795     n_rows,
    796     skip_rows,
    797     projection,
    798     separator,
    799     rechunk,
    800     columns,
    801     encoding,
    802     n_threads,
    803     path,
    804     dtype_list,
    805     dtype_slice,
    806     low_memory,
    807     comment_prefix,
    808     quote_char,
    809     processed_null_values,
    810     missing_utf8_is_empty_string,
    811     try_parse_dates,
    812     skip_rows_after_header,
    813     _prepare_row_index_args(row_index_name, row_index_offset),
    814     sample_size=sample_size,
    815     eol_char=eol_char,
    816     raise_if_empty=raise_if_empty,
    817     truncate_ragged_lines=truncate_ragged_lines,
    818     schema=schema,
    819 )
    820 return self

ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)

The current offset in the file is 32510 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Fools!"` to the `null_values` list.

Original error: remaining bytes non-empty

Issue description

I was working on a Kaggle project and couldn't load this CSV file (whilst pandas could load it without a problem). I tried to isolate the faulty line, but when I cut the file before or after the mentioned error line, it no longer errored.

So, I'm very sorry about not being able to cut the file down any shorter: dummy.csv

The problem is that pandas can load this file without an issue, but polars cannot for some reason. Around the offending line, I also don't see any particular issue: quotes are duplicated and the fields are enclosed, so that should be fine.
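
For illustration (hypothetical data, since the real dummy.csv is only linked above), a field of that shape parses fine in isolation:

import io

import polars as pl

# A made-up row shaped like the ones described: the field is enclosed in
# quotes, embedded quotes are doubled, and it contains a newline.
csv = b'index,string\n164,"He said ""Fools!""\nand left."\n'
print(pl.read_csv(io.BytesIO(csv)))  # parses without error in isolation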

Expected behavior

The file should be read in successfully.

Installed versions

--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.25
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

svaningelgem avatar Jan 25 '24 06:01 svaningelgem

hmm cannot reproduce on macos or windows using 0.20.5 🤔

Works fine for me. Not sure what the problem is, or if this is a troll because of the dataset 🤣

Does it work if you delete the specific entry?

Julian-J-S avatar Jan 25 '24 06:01 Julian-J-S

That is very bizarre! I can consistently reproduce it (Windows + versions above).

And no, no trolling ;-) (loved the joke).

I tried now:

import pandas as pd

pl.from_dataframe(pd.read_csv('dummy.csv')).write_csv('dummy2.csv')
pl.read_csv('dummy2.csv')

Same error... (but the file is smaller, so I assume a difference in \r\n vs \n)

I removed index 164 (which contained the place where it started to error), and I could read in the file without an issue. So the next thing I did was remove everything but 164: read in fine too. Then remove everything AFTER 164: read in fine. Restore the whole file: error.

That is the weird thing about isolating this particular error: it proves difficult. Plus, if you cannot replicate it, that's doubly weird.

I'm running polars from a notebook (installed via pip), but from what I know of the library, that shouldn't matter?

svaningelgem avatar Jan 25 '24 06:01 svaningelgem

Ok, so I tested now (on exactly the same file):

  • Ubuntu (WSL) -- fails (different environment)
  • CLI Windows -- fails (same environment)
  • notebook Windows -- fails (same environment)

Always on the same line.

On Ubuntu (native Linux) it works

mamba create -n polars python=3.10 polars
mamba activate polars
python
>>> import polars as pl
>>> pl.read_csv("dummy.csv")
shape: (2_016, 2)

But the same procedure (clean polars environment + executing the commands) on my Windows system leads to exactly the same failure as before. So I can consistently make it fail, even in a clean environment. And I have now tried the same procedure on my Windows WSL system: also a failure...

I also tried copying the file onto another drive, just in case it was an issue with faulty sectors on my HDD. Next thing: a reboot. Nope, same issue after rebooting, so a memory corruption issue is unlikely.

Any ideas what I can try next?

svaningelgem avatar Jan 25 '24 07:01 svaningelgem

For info, I cannot reproduce on Mac M1 (polars 0.20.5) either:

>>> pl.read_csv('dummy.csv').min()
shape: (1, 2)
┌───────┬───────────────────────────────────┐
│ index ┆ string                            │
│ ---   ┆ ---                               │
│ i64   ┆ str                               │
╞═══════╪═══════════════════════════════════╡
│ 0     ┆ " Inability to cohabit with othe… │
└───────┴───────────────────────────────────┘

taki-mekhalfa avatar Jan 26 '24 12:01 taki-mekhalfa

That is so weird... What more can I try to investigate this issue? Because I can replicate it consistently... So it's likely something environment-related?

svaningelgem avatar Jan 26 '24 12:01 svaningelgem

Does setting the number of threads influence this? POLARS_MAX_THREADS=2 to 16?
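
(For reference, a minimal way to test this from Python; the variable must be set before polars is imported, since the thread pool is initialized at import time:)

import os

# POLARS_MAX_THREADS only takes effect if set before polars initializes
# its thread pool, i.e. before the import below
os.environ["POLARS_MAX_THREADS"] = "2"

import polars as pl

pl.read_csv("dummy.csv")  # retry the failing read with a small pool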

ritchie46 avatar Jan 27 '24 09:01 ritchie46

Hi @ritchie46, it still failed with the same error, but on a different line:

polars.exceptions.ComputeError: could not parse `I'm starting to think the real solution is a tax on idiots. Undoubtedly within the Liberal party and the voters that elected these clowns there would be enough collected to pay off the entire national debt."` as dtype `i64` at column 'index' (column number 1)

When unsetting this envvar, I get the original error message:

polars.exceptions.ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)

What else could I try? I'm starting to suspect something processor/thread-related, because it works for most people and I recently upgraded to a 16-core (32 logical processors) machine.

svaningelgem avatar Jan 27 '24 11:01 svaningelgem

Seems I'm onto something. In Windows there is a setting (msconfig > boot > advanced > #cores), which I changed to 6 (i.e. 3 cores, 6 logical processors).

...And... It worked fine:

>>> import polars as pl
>>> pl.read_csv('dummy.csv').shape
(2016, 2)

So polars can't handle a high core count?

edit: After reversing the 6-core count to unlimited (i.e. 32), it started to fail again. [image]

Seems bigger is not always better 🤣

svaningelgem avatar Jan 27 '24 11:01 svaningelgem

I tried polars 0.20.6 on two Ubuntu VMs:

  • 12 cores => Works fine
  • 22 cores => Does not work, I have a similar error
>>> pl.scan_csv('dummy.csv').collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/meta2002/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: could not parse `Behold Exhibit A:` as dtype `i64` at column 'index' (column number 1)

The current offset in the file is 47165 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Behold Exhibit A:` to the `null_values` list.

Original error: ```remaining bytes non-empty```

taki-mekhalfa avatar Jan 27 '24 12:01 taki-mekhalfa

Seems like the error is related to having a quoted cell with newlines in it:

[image]

Wainberg avatar Jan 29 '24 06:01 Wainberg

Hi @Wainberg, I would agree, except that the very first line (record 0) already has a newline in it, as do records 2, 6 and 7... Why would it only start to fail at record 164 then? And why does it not fail with 12 cores but fail with 22 (as @taki-mekhalfa determined)?

I'm more inclined to suspect some kind of CPU optimization gone awry?

svaningelgem avatar Jan 29 '24 06:01 svaningelgem

After some debugging:

The file is split into multiple chunks that are processed in parallel; the number of chunks depends on the number of cores.

The code that determines the chunks tries to start each chunk at a valid start of a record, but it's flawed when quoted fields span multiple lines: it produces incorrect chunks that start in the middle of some text. That's why, when we start parsing such a chunk, we stumble upon non-numerical text and polars raises the `could not parse ... as dtype i64` error.

The more cores you have, the more probable it is to get wrong chunks, and you can easily construct examples that crash even with few cores.

If you do

>>> pl.read_csv('dummy.csv', dtypes={'index': pl.String, 'string': pl.String}, ignore_errors=True)

You will have:

polars.exceptions.ComputeError: found more fields than defined in 'Schema'

which confirms that the chunks are incorrectly determined.
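
As a toy illustration (not polars' actual splitting code), a splitter that treats the byte after any newline as a record start goes wrong as soon as that newline sits inside an open quoted field:

# 'data' mimics the problem file: the second record's quoted field spans
# several lines, and each embedded line looks like a plausible record
data = b'index,string\n1,"line one\n2,looks like a record\n3,so does this"\n4,real\n'

def naive_chunk_start(buf: bytes, offset: int) -> int:
    # jump to the byte after the next newline and call it a record start;
    # wrong whenever that newline is embedded in a quoted field
    return buf.index(b"\n", offset) + 1

start = naive_chunk_start(data, 20)     # offset 20 falls inside record 1's field
print(data[start:].split(b"\n", 1)[0])  # b'2,looks like a record' -- not a real record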

taki-mekhalfa avatar Jan 30 '24 11:01 taki-mekhalfa

Consider for example this CSV that only has 4 records:

"col1","col2"
1,a
2,b
3,2
4,"
5,1
6,1
7,1
8,1
"
>>> import pandas as pd
>>> pd.read_csv("test.csv")
   col1                    col2
0     1                       a
1     2                       b
2     3                       2
3     4  \n5,1\n6,1\n7,1\n8,1\n

A record can contain an arbitrarily long string that spans an arbitrary number of lines and looks like valid CSV records, which means any heuristic that relies on local inspection can be wrong in some edge cases.

Thus if records can span multiple lines, I am fairly certain this is the only correct general approach:

  1. Do a first-pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length, e.g. "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
  2. The chunking code relies on that.

itamarst avatar Feb 01 '24 15:02 itamarst

And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
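
A minimal sketch of that first pass (hypothetical code; it assumes double-quote quoting with doubled-quote escaping, as in the file above, and no backslash escapes):

def record_start_lines(path: str) -> list[int]:
    """First pass: return the 0-based line numbers at which records start."""
    starts = []
    open_quoted_field = False
    with open(path, "rb") as f:
        for lineno, line in enumerate(f):
            if not open_quoted_field:
                starts.append(lineno)  # this line begins a new record
            # an odd quote count flips whether the line ends inside an
            # open quoted field; doubled "" escapes contribute an even count
            if line.count(b'"') % 2 == 1:
                open_quoted_field = not open_quoted_field
    return starts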

itamarst avatar Feb 01 '24 15:02 itamarst

And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.

It seems like chunking in the presence of quotation marks might be tricky, but scanning the entire file should be the worst case.

taki-mekhalfa avatar Feb 01 '24 15:02 taki-mekhalfa

Yes, this is tricky. We search for the right number of fields and try to ensure we are not inside an embedded (quoted) field. We currently require 3 matching lines before we accept a split point. I will see if tuning this helps.

ritchie46 avatar Feb 02 '24 10:02 ritchie46

Ok, looking at this file: it is impossible to find a good split position in the middle. This should be read single-threaded.

Do a first-pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length, e.g. "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".

This would be very expensive, as you would need to read the whole file from disk. I think we can do some inspection during schema inference and, for those edge cases, fall back to single-core scanning.

I made a PR that sniffs during schema inference: if we find too many newlines in escaped fields, we fall back to single-threaded reading.
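
A rough sketch of that kind of check (hypothetical code and threshold, not the actual PR):

def should_read_single_threaded(sample: bytes, max_embedded_newlines: int = 8) -> bool:
    """Scan the schema-inference sample: if quoted fields contain many
    newlines, parallel chunking is unreliable, so read single-threaded."""
    in_quotes = False
    embedded = 0
    for byte in sample:
        if byte == ord('"'):
            # a doubled quote ("") toggles twice, i.e. a net no-op
            in_quotes = not in_quotes
        elif byte == ord("\n") and in_quotes:
            embedded += 1
            if embedded > max_embedded_newlines:
                return True
    return False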

ritchie46 avatar Feb 02 '24 10:02 ritchie46