Reading in CSV file gives an error (but through pandas it doesn't)
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
pl.read_csv("dummy.csv")
Log output
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[30], line 1
----> 1 pl.read_csv("dummy.csv", )
File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
128 @wraps(function)
129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
130 _rename_keyword_argument(
131 old_name, new_name, kwargs, function.__name__, version
132 )
--> 133 return function(*args, **kwargs)
File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
128 @wraps(function)
129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
130 _rename_keyword_argument(
131 old_name, new_name, kwargs, function.__name__, version
132 )
--> 133 return function(*args, **kwargs)
File e:\miniforge3\envs\kaggle\lib\site-packages\polars\utils\deprecation.py:133, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
128 @wraps(function)
129 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
130 _rename_keyword_argument(
131 old_name, new_name, kwargs, function.__name__, version
132 )
--> 133 return function(*args, **kwargs)
File e:\miniforge3\envs\kaggle\lib\site-packages\polars\io\csv\functions.py:397, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
385 dtypes = {
386 new_to_current.get(column_name, column_name): column_dtype
387 for column_name, column_dtype in dtypes.items()
388 }
390 with _prepare_file_arg(
391 source,
392 encoding=encoding,
(...)
395 storage_options=storage_options,
396 ) as data:
--> 397 df = pl.DataFrame._read_csv(
398 data,
399 has_header=has_header,
400 columns=columns if columns else projection,
401 separator=separator,
402 comment_prefix=comment_prefix,
403 quote_char=quote_char,
404 skip_rows=skip_rows,
405 dtypes=dtypes,
406 schema=schema,
407 null_values=null_values,
408 missing_utf8_is_empty_string=missing_utf8_is_empty_string,
409 ignore_errors=ignore_errors,
410 try_parse_dates=try_parse_dates,
411 n_threads=n_threads,
412 infer_schema_length=infer_schema_length,
413 batch_size=batch_size,
414 n_rows=n_rows,
415 encoding=encoding if encoding == "utf8-lossy" else "utf8",
416 low_memory=low_memory,
417 rechunk=rechunk,
418 skip_rows_after_header=skip_rows_after_header,
419 row_index_name=row_index_name,
420 row_index_offset=row_index_offset,
421 sample_size=sample_size,
422 eol_char=eol_char,
423 raise_if_empty=raise_if_empty,
424 truncate_ragged_lines=truncate_ragged_lines,
425 )
427 if new_columns:
428 return _update_columns(df, new_columns)
File e:\miniforge3\envs\kaggle\lib\site-packages\polars\dataframe\frame.py:789, in DataFrame._read_csv(cls, source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
785 raise ValueError(msg)
787 projection, columns = handle_projection_columns(columns)
--> 789 self._df = PyDataFrame.read_csv(
790 source,
791 infer_schema_length,
792 batch_size,
793 has_header,
794 ignore_errors,
795 n_rows,
796 skip_rows,
797 projection,
798 separator,
799 rechunk,
800 columns,
801 encoding,
802 n_threads,
803 path,
804 dtype_list,
805 dtype_slice,
806 low_memory,
807 comment_prefix,
808 quote_char,
809 processed_null_values,
810 missing_utf8_is_empty_string,
811 try_parse_dates,
812 skip_rows_after_header,
813 _prepare_row_index_args(row_index_name, row_index_offset),
814 sample_size=sample_size,
815 eol_char=eol_char,
816 raise_if_empty=raise_if_empty,
817 truncate_ragged_lines=truncate_ragged_lines,
818 schema=schema,
819 )
820 return self
ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)
The current offset in the file is 32510 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Fools!"` to the `null_values` list.
Original error: bytes non-empty
Issue description
I was working on a Kaggle project and couldn't load this CSV file (while pandas loads it without a problem). I tried to isolate the faulty line, but when I cut the file before or after the reported error line, it no longer errored.
So I'm very sorry about not being able to cut the file down any further: dummy.csv
The problem is that pandas can load this file without an issue, but polars cannot for some reason. Around the offending line I also don't see any particular issue: quotes are doubled and enclosed, so that should be fine.
Expected behavior
The file should be read in without an error.
Installed versions
--------Version info---------
Polars: 0.20.5
Index type: UInt32
Platform: Windows-10-10.0.19045-SP0
Python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.12.2
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.2
numpy: 1.26.3
openpyxl: <not installed>
pandas: 2.1.4
pyarrow: 15.0.0
pydantic: 2.5.3
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.25
xlsx2csv: <not installed>
xlsxwriter: <not installed>
hmm cannot reproduce on macOS or Windows using 0.20.5 🤔
Works fine for me. Not sure what the problem is, or if this is a troll because of the dataset 🤣
Does it work if you delete the specific entry?
That is very bizarre! I can consistently reproduce it (Windows + versions above).
And no, no trolling ;-) (loved the joke).
I have now tried:
import pandas as pd
import polars as pl

pl.from_dataframe(pd.read_csv('dummy.csv')).write_csv('dummy2.csv')
pl.read_csv('dummy2.csv')
Same error... (but the file is smaller, so I assume a difference in `\r\n` vs `\n`).
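A quick way to check whether the size difference really is just line endings (using the same file names as in the snippet above) could be a byte-level count like this:

# Count line-ending styles in both files to see whether the size difference
# is only \r\n vs \n (assumes both files exist locally).
with open("dummy.csv", "rb") as f:
    original = f.read()
with open("dummy2.csv", "rb") as f:
    rewritten = f.read()

print("original :", original.count(b"\r\n"), "CRLF of", original.count(b"\n"), "line endings")
print("rewritten:", rewritten.count(b"\r\n"), "CRLF of", rewritten.count(b"\n"), "line endings")
print("size difference:", len(original) - len(rewritten), "bytes")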
I removed index 164 (the record where it started to error), and then I could read the file without an issue. So the next thing I did was remove everything but 164: read in fine too. Then I removed everything AFTER 164: read in fine. Restore the whole file: error.
That is the weird thing about isolating this particular error: it proves difficult. Plus, if you cannot replicate it, that's doubly weird.
I'm running polars from a notebook (installed via pip), but as far as I know the library, that shouldn't matter?
OK, so I have now tested (with exactly the same file):
- Ubuntu (WSL) -- fails (different environment)
- Windows CLI -- fails (same environment)
- Windows notebook -- fails (same environment)
Always on the same line.
On Ubuntu (native Linux) it works:
mamba create -n polars python=3.10 polars
mamba activate polars
python
>>> import polars as pl
>>> pl.read_csv("dummy.csv")
shape: (2_016, 2)
But the same procedure (clean polars environment + executing the commands) on my Windows system leads to exactly the same failure as before. So I can consistently make it fail, even in a clean environment. And I have now tried the same procedure on my Windows WSL system: also a failure...
I also tried copying the file onto another drive, just in case it was an issue with faulty sectors on my HDD. Next thing: a reboot. Nope, same issue after a reboot, so a memory corruption issue is unlikely.
Any ideas what I can try next?
For info, I cannot reproduce on Mac M1 (polars 0.20.5) either:
>>> pl.read_csv('dummy.csv').min()
shape: (1, 2)
┌───────┬───────────────────────────────────┐
│ index │ string                            │
│ ---   │ ---                               │
│ i64   │ str                               │
╞═══════╪═══════════════════════════════════╡
│ 0     │ " Inability to cohabit with othe… │
└───────┴───────────────────────────────────┘
That is so weird... What more can I try to investigate this issue? Because I can replicate it consistently... so it's likely something environment related?
Does setting the number of threads influence this? `POLARS_MAX_THREADS=2` to `16`?
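For anyone trying this: POLARS_MAX_THREADS is read when polars is imported, so it has to be set before the import (or exported in the shell before starting Python). A minimal sketch:

import os

# Must be set before polars is imported; polars picks it up at import time.
os.environ["POLARS_MAX_THREADS"] = "2"

import polars as pl

print(pl.threadpool_size())  # should now report 2 (renamed to thread_pool_size() in newer versions)
pl.read_csv("dummy.csv")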
Hi @ritchie46, it still failed with the same error, but on a different line:
polars.exceptions.ComputeError: could not parse `I'm starting to think the real solution is a tax on idiots. Undoubtedly within the Liberal party and the voters that elected these clowns there would be enough collected to pay off the entire national debt."` as dtype `i64` at column 'index' (column number 1)
When unsetting this envvar, I get the original error message:
polars.exceptions.ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)
What else could I try? I'm starting to suspect something processor/thread related, because it works for most people and I recently upgraded to a 16-core (32 logical processors) machine.
Seems I'm onto something. In Windows there is a setting (msconfig > boot > advanced > #cores), which I changed to 6 (i.e. 3 cores, 6 logical processors).
...And... It worked fine:
>>> import polars as pl
>>> pl.read_csv('dummy.csv').shape
(2016, 2)
So polars can't handle a high core count?
Edit: after reverting the 6-core setting back to unlimited (i.e. 32), it started to fail again.
Seems bigger is not always better 🤣
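For reference, a quick way to see how many threads polars will actually use on a given machine, without rebooting into a reduced core count, is something like the following (pl.threadpool_size() is the 0.20.x name; newer versions call it pl.thread_pool_size()):

import os
import polars as pl

# Compare what the OS reports with the size of polars' internal thread pool;
# the failure in this thread seems to correlate with a high thread count.
print("logical CPUs:", os.cpu_count())
print("polars thread pool:", pl.threadpool_size())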
I tried polars 0.20.6 on two ubuntu VMs:
- 12 cores => Works fine
- 22 cores => Does not work; I get a similar error:
>>> pl.scan_csv('dummy.csv').collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/meta2002/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
return wrap_df(ldf.collect())
polars.exceptions.ComputeError: could not parse `Behold Exhibit A:` as dtype `i64` at column 'index' (column number 1)
The current offset in the file is 47165 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Behold Exhibit A:` to the `null_values` list.
Original error: ```remaining bytes non-empty```
Seems like the error is related to having a quoted cell with newlines in it.
Hi @Wainberg, I would agree, if it weren't that the first line (record 0) already has a newline in it, and so do records 2, 6 and 7... Why would it only start to fail at record 164 then? And why does it work with 12 cores but fail with 22 cores (as @taki-mekhalfa determined)?
I'm more inclined to suspect some kind of CPU optimization gone awry.
After some debugging:
The file is split into multiple chunks that are processed in parallel; the number of these chunks depends on the number of cores.
The code that determines the chunks tries to start each chunk at a valid start of a record, but it is flawed when a quoted field spans multiple lines: it produces incorrect chunks that start in the middle of some text. That is why, when we start parsing such a chunk, we stumble upon non-numerical text and polars raises the `could not parse ... as dtype i64` error.
The more cores you have, the more likely it is to get wrong chunks, and you can easily construct examples that crash even with only a few cores (see the sketch after the example below).
If you do:
>>> pl.read_csv('dummy.csv', dtypes={'index': pl.String, 'string': pl.String}, ignore_errors=True)
you will get:
polars.exceptions.ComputeError: found more fields than defined in 'Schema'
which confirms the chunks are determined incorrectly.
Consider for example this CSV that only has 4 records:
"col1","col2"
1,a
2,b
3,2
4,"
5,1
6,1
7,1
8,1
"
>>> import pandas as pd
>>> pd.read_csv("test.csv")
col1 col2
0 1 a
1 2 b
2 3 2
3 4 \n5,1\n6,1\n7,1\n8,1\n
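To illustrate the "easily construct examples" point above, here is a hypothetical sketch (the file name stress.csv and all the parameters are made up; it is not the dummy.csv from this issue) that writes a CSV with many long, quoted, multi-line fields whose inner lines mimic ordinary records. Whether it actually trips the chunker depends on how the chunk boundaries fall for the local thread count:

import csv
import random

# Hypothetical stress file: an integer "index" column plus a "string" column
# where some cells contain many embedded newlines that look like `int,text`
# records. csv.writer quotes fields containing newlines automatically.
random.seed(0)
with open("stress.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "string"])
    for i in range(5_000):
        if i % 7 == 0:
            cell = "\n".join(
                f"{random.randint(0, 99)},looks like a record" for _ in range(50)
            )
        else:
            cell = "a short single-line value"
        writer.writerow([i, cell])

# pl.read_csv("stress.csv") may or may not fail here, depending on where the
# per-thread chunk boundaries land.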
A record can contain an arbitrarily long string that spans an arbitrary number of lines and looks like valid CSV records, which means any heuristic that relies on local inspection can be wrong in some edge cases.
Thus, if records can span multiple lines, I am fairly certain this is the only correct general approach:
- Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
- The chunking code then relies on that mapping.
And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
> And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
It seems like chunking with quotations might be tricky, but scanning the entire file should be the worst case.
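A minimal sketch of what that parity-based first pass could look like, assuming standard double-quote escaping (doubled "" inside quoted fields) and without claiming this is how polars implements or should implement it:

def record_spans(path):
    # For each logical record, collect (starting physical line, number of physical lines).
    spans = []
    in_quotes = False
    start = 1
    length = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if length == 0:
                start = lineno
            length += 1
            # Doubled quotes ("") cancel out, so the parity of the raw '"' count
            # tells us whether this line opens or closes a quoted field.
            if line.count(b'"') % 2 == 1:
                in_quotes = not in_quotes
            if not in_quotes:
                spans.append((start, length))
                length = 0
    return spans

Calling record_spans("dummy.csv") would yield the physical-line span of every record, which the chunking code could then use to pick boundaries that never fall inside a quoted field.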
Yes, this is tricky. We search for the right number of fields and try to ensure we are not inside an embedded field. We currently require 3 corresponding lines before we accept a position. I will see if tuning this helps.
Ok, looking at this file: it is impossible to find a good position in the middle. This should be read single-threaded.
> Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
This would be very expensive, as you would need to read the whole file from disk. I think we can do some inspection during schema inference and fall back to single-core scanning for those edge cases.
I made a PR that sniffs during schema inference, and if we find too many newlines in escaped fields we fall back to single-threaded reading.
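Until that lands, a possible user-side workaround (untested against this particular file, so treat it as a guess) is to force a single-threaded read via the n_threads parameter that read_csv already exposes:

import polars as pl

# Forcing a single parsing thread should avoid splitting the file into
# per-thread chunks that can start inside a quoted, multi-line field.
df = pl.read_csv("dummy.csv", n_threads=1)
print(df.shape)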