Failure to read/scan ndjson file with faulty line with `ignore_errors=True`

Open nbrr opened this issue 1 year ago • 0 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

example.json:

{"a": 1,"b": 3}
x
{"a": 4,"b": 2}

scan:

import polars as pl

pl.scan_ndjson('example.json', ignore_errors=True).collect()

read:

import polars as pl

pl.read_ndjson('example.json', ignore_errors=True)

Log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nbrr/Library/Caches/pypoetry/virtualenvs/env-FCFMsks8-py3.11/lib/python3.11/site-packages/polars/utils/deprecation.py", line 133, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nbrr/Library/Caches/pypoetry/virtualenvs/env-FCFMsks8-py3.11/lib/python3.11/site-packages/polars/utils/deprecation.py", line 133, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nbrr/Library/Caches/pypoetry/virtualenvs/env-FCFMsks8-py3.11/lib/python3.11/site-packages/polars/io/ndjson.py", line 110, in scan_ndjson
    return pl.LazyFrame._scan_ndjson(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nbrr/Library/Caches/pypoetry/virtualenvs/env-FCFMsks8-py3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 555, in _scan_ndjson
    self._ldf = PyLazyFrame.new_from_ndjson(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: InternalError(TapeError) at character 0 ('x')

Issue description

Both `read_ndjson` and `scan_ndjson` fail on an NDJSON file that contains a line that is not valid JSON, even when `ignore_errors=True` is passed.

Expected behavior

`example.json` is read and the non-JSON line (`x`) is skipped.
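
To make the expected semantics concrete, here is a minimal standard-library sketch of "skip lines that do not parse". It is not how Polars implements `ignore_errors`; the helper name `load_valid_records` is hypothetical, and the sketch assumes that only lines holding a JSON object should count as records.

```python
import json
from typing import Any, Iterable


def load_valid_records(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse NDJSON lines, keeping only lines that hold a JSON object.

    Hypothetical helper for illustration; not part of Polars.
    """
    records: list[dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line, e.g. the bare "x" in example.json
        if isinstance(value, dict):
            records.append(value)  # assumption: records are JSON objects
    return records


# Same three lines as example.json above.
sample = ['{"a": 1,"b": 3}', "x", '{"a": 4,"b": 2}']
rows = load_valid_records(sample)
print(rows)  # [{'a': 1, 'b': 3}, {'a': 4, 'b': 2}]
```

As a possible workaround until the readers handle this case, the resulting list could be passed to `pl.from_dicts(rows)`, at the cost of parsing in Python rather than in the native reader.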

Installed versions

--------Version info---------
Polars:               0.20.4
Index type:           UInt32
Platform:             macOS-10.16-x86_64-i386-64bit
Python:               3.11.5 (main, Sep 11 2023, 08:19:27) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              13.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

nbrr, Jan 16 '24 16:01