polars
polars copied to clipboard
Silent data loss in `json_decode` with default infer_schema_length
Checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
>>> import polars as pl
>>>
>>> df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
>>> df.select(pl.col("json_column").str.json_decode()) # The default behavior silently replaces the 1.0 with null.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ --- │
│ null │
╞═════════════╡
│ null │
│ null │
│ null │
│ null │
│ … │
│ null │
│ null │
│ null │
│ null │
└─────────────┘
>>> df.select(pl.col("json_column").str.json_decode(infer_schema_length=None)) # Here the decoding is correct.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ --- │
│ f64 │
╞═════════════╡
│ null │
│ null │
│ null │
│ null │
│ … │
│ null │
│ null │
│ null │
│ 1.0 │
└─────────────┘
Log output
N/A
Issue description
In our system, we read a column with a JSON value from a database into a DataFrame, and then decode it using Polars' str.json_decode()
.
We had a production incident where the data in the column was mostly (but not entirely) null
(sparse data). In particular, the first 100 elements were null
. The default behavior of json_decode
is to infer the data type from the first 100 values. So in this case, the inferred data type became null
, and as a result, the entire column became null
. The few sparse values were dropped on the floor, resulting in wrong calculations downstream.
We solved the problem with an emergency release that set infer_schema_length
to None
, hereby checking the entire column.
Expected behavior
I believe this is a dangerous default behavior in the presence of such sparse data.
I can think of two solutions:
- Either set the default value of
infer_schema_length
toNone
, since that's the safest option. I realize though that this may not be desirable, since this could potentially be very expensive. - Keep the default value of 100. But don't count null values. This would solve this particular case.
Installed versions
--------Version info---------
Polars: 0.19.19
Index type: UInt32
Platform: macOS-14.1.1-arm64-arm-64bit
Python: 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]
----Optional dependencies----
adbc_driver_manager: 0.4.0
cloudpickle: <not installed>
connectorx: 0.3.2
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
matplotlib: 3.7.2
numpy: 1.26.2
openpyxl: 3.1.2
pandas: 2.1.3
pyarrow: 14.0.1
pydantic: 2.4.2
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.16
xlsx2csv: 0.8.1
xlsxwriter: 3.1.2