
Silent data loss in `json_decode` with default infer_schema_length

Open · severinh opened this issue on Dec 15, 2023 · 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

>>> import polars as pl
>>> 
>>> df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
>>> df.select(pl.col("json_column").str.json_decode()) # The default behavior silently replaces the 1.0 with null.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ ---         │
│ null        │
╞═════════════╡
│ null        │
│ null        │
│ null        │
│ null        │
│ …           │
│ null        │
│ null        │
│ null        │
│ null        │
└─────────────┘
>>> df.select(pl.col("json_column").str.json_decode(infer_schema_length=None)) # Here the decoding is correct.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ ---         │
│ f64         │
╞═════════════╡
│ null        │
│ null        │
│ null        │
│ null        │
│ …           │
│ null        │
│ null        │
│ null        │
│ 1.0         │
└─────────────┘

Log output

N/A

Issue description

In our system, we read a column with a JSON value from a database into a DataFrame, and then decode it using Polars' str.json_decode().

We had a production incident where the data in the column was mostly (but not entirely) null (sparse data). In particular, the first 100 elements were null. The default behavior of json_decode is to infer the data type from the first 100 values, so the inferred data type became Null and, as a result, the entire column became null. The few sparse non-null values were dropped on the floor, resulting in incorrect calculations downstream.

We solved the problem with an emergency release that set infer_schema_length to None, thereby checking the entire column.
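
For reference, the inference cutoff can be demonstrated right at the boundary. This is a minimal sketch (assuming the window covers exactly the first 100 rows, per the documented default; the outputs are what I would expect, reconstructed rather than copied from the incident):

>>> import polars as pl
>>> # 100 leading nulls: the single non-null value sits just outside the
>>> # default inference window, so the dtype is inferred as Null and the
>>> # value is lost.
>>> lossy = pl.DataFrame({"j": ["null"] * 100 + ["1.0"]})
>>> lossy.select(pl.col("j").str.json_decode()).dtypes
[Null]
>>> # 99 leading nulls: the value falls inside the window, and the column
>>> # decodes to Float64 as expected.
>>> ok = pl.DataFrame({"j": ["null"] * 99 + ["1.0"]})
>>> ok.select(pl.col("j").str.json_decode()).dtypes
[Float64]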

Expected behavior

I believe this is a dangerous default behavior in the presence of such sparse data.

I can think of two solutions (a caller-side mitigation is sketched after the list):

  1. Set the default value of infer_schema_length to None, since that is the safest option. I realize this may not be desirable, though, since scanning the entire column could be very expensive.
  2. Keep the default value of 100, but skip null values when inferring the schema. This would solve this particular case.
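
In the meantime, a caller-side mitigation that avoids both the inference window and the full-column scan is to pass an explicit dtype, which skips schema inference entirely. A sketch, assuming the column holds plain JSON numbers (str.json_decode accepts a dtype argument; the output shown is what I would expect):

>>> import polars as pl
>>> df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
>>> # An explicit dtype bypasses inference, so the sparse value survives
>>> # without scanning the whole column.
>>> df.select(pl.col("json_column").str.json_decode(dtype=pl.Float64)).tail(1)
shape: (1, 1)
┌─────────────┐
│ json_column │
│ ---         │
│ f64         │
╞═════════════╡
│ 1.0         │
└─────────────┘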

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             macOS-14.1.1-arm64-arm-64bit
Python:               3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  0.4.0
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.7.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              14.0.1
pydantic:             2.4.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.16
xlsx2csv:             0.8.1
xlsxwriter:           3.1.2

severinh · Dec 15 '23 13:12