
Filtered dataframe cannot be written to `JsonWriter`

Open seongs1024 opened this issue 2 years ago • 6 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Cannot write a dataframe obtained from ParquetReader to JsonWriter.

The error shows up in the step that writes the dataframe to the buffer. The dataframe was imported by ParquetReader and the converted JSON string was supposed to be written to the buffer, but the buffer ends up essentially empty. (It only contains '[' and ']', i.e. the bytes 91 and 93, respectively.)

fn get_data() -> DataFrame {
    // Lazily scan the parquet file and select the OHLCV columns.
    LazyFrame::scan_parquet("data/btcusdt-15m-2020-01-01-2022-12-31.parquet", ScanArgsParquet::default())
        .expect("Cannot open parquet file")
        .select([
            cols(["openTime", "open", "high", "low", "close", "volume"]),
        ])
        .collect()
        .expect("Cannot collect dataframe")
}

fn convert_df_to_json_data(df: &DataFrame) -> String {
    let mut buffer = Vec::new();

    // Serialize the dataframe into the in-memory buffer as a JSON array.
    JsonWriter::new(&mut buffer)
        .with_json_format(JsonFormat::Json)
        .finish(&mut df.clone())
        .unwrap();

    String::from_utf8(buffer).unwrap()
}

Reproducible example

  1. get_data()
  2. convert_df_to_json_data()
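
Putting the two steps together, a minimal sketch (assuming the two functions above and the feature flags listed under "Installed versions"):

fn main() {
    // Step 1: load the dataframe from the parquet file.
    let df = get_data();

    // Step 2: serialize it to a JSON string.
    let json = convert_df_to_json_data(&df);

    // Expected: a JSON array with one object per row of the dataframe.
    println!("{}", json);
}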

Expected behavior

The dataframe is written to the buffer as a JSON string.

Installed versions

polars = { version = "0.26.1", features = ["json", "parquet", "lazy"] }

seongs1024 commented on Jan 25 '23

Are you able to share a minimal .parquet file that produces this bug?

universalmind303 commented on Jan 26 '23

btcusdt-15m-2020-01-01-2022-12-31.zip

Here is the file!

seongs1024 commented on Jan 26 '23

I'm sorry, the error occurs when the data has been filtered:

let df = get_data();
let df = df
    .lazy()
    .filter(col("openTime").gt_eq(lit(1671840000000i64)).and(col("openTime").lt(lit(1671926400000i64))))
    .collect()
    .unwrap();
let json = convert_df_to_json_data(&df);

The buffer should contain the JSON string, but the result is empty.

println!("{:?}", df); // after filtering

shape: (96, 6)
┌───────────────┬─────────┬─────────┬─────────┬─────────┬──────────┐
│ openTime      ┆ open    ┆ high    ┆ low     ┆ close   ┆ volume   │
│ ---           ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---      │
│ i64           ┆ f64     ┆ f64     ┆ f64     ┆ f64     ┆ f64      │
╞═══════════════╪═════════╪═════════╪═════════╪═════════╪══════════╡
│ 1671840000000 ┆ 16776.3 ┆ 16795.6 ┆ 16774.4 ┆ 16795.6 ┆ 1360.056 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671840900000 ┆ 16795.5 ┆ 16810.4 ┆ 16790.4 ┆ 16807.7 ┆ 1732.712 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671841800000 ┆ 16807.7 ┆ 16808.9 ┆ 16797.3 ┆ 16797.6 ┆ 727.209  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671842700000 ┆ 16797.6 ┆ 16809.2 ┆ 16796.1 ┆ 16800.0 ┆ 939.037  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ...           ┆ ...     ┆ ...     ┆ ...     ┆ ...     ┆ ...      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671922800000 ┆ 16819.5 ┆ 16828.3 ┆ 16803.6 ┆ 16828.3 ┆ 1360.459 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671923700000 ┆ 16828.3 ┆ 16839.0 ┆ 16818.0 ┆ 16820.2 ┆ 1043.76  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671924600000 ┆ 16820.3 ┆ 16842.2 ┆ 16819.8 ┆ 16833.5 ┆ 1002.26  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671925500000 ┆ 16833.4 ┆ 16833.5 ┆ 16825.9 ┆ 16828.9 ┆ 299.408  │
└───────────────┴─────────┴─────────┴─────────┴─────────┴──────────┘
println!("{:?}", buffer); // after executing JsonWriter

buffer: [91, 93]
json: "[]"

seongs1024 commented on Jan 26 '23

@ritchie46 This seems like a bug in df.rechunk().

I'm not familiar with the logic that determines when it should_rechunk, but for this specific dataset should_rechunk returns false, so the rechunk doesn't happen. If I manually rechunk via as_single_chunk_par the output is fine, so it makes me think there is potentially a bug in the should_rechunk logic.

To elaborate: if I take the head of this df

let mut df = df.head(Some(10));
df.n_chunks(); // -> 210
df.rechunk();
df.n_chunks(); // -> 210

It seems unsound that a df of height 10 would have 210 chunks after rechunking.
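
For comparison, a sketch of the manual rechunk mentioned above (same `df` and same calls as in the snippet; not a verified run against that exact dataset):

let mut df = df.head(Some(10));

// rechunk() is effectively a no-op here because should_rechunk
// reports false for this dataframe.
df.rechunk();
df.n_chunks(); // -> still 210

// Forcing all columns into a single chunk works around the problem;
// after this, the JsonWriter output is no longer empty.
df.as_single_chunk_par();
df.n_chunks(); // -> 1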

universalmind303 commented on Jan 26 '23

@seongs1024 as a workaround, you can use as_single_chunk_par or as_single_chunk before writing to JSON.

fn convert_df_to_json_data(df: &mut DataFrame) -> String {
    // Combine all chunks into a single chunk before writing; without this,
    // the JsonWriter produces an empty array for this dataframe.
    df.as_single_chunk_par();
    let mut buffer = Vec::new();

    JsonWriter::new(&mut buffer)
        .with_json_format(JsonFormat::Json)
        .finish(df)
        .unwrap();

    String::from_utf8(buffer).unwrap()
}
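
Since this workaround changes the function to take `&mut DataFrame`, the call site needs a mutable binding. A sketch of the call, following the earlier filter example:

let mut df = get_data()
    .lazy()
    .filter(col("openTime").gt_eq(lit(1671840000000i64)).and(col("openTime").lt(lit(1671926400000i64))))
    .collect()
    .unwrap();
let json = convert_df_to_json_data(&mut df);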

universalmind303 commented on Jan 26 '23

Thank you @universalmind303, it works when as_single_chunk_par() is added!

seongs1024 commented on Jan 27 '23