Filtered dataframe cannot be written to `JsonWriter`
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Cannot write a dataframe obtained from a parquet file to JsonWriter.
The error shows up in the step where the dataframe is written to the buffer: the dataframe is loaded with a parquet scan and the converted JSON string should be written into the buffer, but the buffer ends up effectively empty. It contains only '[' and ']' (byte values 91 and 93, respectively).
use polars::prelude::*;

fn get_data() -> DataFrame {
    // Lazily scan the parquet file and keep only the time/OHLCV columns.
    LazyFrame::scan_parquet("data/btcusdt-15m-2020-01-01-2022-12-31.parquet", ScanArgsParquet::default())
        .expect("Cannot open parquet file")
        .select([cols(["openTime", "open", "high", "low", "close", "volume"])])
        .collect()
        .expect("Cannot collect dataframe")
}
fn convert_df_to_json_data(df: &DataFrame) -> String {
    let mut buffer = Vec::new();
    // Serialize the dataframe as a JSON array of row objects into the buffer.
    JsonWriter::new(&mut buffer)
        .with_json_format(JsonFormat::Json)
        .finish(&mut df.clone())
        .unwrap();
    String::from_utf8(buffer).unwrap()
}
Reproducible example
Call get_data(), then pass the resulting dataframe to convert_df_to_json_data().
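A minimal sketch of a main that wires the two functions above together; everything besides the two functions themselves is assumed boilerplate:

fn main() {
    // Load and project the parquet data, then serialize it to a JSON string.
    let df = get_data();
    let json = convert_df_to_json_data(&df);
    // Expected: a JSON array with one object per row of the dataframe.
    println!("{}", json);
}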
Expected behavior
The dataframe is written to the buffer as a JSON string.
Installed versions
polars = { version = "0.26.1", features = ["json", "parquet", "lazy"] }
Are you able to share a minimal .parquet file that produces this bug?
I'm sorry; the error only occurs once the data has been filtered:
let df = get_data();
let df = df
.lazy()
.filter(col("openTime").gt_eq(lit(1671840000000i64)).and(col("openTime").lt(lit(1671926400000i64))))
.collect()
.unwrap();
let json = convert_df_to_json_data(&df);
The buffer should contain the JSON string, but the result is empty.
Output of println!("{:?}", df) after filtering:
shape: (96, 6)
┌───────────────┬─────────┬─────────┬─────────┬─────────┬──────────┐
│ openTime ┆ open ┆ high ┆ low ┆ close ┆ volume │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═════════╪═════════╪═════════╪══════════╡
│ 1671840000000 ┆ 16776.3 ┆ 16795.6 ┆ 16774.4 ┆ 16795.6 ┆ 1360.056 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671840900000 ┆ 16795.5 ┆ 16810.4 ┆ 16790.4 ┆ 16807.7 ┆ 1732.712 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671841800000 ┆ 16807.7 ┆ 16808.9 ┆ 16797.3 ┆ 16797.6 ┆ 727.209 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671842700000 ┆ 16797.6 ┆ 16809.2 ┆ 16796.1 ┆ 16800.0 ┆ 939.037 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671922800000 ┆ 16819.5 ┆ 16828.3 ┆ 16803.6 ┆ 16828.3 ┆ 1360.459 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671923700000 ┆ 16828.3 ┆ 16839.0 ┆ 16818.0 ┆ 16820.2 ┆ 1043.76 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671924600000 ┆ 16820.3 ┆ 16842.2 ┆ 16819.8 ┆ 16833.5 ┆ 1002.26 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1671925500000 ┆ 16833.4 ┆ 16833.5 ┆ 16825.9 ┆ 16828.9 ┆ 299.408 │
└───────────────┴─────────┴─────────┴─────────┴─────────┴──────────┘
Output of println!("{:?}", buffer) after running JsonWriter:
buffer: [91, 93]
json:   "[]"
@ritchie46 This seems like a bug in `df.rechunk()`.
I'm not familiar with the logic that determines when `should_rechunk` fires, but for this specific dataset `should_rechunk` returns false, so the rechunk doesn't happen. If I manually rechunk via `as_single_chunk_par`, the output is fine. That makes me think there is potentially a bug in the `should_rechunk` logic.
To elaborate: if I take the head of this df,
let mut df = df.head(Some(10));
df.n_chunks(); // -> 210
df.rechunk();
df.n_chunks(); // -> 210
It seems unsound that a df of height 10 would have 210 chunks after rechunking.
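To make the contrast concrete, here is a small sketch (not from the original comment) putting the calls side by side, assuming the same filtered df as above; the chunk counts are the ones reported in this thread:

let mut df = df.head(Some(10));
df.n_chunks();            // -> 210: head() slices each chunk, so the chunking is preserved
df.rechunk();             // no-op here because should_rechunk() reports nothing to merge
df.n_chunks();            // -> still 210
df.as_single_chunk_par(); // forces every column down to a single chunk
df.n_chunks();            // -> 1; after this, JsonWriter emits all rows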
@seongs1024 as a workaround, you can call `as_single_chunk_par` or `as_single_chunk` on the dataframe before writing to JSON:
fn convert_df_to_json_data(df: &mut DataFrame) -> String {
    // Merge every column into a single chunk so the writer sees all rows.
    df.as_single_chunk_par();
    let mut buffer = Vec::new();
    JsonWriter::new(&mut buffer)
        .with_json_format(JsonFormat::Json)
        .finish(df)
        .unwrap();
    String::from_utf8(buffer).unwrap()
}
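Note that the workaround changes the signature to take &mut DataFrame, so the call site needs a mutable binding; a small usage sketch reusing the frame from earlier:

let mut df = get_data(); // or the filtered dataframe from above
let json = convert_df_to_json_data(&mut df);
println!("{}", json);    // now contains one JSON object per row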
Thank you @universalmind303, it works when `as_single_chunk_par()` is added!