r-polars
Why does a query error take so long to appear on data from `scan_*`?
One advantage of `LazyFrame` is that the query should be checked before any computation. In the example below, I perform an invalid operation by comparing a character column (`tailnum`) to a numeric value. This should raise an error before any computation takes place.
However, the error takes a long time (~18 sec) to appear when I run the query on the LazyFrame created by `pl$scan_parquet()`. By comparison, I made a second LazyFrame by `fetch()`ing all the rows and then calling `as_polars_lf()`. The same query then errors much faster:
library(polars)
options(polars.do_not_repeat_call = TRUE)
# create large parquet file
parquet_dest <- tempfile(fileext = ".parquet")
large_data <- data.table::rbindlist(rep(list(nycflights13::flights), 100))
dim(large_data)
#> [1] 33677600 19
pl$DataFrame(large_data)$write_parquet(parquet_dest)
remove(large_data)
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 785809 42.0 1434750 76.7 1434750 76.7
#> Vcells 6534825 49.9 428124730 3266.4 512235533 3908.1
# scan this parquet file
test <- pl$scan_parquet(parquet_dest)
system.time({
  test$with_columns(
    pl$col("tailnum")$str$to_lowercase(),
    bar = pl$col("tailnum")$str$starts_with("N8")
  )$
    filter(pl$col("origin") == "EWR", pl$col("tailnum") > 2)$
    collect()
})
#> Error: Execution halted with the following contexts
#> 0: In R: in $collect():
#> 1: Encountered the following error in Rust-Polars:
#> cannot compare string with numeric data
#> Timing stopped at: 29.86 40.78 17.82
# fetch all rows and remake a LazyFrame
test2 <- test$fetch(33677600) |>
  as_polars_lf()
system.time({
  test2$with_columns(
    pl$col("tailnum")$str$to_lowercase(),
    bar = pl$col("tailnum")$str$starts_with("N8")
  )$
    filter(pl$col("origin") == "EWR", pl$col("tailnum") > 2)$
    collect()
})
#> Error: Execution halted with the following contexts
#> 0: In R: in $collect():
#> 1: Encountered the following error in Rust-Polars:
#> cannot compare string with numeric data
#> Timing stopped at: 3.22 0.34 3.1
Looks like `pl$col("tailnum")$str$to_lowercase()` interferes with the internal schema detection. Need to check how this behaves in py-polars.
I think it is an upstream issue, as it reproduces in Python.
test.with_columns(pl.col("tailnum").str.to_lowercase()).filter(pl.col("origin") == "EWR", pl.col("tailnum") > 2).collect()
vs.
test_collected.lazy().with_columns(pl.col("tailnum").str.to_lowercase()).filter(pl.col("origin") == "EWR", pl.col("tailnum") > 2).collect()
https://github.com/pola-rs/polars/issues/14808
Solved upstream