Why does a query error take so long to appear on data from `scan_*`?

Open etiennebacher opened this issue 1 year ago • 3 comments

One advantage of a LazyFrame is that the query should be checked before any computation. In the example below, I deliberately perform an invalid operation: comparing a character column (tailnum) to a numeric value. This should raise an error before any computation takes place.

However, the error takes a long time (~18 sec) to appear when I run the query on the LazyFrame created by pl$scan_parquet(). For comparison, I made a second LazyFrame by fetch()ing all the rows and then calling as_polars_lf(). The same query then errors much faster:

library(polars)
options(polars.do_not_repeat_call = TRUE)

# create large parquet file

parquet_dest <- tempfile(fileext = ".parquet")
large_data <- data.table::rbindlist(rep(list(nycflights13::flights), 100))
dim(large_data)
#> [1] 33677600       19
pl$DataFrame(large_data)$write_parquet(parquet_dest)

remove(large_data)
gc()
#>           used (Mb) gc trigger   (Mb)  max used   (Mb)
#> Ncells  785809 42.0    1434750   76.7   1434750   76.7
#> Vcells 6534825 49.9  428124730 3266.4 512235533 3908.1

# scan this parquet file

test <- pl$scan_parquet(parquet_dest)

system.time({
  test$with_columns(
    pl$col("tailnum")$str$to_lowercase(),
    bar = pl$col("tailnum")$str$starts_with("N8")
  )$
    filter(pl$col("origin") == "EWR", pl$col("tailnum") > 2)$
    collect()
})
#> Error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    1: Encountered the following error in Rust-Polars:
#>          cannot compare string with numeric data
#> Timing stopped at: 29.86 40.78 17.82

# fetch all rows and remake a LazyFrame
test2 <- test$fetch(33677600) |> 
  as_polars_lf()

system.time({
  test2$with_columns(
    pl$col("tailnum")$str$to_lowercase(),
    bar = pl$col("tailnum")$str$starts_with("N8")
  )$
    filter(pl$col("origin") == "EWR", pl$col("tailnum") > 2)$
    collect()
})
#> Error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    1: Encountered the following error in Rust-Polars:
#>          cannot compare string with numeric data
#> Timing stopped at: 3.22 0.34 3.1
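
To make the expectation concrete, here is a toy sketch in plain Python of "check the plan against the schema first, load data second". It is not how Polars is implemented: the names ToyLazyFrame, filter_gt, check_comparison and SchemaError are all made up for illustration, and the schema is just a dict of column name to a dtype string.

```python
# Toy model only (NOT Polars internals): validate a lazy query plan against
# the schema before any data is loaded. All names here are hypothetical.

class SchemaError(TypeError):
    """Raised when a planned operation is invalid for the declared dtypes."""


def check_comparison(schema, column, literal):
    """Raise SchemaError if `column`'s dtype cannot be compared with `literal`."""
    dtype = schema[column]
    is_number = isinstance(literal, (int, float)) and not isinstance(literal, bool)
    if dtype == "str" and is_number:
        raise SchemaError("cannot compare string with numeric data")
    if dtype in ("int", "float") and isinstance(literal, str):
        raise SchemaError("cannot compare numeric with string data")


class ToyLazyFrame:
    def __init__(self, schema, load_rows):
        self.schema = schema          # cheap metadata: column name -> dtype
        self._load_rows = load_rows   # expensive callable (e.g. reads a file)
        self._filters = []            # recorded plan: (column, literal) pairs

    def filter_gt(self, column, literal):
        """Record a `column > literal` filter; nothing is executed yet."""
        new = ToyLazyFrame(self.schema, self._load_rows)
        new._filters = self._filters + [(column, literal)]
        return new

    def collect(self):
        # Step 1: validate the whole plan using the schema only.
        for column, literal in self._filters:
            check_comparison(self.schema, column, literal)
        # Step 2: only now pay the cost of loading and filtering the rows.
        rows = self._load_rows()
        for column, literal in self._filters:
            rows = [row for row in rows if row[column] > literal]
        return rows


if __name__ == "__main__":
    def load_rows():
        print("loading data (expensive step)")
        return [{"tailnum": "N14228", "dep_delay": 2}]

    lf = ToyLazyFrame({"tailnum": "str", "dep_delay": "int"}, load_rows)
    try:
        lf.filter_gt("tailnum", 2).collect()
    except SchemaError as err:
        print("rejected before loading:", err)
```

In this sketch the invalid comparison is rejected in step 1, so the expensive loader is never called. That is the behaviour I expected from the scanned parquet file, where the column types are available from the file metadata.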

etiennebacher, Feb 18 '24 16:02

It looks like pl$col("tailnum")$str$to_lowercase() interferes with the internal schema detection. I need to check how this behaves in py-polars.

etiennebacher, Feb 18 '24 16:02

I think it is an upstream issue, as it reproduces in Python.

test.with_columns(pl.col("tailnum").str.to_lowercase()).filter(pl.col("origin") == "EWR", pl.col("tailnum") > 2).collect()

vs.

test_collected.lazy().with_columns(pl.col("tailnum").str.to_lowercase()).filter(pl.col("origin") == "EWR", pl.col("tailnum") > 2).collect()

eitsupi, Feb 19 '24 15:02

https://github.com/pola-rs/polars/issues/14808

etiennebacher, Mar 01 '24 14:03

Solved upstream

etiennebacher, May 19 '24 07:05