cookbook-rpolars
Unfairness in benchmarks
Hello!
First, I'd like to say thanks for a great book and for bringing knowledge about polars to the R community. I do have a concern about the benchmarks in the "From an R object" section, though.
Currently you pre-initialize the polars object before running your query, while the data.frame is not converted to data.table, duckdb, or arrow ahead of time. https://github.com/ddotta/cookbook-rpolars/blob/e1374f9ea2ae89d177f175d61c3d22a290438cb5/book/content/benchmarking/_from_r_object.qmd#L14-L17
One could argue that DataMultiTypes_pl is no more of an R object than a duckdb connection, as both are external references and can't be directly serialized to RDS. Creating a data.table object also takes additional time (albeit negligible compared to polars and duckdb).
So I propose either starting all benchmarks from a base data.frame or pre-initializing all objects and connections.
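To make the two options concrete, here is a rough sketch of what each setup could look like. It is not the book's code: the `df` data set, its column names, and the filter are made up for illustration, and it assumes the installed polars version exposes `as_polars_df()` and an `as.data.frame()` method for polars data frames.

```r
library(polars)
library(DBI)
library(duckdb)
library(data.table)
library(microbenchmark)

# Hypothetical stand-in for the book's data; names are illustrative only.
set.seed(1)
df <- data.frame(id = seq_len(1e5), value = runif(1e5))

# Option A: every engine starts from the base data.frame, so the
# conversion / registration cost is part of the timed call.
from_df <- microbenchmark(
  polars = as.data.frame(as_polars_df(df)$filter(pl$col("value") > 0.5)),
  duckdb = {
    con_a <- dbConnect(duckdb())
    duckdb_register(con_a, "df_view", df)
    res <- dbGetQuery(con_a, "SELECT * FROM df_view WHERE value > 0.5")
    dbDisconnect(con_a, shutdown = TRUE)
    res
  },
  data.table = as.data.frame(as.data.table(df)[value > 0.5]),
  times = 10
)

# Option B: all objects and connections are created up front, so only
# the query (plus conversion of the result back to R) is timed.
df_pl <- as_polars_df(df)
df_dt <- as.data.table(df)
con_b <- dbConnect(duckdb())
duckdb_register(con_b, "df_view", df)

pre_init <- microbenchmark(
  polars = as.data.frame(df_pl$filter(pl$col("value") > 0.5)),
  duckdb = dbGetQuery(con_b, "SELECT * FROM df_view WHERE value > 0.5"),
  data.table = as.data.frame(df_dt[value > 0.5]),
  times = 10
)

dbDisconnect(con_b, shutdown = TRUE)

print(from_df)
print(pre_init)
```

Either option treats the engines the same way; the current mix (polars pre-initialized, the others not) is what makes the comparison uneven.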
In my testing I also found that polars has substantial initialization overhead compared to duckdb, which moves it down in the rankings if initialization happens inside the timed call.
I agree. DuckDB has the ability to query directly against R data frames (and arrow Tables), but nothing else. So in general, DuckDB tends to be the fastest for querying R data frames.