
[Benchmarking][R] conbench is failing

Open · kou opened this issue 6 months ago · 9 comments

Describe the bug, including details regarding any error messages, version, and platform.

For example: https://github.com/apache/arrow/pull/46689#issuecomment-2940438967

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

https://conbench.ursa.dev/benchmark-results/0683fc4515eb7b92800052ea91af1f28/

error

Attaching package: ‘arrowbench’

The following object is masked from ‘package:base’:

    %||%

Error: NotImplemented: Unhandled type for Arrow to Parquet schema conversion: decimal64(15, 2)
Execution halted

Component(s)

Benchmarking

kou avatar Jun 05 '25 00:06 kou

@jonkeane @thisisnic @amoeba This may be related only to R.

kou avatar Jun 05 '25 00:06 kou

I think the root cause of this can be seen in the PR which added 32- and 64-bit decimals (https://github.com/apache/arrow/pull/43957). From that description:

Are there any user-facing changes?

Currently, if a user is using decimal(precision, scale) rather than decimal128(precision, scale), they will get a Decimal128Type if the precision is <= 38 (the max precision for Decimal128) and a Decimal256Type if the precision is higher. Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality:

for precisions [1 : 9] => Decimal32Type
for precisions [10 : 18] => Decimal64Type
for precisions [19 : 38] => Decimal128Type
for precisions [39 : 76] => Decimal256Type

While many of our tests currently make the assumption that decimal with a low precision would be Decimal128 and had to be updated, this may cause an initial surprise if users are making the same assumptions.
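
In arrow R terms, that dispatch would look roughly like this (a sketch only, assuming the R bindings follow the same dispatch rules as the C++ change; decimal() is the generic arrow::decimal() constructor):

library(arrow)

# The generic decimal() picks the narrowest type that fits the precision:
decimal(5, 2)   # decimal32(5, 2)   -- precision 1-9
decimal(15, 2)  # decimal64(15, 2)  -- precision 10-18; the type in the error above
decimal(20, 2)  # decimal128(20, 2) -- precision 19-38
decimal(40, 2)  # decimal256(40, 2) -- precision 39-76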

Implementing the Decimal 32 and 64 types in R will fix this. Alternatively, we could update the benchmark code.

thisisnic avatar Jun 05 '25 12:06 thisisnic

Hmm, I thought this would be resolved by implementing those types in #46720, but the benchmarks are still failing on that PR. Now I wonder if we should update the benchmark code instead.

Unsure if the real dependency is actually #45351.

thisisnic avatar Jun 17 '25 08:06 thisisnic

~Does anyone know where I can find the code for the R benchmarks? I've had a look but keep finding repos with lots of text and directories but no obvious way to find the code~ - happy to take on a bit of documentation around this once I know where stuff is.

Edit: never mind, found it: https://github.com/voltrondata-labs/arrowbench/tree/main/R

thisisnic avatar Jun 17 '25 08:06 thisisnic

@jonkeane - I'm trying to work out what is likely triggering the benchmarking error above, but it's really hard to follow; would you mind taking a quick look and pointing me in the right direction? It looks like we are trying to write a decimal64 field to Parquet, which isn't yet fully implemented (in Parquet?), so we either need to implement something or temporarily adjust the benchmarks (maybe just force a high bit-width decimal), but I don't know where this is happening in the code.
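
For illustration, the "force a high bit-width decimal" idea might look something like this (purely a hypothetical sketch, not the actual benchmark fix; it assumes the code building the tables can cast columns before writing):

library(arrow)

# Hypothetical: widen a decimal64 column to decimal128 before writing,
# since Arrow-to-Parquet conversion for decimal64 is not implemented yet.
tab <- arrow_table(price = c(100.25, 200.50))
tab <- tab$cast(schema(price = decimal128(15, 2)))
write_parquet(tab, tempfile(fileext = ".parquet"))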

thisisnic avatar Jun 17 '25 08:06 thisisnic

I haven't dug too closely, but I was able to replicate this with that particular query. This is failing on every TPC-H query, yeah? If so, it's likely something to do with the data generation code.

You can run this locally using the value that's in the script section of the benchmark details (I've copied/pasted it here after removing the JSON cruft of [, ,, and ], and also formatted it with Air, but it's runnable without that!):

library(arrowbench)
out <- run_bm(
    format = "parquet",
    scale_factor = 10,
    engine = "arrow",
    memory_map = FALSE,
    query_id = 5,
    bm = structure(
        list(
            name = "tpch",
            setup = function(
                engine = "arrow",
                query_id = 1:22,
                format = c("native", "parquet"),
                scale_factor = c(1, 10),
                memory_map = FALSE,
                output = "data_frame",
                chunk_size = NULL
            ) {
                engine <- match.arg(
                    engine,
                    c("arrow", "duckdb", "duckdb_sql", "dplyr")
                )
                format <- match.arg(format, c("parquet", "feather", "native"))
                stopifnot(
                    "query_id must be an int" = query_id %% 1 == 0,
                    "query_id must 1-22" = query_id >= 1 & query_id <= 22
                )
                output <- match.arg(output, c("arrow_table", "data_frame"))
                library("dplyr", warn.conflicts = FALSE)
                collect_func <- collect
                if (output == "data_frame") {
                    collect_func <- collect
                } else if (output == "arrow_table") {
                    collect_func <- compute
                }
                con <- NULL
                if (engine %in% c("duckdb", "duckdb_sql")) {
                    con <- DBI::dbConnect(duckdb::duckdb())
                    DBI::dbExecute(
                        con,
                        paste0("PRAGMA threads=", getOption("Ncpus"))
                    )
                }
                BenchEnvironment(
                    input_func = get_input_func(
                        engine = engine,
                        scale_factor = scale_factor,
                        query_id = query_id,
                        format = format,
                        con = con,
                        memory_map = memory_map,
                        chunk_size = chunk_size
                    ),
                    query = get_query_func(query_id, engine),
                    engine = engine,
                    con = con,
                    scale_factor = scale_factor,
                    query_id = query_id,
                    collect_func = collect_func
                )
            },
            before_each = quote({
                result <- NULL
            }),
            run = quote({
                result <- query(input_func, collect_func, con)
            }),
            after_each = quote({
                if (scale_factor %in% c(0.01, 0.1, 1, 10)) {
                    answer <- tpch_answer(scale_factor, query_id)
                    result <- dplyr::as_tibble(result)
                    all_equal_out <- waldo::compare(
                        result,
                        answer,
                        tolerance = 0.01
                    )
                    if (length(all_equal_out) != 0) {
                        warning(paste0("\n", all_equal_out, "\n"))
                        stop("The answer does not match")
                    }
                } else {
                    warning(
                        "There is no validation for scale_factors other than 0.01, 0.1, 1, and 10. Be careful with these results!"
                    )
                }
                result <- NULL
            }),
            teardown = quote({
                if (!is.null(con)) {
                    DBI::dbDisconnect(con, shutdown = TRUE)
                }
            }),
            valid_params = function(params) {
                drop <- (params$engine != "arrow" &
                    params$format == "feather") |
                    (params$engine != "arrow" &
                        params$output == "arrow_table") |
                    (params$engine != "arrow" &
                        params$memory_map == TRUE) |
                    (params$engine == "dplyr" & params$format == "native")
                params[!drop, ]
            },
            case_version = function(params) {
                NULL
            },
            batch_id_fun = function(params) {
                batch_id <- uuid()
                paste0(
                    batch_id,
                    "-",
                    params$scale_factor,
                    substr(params$format, 1, 1)
                )
            },
            tags_fun = function(params) {
                params$query_id <- sprintf("TPCH-%02d", params$query_id)
                if (!is.null(params$output) && params$output == "data_frame") {
                    params$output <- NULL
                }
                params
            },
            packages_used = function(params) {
                c(params$engine, "dplyr", "lubridate")
            }
        ),
        class = "Benchmark"
    ),
    n_iter = 1,
    batch_id = NULL,
    profiling = FALSE,
    global_params = list(cpu_count = NULL, lib_path = "latest"),
    run_id = NULL,
    run_name = NULL,
    run_reason = NULL
)
cat(" ##### RESULTS FOLLOW ")
cat(out$json)
cat(" ##### RESULTS END ")

I got: Error: NotImplemented: Unhandled type for Arrow to Parquet schema conversion: decimal64(15, 2)

Before I ran it, I needed to install arrow (I used our released/CRAN version) along with arrowbench (remotes::install_github("voltrondata-labs/arrowbench")). But ^^^ is self-contained and will do everything it needs to construct the dataset + run the query.

As extra evidence that this is in the data generation code: when I ran this, the data directory in my working directory (which arrowbench uses to keep data) contained a single parquet file, customer, which was 0 bytes (so the write failed). I would have expected 8 files, one for each TPC-H table.
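
A quick way to check that for yourself (a base-R sketch; "data" here stands for the directory arrowbench writes to, per above):

# List the generated Parquet files and their sizes; 8 non-empty files
# (one per TPC-H table) are expected, and a 0-byte file means the write failed.
files <- list.files("data", pattern = "\\.parquet$", recursive = TRUE, full.names = TRUE)
data.frame(file = basename(files), bytes = file.size(files))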

jonkeane avatar Jun 17 '25 12:06 jonkeane

Thanks @jonkeane, that should be everything I need to progress this!

thisisnic avatar Jun 17 '25 13:06 thisisnic

I'm out of time for this week to work on this, so I'm going to have to come back to it next week.

thisisnic avatar Jun 18 '25 11:06 thisisnic

I have a PR up here, though I haven't been able to test it out locally: https://github.com/voltrondata-labs/arrowbench/pull/139

@assignUser - mind taking a look? I think it needs someone with VD credentials to run the CI there.

thisisnic avatar Jun 25 '25 05:06 thisisnic