
Support `hive_types` arg in open_dataset

Open · grantmcdermott opened this issue 2 years ago • 2 comments

Just ran into an issue where the Hive partition types are not being registered correctly out of the gate. In particular, integer or numeric folder divisions like month=1 are being interpreted as characters (month -> "1"), which limits the ability to filter effectively.

An example, using the nyc-taxi-tiny dataset from Voltron Data:

library(duckdbfs)
library(dplyr)

duckdb_s3_config()
taxi = open_dataset("s3://voltrondata-labs-datasets/nyc-taxi-tiny")

# Assuming the partitions are numeric doesn't work
taxi |> filter(year<=2010 & month<3) |> head(2) |> collect()

#' Error in `collect()`:
#' ! Failed to collect lazy table.
#' Caused by error:
#' ! rapi_execute: Failed to run query
#' Error: Conversion Error: Could not convert string "10" to DECIMAL(2,1)
#' Traceback:
#' 
#' 1. collect(head(filter(taxi, year <= 2010 & month < 3), 2))
#' 2. collect.tbl_sql(head(filter(taxi, year <= 2010 & month < 3), 
#'  .     2))
#' 3. tryCatch(out <- db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete), 
#'  .     error = function(cnd) {
#'  .         cli_abort("Failed to collect lazy table.", parent = cnd)
#'  .     })
#' 4. tryCatchList(expr, classes, parentenv, handlers)
#' 5. tryCatchOne(expr, names, parentenv, handlers[[1L]])
#' 6. value[[3L]](cond)
#' 7. cli_abort("Failed to collect lazy table.", parent = cnd)
#' 8. rlang::abort(message, ..., call = call, use_cli_format = TRUE, 
#'  .     .frame = .frame)
#' 9. signal_abort(cnd, .file)

# Character coercion does
taxi |> filter(year %in% 2009:2010 & month %in% 1:2) |> head(2) |> collect()

#' # A tibble: 2 × 24
#'   vendor_name pickup_datetime     dropoff_datetime    passenger_count
#'   <chr>       <dttm>              <dttm>                        <dbl>
#' 1 VTS         2009-01-05 18:57:00 2009-01-05 19:04:00               5
#' 2 VTS         2009-01-28 12:01:00 2009-01-28 12:04:00               1
#' # ℹ 20 more variables: trip_distance <dbl>, pickup_longitude <dbl>,
#' #   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,
#' #   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,
#' #   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,
#' #   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,
#' #   congestion_surcharge <dbl>, pickup_location_id <dbl>,
#' #   dropoff_location_id <dbl>, month <chr>, year <chr>
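
In the meantime, another workaround might be to cast the partition columns explicitly inside the filter. This is an untested sketch; it assumes dbplyr translates as.integer() into a CAST for the DuckDB backend:

# Untested sketch: cast the character partition columns before comparing,
# so the filter happens on integers rather than strings
taxi |>
  filter(as.integer(year) <= 2010 & as.integer(month) < 3) |>
  head(2) |>
  collect()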


Just to preemptively note that this issue has been addressed upstream in the upcoming DuckDB 0.9.0 release (see: PR and docs). So it would be nice to add this as an explicit argument to open_dataset().
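
For reference, a minimal sketch of what that looks like at the raw DuckDB SQL level via DBI, based on the 0.9.0 hive_types docs mentioned above (the S3 region setting and glob pattern here are assumptions about this public bucket, and none of this goes through duckdbfs yet):

# Hedged sketch of the underlying DuckDB 0.9.0 feature via DBI (not duckdbfs)
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
dbExecute(con, "SET s3_region = 'us-east-1';")  # assumption: public bucket, no credentials needed

dbGetQuery(con, "
  SELECT count(*)
  FROM read_parquet(
    's3://voltrondata-labs-datasets/nyc-taxi-tiny/**/*.parquet',
    hive_partitioning = true,
    hive_types = {'year': INTEGER, 'month': INTEGER}
  )
  WHERE year <= 2010 AND month < 3
")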

grantmcdermott avatar Sep 19 '23 19:09 grantmcdermott

Ok this is cool, thanks for the heads up! It wasn't clear to me from the PR: is the auto_guess enabled by default, does it need a flag, or was that behavior dropped?

In arrow, I believe hive types have always been characters, though arrow gives you the option to specify the overall schema (which means it can attempt to coerce types to something other than the parquet encoding, which can be handy in some edge cases!). duckdbfs::open_dataset erroneously has a schema argument, but it doesn't do anything! It should probably be dropped unless that becomes a duckdb feature. I don't think arrow lets you specify the types of only the hive partition while taking the rest from the parquet, though.

Honestly I kinda wish this information could remain encoded in the parquet partitions and read from there (even retaining the column in the parquet files themselves, since with compression the increased storage cost of a column with the same value throughout should be tiny). Having data types attached to the data, rather than having to be handled manually by whoever reads the data in, seems ideal! (I have sometimes hacked just this by generating hive partitions 'manually' in a loop, no doubt ill-advised.)

cboettig avatar Sep 20 '23 04:09 cboettig

> Honestly I kinda wish this information could remain encoded in the parquet partitions and read from there (even retaining the column in the parquet files themselves, since with compression the increased storage cost of a column with the same value throughout should be tiny). Having data types attached to the data, rather than having to be handled manually by whoever reads the data in, seems ideal! (I have sometimes hacked just this by generating hive partitions 'manually' in a loop, no doubt ill-advised.)

Strong agree. That's one advantage of going via arrow::open_dataset(DIR) |> to_duckdb() |> ... |> collect() in the equivalent dplyr workflow: it preserves types correctly by referencing the metadata.
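
Something like this, for the record (a minimal sketch; it assumes the public bucket can be opened anonymously via arrow::s3_bucket()):

# Sketch of the arrow -> duckdb route described above
library(arrow)
library(dplyr)

bucket <- s3_bucket("voltrondata-labs-datasets", anonymous = TRUE)
taxi_arrow <- arrow::open_dataset(bucket$path("nyc-taxi-tiny"))

# year and month keep their integer types here, so the numeric filter just works
taxi_arrow |>
  to_duckdb() |>
  filter(year <= 2010 & month < 3) |>
  head(2) |>
  collect()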

grantmcdermott avatar Sep 20 '23 20:09 grantmcdermott