duckplyr lost integer64 support
Did duckplyr lose support for bit64::integer64()?
It used to work at some point in the past.
Reprex with duckplyr 1.0.0, duckdb 1.1.3-2
duckplyr::duckdb_tibble(x = bit64::as.integer64(10))
#> The duckplyr package is configured to fall back to dplyr when it encounters an
#> incompatibility. Fallback events can be collected and uploaded for analysis to
#> guide future development. By default, data will be collected but no data will
#> be uploaded.
#> ℹ Automatic fallback uploading is not controlled and therefore disabled, see
#> `?duckplyr::fallback()`.
#> ✔ Number of reports ready for upload: 1.
#> → Review with `duckplyr::fallback_review()`, upload with
#> `duckplyr::fallback_upload()`.
#> ℹ Configure automatic uploading with `duckplyr::fallback_config()`.
#> Error in `duckplyr::duckdb_tibble()`:
#> ! Can't convert columns of class <integer64> to relational. Affected
#> column: `x`.
Created on 2025-02-12 with reprex v2.1.1
Thanks, interesting. Support was never intentional, AFAICT. We might pass it through with #162; proper support with full compatibility will be challenging.
What should be supported with an integer64 column in duckplyr?
In my dream world, every operation that is supported by bit64 is supported by duckplyr: duckdb's BIGINT should be mapped to R's bit64::integer64, and vice versa.
But more to the point, my current specific use cases are:
- I process data that includes a primary key column which is BIGINT (and the actual range goes beyond INT32, so I can't simply downcast it). I need the column to join to other data sets. I am hesitant to simply convert it to a floating-point format, since I am unsure about the ramifications when performing equality joins (I seem to remember that equality checks for floating-point data are kinda fishy).
- I perform sum aggregation on integer columns, whose results might be too big for INT32. (I guess in this case converting to floating point might be a suitable workaround though.)
Note that currently there is also a difference in behavior between dplyr and duckplyr when an integer overflow is produced: dplyr warns, while duckplyr errs. (should this be a separate issue?)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tibble(x=!!.Machine$integer.max) |>
mutate(
y = x + 1L
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `y = x + 1L`.
#> Caused by warning in `x + 1L`:
#> ! NAs produced by integer overflow
#> # A tibble: 1 × 2
#> x y
#> <int> <int>
#> 1 2147483647 NA
duckplyr::duckdb_tibble(x=!!.Machine$integer.max, .prudence = "stingy") |>
mutate(
y = x + 1L
)
#> # A duckplyr data frame: 2 variables
#> Error: Error evaluating duckdb query: Out of Range Error: Overflow in addition of INT32 (2147483647 + 1)!
Created on 2025-02-23 with reprex v2.1.1
Re overflow error: I think R should be failing and duckplyr has it right 🙃. This is an inconsistency we could list in vignette("limits").
As long as there is no other integer that has the same floating-point representation, equality comparison should be fine. I believe this is true if the highest 11 or so bits of your integer values are zero. If you need the full range of 64-bit integers, I agree that converting to floating-point values isn't great.
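To illustrate the limit of exact integer representation in doubles, here is a base-R sketch (not duckplyr-specific): every integer up to 2^53 has a unique double representation, but beyond that, neighbouring integers start to collide.

```r
# Doubles represent every integer exactly up to 2^53; beyond that,
# neighbouring integers collapse onto the same value.
x <- 2^53

x - 1 == x  # FALSE: 2^53 - 1 is still exactly representable
x + 1 == x  # TRUE: 2^53 + 1 rounds back to 2^53, so an equality
            # join on such keys could match the wrong rows
```

So as long as your key values stay below 2^53 in magnitude, equality joins on the double representation are safe; a full-range int64 key is not.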
sum() for int32 values will always be returned as numeric by duckplyr.
Once the data is in DuckDB, you can use it. It comes back as a numeric:
duckplyr::read_sql_duckdb(
"SELECT *
FROM (VALUES
(9223372036854775807), -- Maximum int64 value
(-9223372036854775808), -- Minimum int64 value
(2147483648), -- Just above int32 max
(-2147483649), -- Just below int32 min
(1152921504606846976) -- Large positive int64 value
) AS t(large_number)"
)
#> # A duckplyr data frame: 1 variable
#> large_number
#> <dbl>
#> 1 9.22e18
#> 2 -9.22e18
#> 3 2.15e 9
#> 4 -2.15e 9
#> 5 1.15e18
Created on 2025-02-23 with reprex v2.1.1
Can we implement extensible and fast custom ingestion for arbitrary R data types?
Some problems come up when using external int64 data with duckplyr::read_parquet_duckdb():
From your statement "Once the data is in DuckDB, you can use it.", I would have expected that problems can only arise when converting from the duckdb world to the R world. However, duckplyr::read_parquet_duckdb() seems not to treat int64 correctly either.
The following reprex checks three methods: arrow, duckdb, and duckplyr. arrow and duckdb are correct; duckplyr gives an inconsistent result:
library(dplyr)
df <-
tibble(
x = bit64::as.integer64("485810100016468796")
) |>
mutate(
chr = as.character(x)
)
tempfile <- withr::local_tempfile()
df |>
arrow::write_parquet(tempfile)
# arrow works
arrow::read_parquet(tempfile) |>
mutate(
chr2 = as.character(x)
)
#> # A tibble: 1 × 3
#> x chr chr2
#> <int64> <chr> <chr>
#> 1 4.e17 485810100016468796 485810100016468796
# duckdb works as well
con <- duckdb::dbConnect(duckdb::duckdb(bigint = "integer64"))
duckdb::duckdb_register_arrow(con, "df", arrow::open_dataset(tempfile))
df_duckdb <- tbl(con, "df")
df_duckdb |>
mutate(
chr2 = as.character(x)
)
#> # Source: SQL [?? x 3]
#> # Database: DuckDB 1.3.3-dev231 [6999schneiderbauer@Windows 10 x64:R 4.5.1/:memory:]
#> x chr chr2
#> <int64> <chr> <chr>
#> 1 4.e17 485810100016468796 485810100016468796
DBI::dbDisconnect(con)
# duckplyr has problems
duckplyr::read_parquet_duckdb(tempfile) |>
mutate(
chr2 = as.character(x)
) |>
collect()
#> # A tibble: 1 × 3
#> x chr chr2
#> <dbl> <chr> <chr>
#> 1 4.86e17 485810100016468796 485810100016468800
Created on 2025-10-21 with reprex v2.1.1
Thanks. The conversion takes place in R, as seen below.
We'd need casting operators accessible through dd::...(). So far, I haven't managed; there might well be a way.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <-
tibble(
x = bit64::as.integer64("485810100016468796")
) |>
mutate(
chr = as.character(x)
)
tempfile <- withr::local_tempfile()
df |>
arrow::write_parquet(tempfile)
# duckplyr has problems
duckplyr::read_parquet_duckdb(tempfile, prudence = "stingy") |>
mutate(
chr2 = as.character(x)
) |>
collect()
#> Error in `mutate()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#> stingy duckplyr frame.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#> continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
#> Caused by error in `mutate()`:
#> ! Can't translate function `as.character()`.
Created on 2025-10-26 with reprex v2.1.1
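Until duckplyr can pass int64 through, one workaround is the approach already shown above with arrow: read the Parquet file with duckdb directly and map BIGINT to bit64::integer64. A minimal sketch, assuming the duckdb and bit64 packages are installed ('data.parquet' is a placeholder path):

```r
library(DBI)

# Map DuckDB BIGINT columns to bit64::integer64 instead of double.
con <- dbConnect(duckdb::duckdb(bigint = "integer64"))

# DuckDB's read_parquet() table function reads the file inside DuckDB,
# so the int64 values never pass through R's double type.
df <- dbGetQuery(con, "SELECT * FROM read_parquet('data.parquet')")

dbDisconnect(con)
```

The resulting column has class integer64, so as.character() and equality joins keep full precision on the R side.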