
duckplyr lost integer64 support

Open lschneiderbauer opened this issue 9 months ago • 5 comments

Did duckplyr lose support for bit64::integer64()? It used to work at some point in the past.

Reprex with duckplyr 1.0.0, duckdb 1.1.3-2

duckplyr::duckdb_tibble(x = bit64::as.integer64(10))
#> The duckplyr package is configured to fall back to dplyr when it encounters an
#> incompatibility. Fallback events can be collected and uploaded for analysis to
#> guide future development. By default, data will be collected but no data will
#> be uploaded.
#> ℹ Automatic fallback uploading is not controlled and therefore disabled, see
#>   `?duckplyr::fallback()`.
#> ✔ Number of reports ready for upload: 1.
#> → Review with `duckplyr::fallback_review()`, upload with
#>   `duckplyr::fallback_upload()`.
#> ℹ Configure automatic uploading with `duckplyr::fallback_config()`.
#> Error in `duckplyr::duckdb_tibble()`:
#> ! Can't convert columns of class <integer64> to relational. Affected
#>   column: `x`.

Created on 2025-02-12 with reprex v2.1.1

lschneiderbauer avatar Feb 12 '25 15:02 lschneiderbauer

Thanks, interesting. Support was never intentional, AFAICT. We might get pass-through behavior with #162; proper support with full compatibility will be challenging.

What should be supported with an integer64 column in duckplyr?

krlmlr avatar Feb 21 '25 20:02 krlmlr

In my dream world, every operation that bit64 supports is also supported by duckplyr: DuckDB's BIGINT should map to R's bit64::as.integer64(), and vice versa.

More to the point, my current specific use cases are:

  • I process data that includes a primary key column which is BIGINT (and the range actually extends beyond INT32, so I can't simply downcast it). I need this column to join to other data sets, though. I am hesitant to simply convert it to a floating-point format, since I am unsure about the ramifications for equality joins (I seem to remember that equality checks on floating-point data are fishy).
  • I perform sum aggregations on integer columns whose results might be too big for INT32. (I guess converting to floating point might be a suitable workaround here, though.)
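For the join use case, one interim workaround is to carry the key as a character column, since string equality is exact. A minimal base-R sketch with hypothetical toy keys (not the duckplyr API):

```r
# Keys wider than INT32 kept as character strings, so equality is exact
# and no floating-point rounding can collide two distinct keys.
left  <- data.frame(id = c("485810100016468796", "485810100016468797"),
                    a  = 1:2)
right <- data.frame(id = "485810100016468796", b = "x")

# merge() compares the keys as strings -- no rounding involved
joined <- merge(left, right, by = "id")
```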

Note that there is currently also a difference in behavior between dplyr and duckplyr when an integer overflow occurs: dplyr warns, while duckplyr errors. (Should this be a separate issue?)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tibble(x=!!.Machine$integer.max) |> 
  mutate(
    y = x + 1L
  )
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `y = x + 1L`.
#> Caused by warning in `x + 1L`:
#> ! NAs produced by integer overflow
#> # A tibble: 1 × 2
#>            x     y
#>        <int> <int>
#> 1 2147483647    NA

duckplyr::duckdb_tibble(x=!!.Machine$integer.max, .prudence = "stingy") |> 
  mutate(
    y = x + 1L
  )
#> # A duckplyr data frame: 2 variables
#> Error: Error evaluating duckdb query: Out of Range Error: Overflow in addition of INT32 (2147483647 + 1)!

Created on 2025-02-23 with reprex v2.1.1

lschneiderbauer avatar Feb 23 '25 10:02 lschneiderbauer

Re overflow error: I think R should be failing and duckplyr has it right 🙃. This is an inconsistency we could list in vignette("limits").

As long as no other integer has the same floating-point representation, equality comparison should be fine. A double carries a 53-bit significand, so I believe this holds as long as the highest 11 or so bits of your integer values are zero (i.e., magnitudes up to 2^53 are represented exactly). If you need the full range of 64-bit integers, I agree that converting to floating-point values isn't great.
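That exactness boundary is easy to check in base R: above 2^53, adjacent integers collapse to the same double:

```r
# Below 2^53, integer-valued doubles are exact ...
2^52 == 2^52 + 1   # FALSE: both values are representable and distinct
# ... above it, adjacent integers are no longer distinguishable.
2^53 == 2^53 + 1   # TRUE: 2^53 + 1 rounds to 2^53
```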

sum() over int32 values will always be returned as numeric by duckplyr.
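For comparison, base R's sum() stays in integer arithmetic and overflows to NA, so returning numeric sidesteps that (a base-R illustration):

```r
x <- c(.Machine$integer.max, 1L)

# Integer sum overflows: base R warns and returns NA
suppressWarnings(sum(x))
#> [1] NA

# Summing as double is exact here, since the result is far below 2^53
sum(as.numeric(x))
#> [1] 2147483648
```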

Once the data is in DuckDB, you can use it. It comes back as a numeric:

duckplyr::read_sql_duckdb(
  "SELECT *
  FROM (VALUES
      (9223372036854775807),  -- Maximum int64 value
      (-9223372036854775808), -- Minimum int64 value
      (2147483648),           -- Just above int32 max
      (-2147483649),          -- Just below int32 min
      (1152921504606846976)   -- Large positive int64 value
  ) AS t(large_number)"
)
#> # A duckplyr data frame: 1 variable
#>   large_number
#>          <dbl>
#> 1      9.22e18
#> 2     -9.22e18
#> 3      2.15e 9
#> 4     -2.15e 9
#> 5      1.15e18

Created on 2025-02-23 with reprex v2.1.1

Can we implement extensible and fast custom ingestion for arbitrary R data types?

krlmlr avatar Feb 23 '25 13:02 krlmlr

Some problems come up when using external int64 data with duckplyr::read_parquet_duckdb():

From your statement

Once the data is in DuckDB, you can use it.

I would have expected that problems can only arise when converting from the duckdb world to the R world. However, duckplyr::read_parquet_duckdb() does not seem to handle int64 correctly either.

The following reprex checks three methods: arrow, duckdb, and duckplyr. Arrow and duckdb are correct; duckplyr gives an inconsistent result:

library(dplyr)

df <-
  tibble(
    x = bit64::as.integer64("485810100016468796")
  ) |> 
  mutate(
    chr = as.character(x)
  )

tempfile <- withr::local_tempfile()

df |> 
  arrow::write_parquet(tempfile)


# arrow works
arrow::read_parquet(tempfile) |> 
  mutate(
    chr2 = as.character(x)
  )
#> # A tibble: 1 × 3
#>         x chr                chr2              
#>   <int64> <chr>              <chr>             
#> 1   4.e17 485810100016468796 485810100016468796

# duckdb works as well
con <- duckdb::dbConnect(duckdb::duckdb(bigint = "integer64"))
duckdb::duckdb_register_arrow(con, "df", arrow::open_dataset(tempfile))
df_duckdb <- tbl(con, "df")
df_duckdb |> 
  mutate(
    chr2 = as.character(x)
  )
#> # Source:   SQL [?? x 3]
#> # Database: DuckDB 1.3.3-dev231 [6999schneiderbauer@Windows 10 x64:R 4.5.1/:memory:]
#>         x chr                chr2              
#>   <int64> <chr>              <chr>             
#> 1   4.e17 485810100016468796 485810100016468796
DBI::dbDisconnect(con)

# duckplyr has problems
duckplyr::read_parquet_duckdb(tempfile) |> 
  mutate(
    chr2 = as.character(x)
  ) |> 
  collect()
#> # A tibble: 1 × 3
#>         x chr                chr2              
#>     <dbl> <chr>              <chr>             
#> 1 4.86e17 485810100016468796 485810100016468800

Created on 2025-10-21 with reprex v2.1.1

lschneiderbauer avatar Oct 21 '25 10:10 lschneiderbauer

Thanks. The conversion takes place in R, as seen below.

We'd need casting operators accessible through dd::...(). So far I haven't managed to make that work, but there might well be a way.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <-
  tibble(
    x = bit64::as.integer64("485810100016468796")
  ) |>
  mutate(
    chr = as.character(x)
  )

tempfile <- withr::local_tempfile()

df |>
  arrow::write_parquet(tempfile)

# duckplyr has problems
duckplyr::read_parquet_duckdb(tempfile, prudence = "stingy") |>
  mutate(
    chr2 = as.character(x)
  ) |>
  collect()
#> Error in `mutate()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#>   stingy duckplyr frame.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#>   continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
#> Caused by error in `mutate()`:
#> ! Can't translate function `as.character()`.

Created on 2025-10-26 with reprex v2.1.1

krlmlr avatar Oct 26 '25 21:10 krlmlr