
[R] `Invalid metadata$r` warning when feeding parquet file into dplyr

Open rkrug opened this issue 1 year ago • 2 comments

Describe the bug, including details regarding any error messages, version, and platform.

Hi, I have a Parquet file (https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0) that triggers an `Invalid metadata$r` warning. Everything seems to work fine, but the warning is annoying.

The file was written from R as part of a partitioned dataset, and the warning occurs with other files as well. The code and the link to the file are at the end.

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       macOS Sonoma 14.4
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2024-03-08
 pandoc   3.1.12.2 @ /opt/homebrew/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version  date (UTC) lib source
 arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
 assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
 bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
 cachem        1.0.8    2023-05-01 [1] CRAN (R 4.3.0)
 cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
 devtools      2.4.5    2022-10-11 [1] CRAN (R 4.3.0)
 digest        0.6.34   2024-01-11 [1] CRAN (R 4.3.1)
 ellipsis      0.3.2    2021-04-29 [1] CRAN (R 4.3.0)
 fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
 glue          1.7.0    2024-01-09 [1] CRAN (R 4.3.1)
 htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.4    2023-12-06 [1] CRAN (R 4.3.1)
 httpuv        1.6.14   2024-01-26 [1] CRAN (R 4.3.1)
 jsonlite      1.8.8    2023-12-04 [1] CRAN (R 4.3.1)
 later         1.3.2    2023-12-06 [1] CRAN (R 4.3.1)
 lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
 magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
 memoise       2.0.1    2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12     2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1  2018-05-18 [1] CRAN (R 4.3.0)
 pkgbuild      1.4.3    2023-12-10 [1] CRAN (R 4.3.1)
 pkgload       1.3.4    2024-01-16 [1] CRAN (R 4.3.1)
 profvis       0.3.8    2023-05-02 [1] CRAN (R 4.3.0)
 promises      1.2.1    2023-08-10 [1] CRAN (R 4.3.0)
 purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.12   2024-01-09 [1] CRAN (R 4.3.1)
 remotes       2.4.2.1  2023-07-18 [1] CRAN (R 4.3.0)
 rlang         1.1.3    2024-01-10 [1] CRAN (R 4.3.1)
 sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
 shiny         1.8.0    2023-11-17 [1] CRAN (R 4.3.1)
 stringi       1.8.3    2023-12-11 [1] CRAN (R 4.3.1)
 stringr       1.5.1    2023-11-14 [1] CRAN (R 4.3.1)
 tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1    2021-11-30 [1] CRAN (R 4.3.0)
 usethis       2.2.3    2024-02-19 [1] CRAN (R 4.3.1)
 vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
 xtable        1.8-4    2019-04-21 [1] CRAN (R 4.3.0)

 [1] /Users/rainerkrug/R/library/aarch64-apple-darwin20/4.3
 [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

─────────────
arrow::write_dataset(
    data,
    path = arrow_dir,
    partitioning = "publication_year",
    format = "parquet",
    existing_data_behavior = "overwrite"
)
> arrow::open_dataset("./data/corpus/publication_year=1500/part-0.parquet") |> dplyr::group_by(author_abbr)
FileSystemDataset (query)
id: string
author: string
ab: string
doi: string
topics: string
author_abbr: string

* Grouped by author_abbr
See $.data for the source Arrow object
Warning message:
Invalid metadata$r 
> 

The Parquet file can be downloaded from: https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0

Component(s)

R

rkrug • Mar 08 '24 13:03

Thanks @rkrug, I was able to reproduce the error with the data you provided. I'll have a look soon and report back.

amoeba • Mar 08 '24 19:03