[R] `Invalid metadata$r` warning when feeding parquet file into dplyr
Describe the bug, including details regarding any error messages, version, and platform.
Hi
I have a Parquet file (https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0) that triggers an `Invalid metadata$r` warning when I open it and pass it to dplyr. The query itself seems to work fine, but the warning is annoying.
The file was written from R as part of a partitioned dataset, and the warning also occurs with the other files in that dataset. The code and the link to the file are at the end.
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.3.3 (2024-02-29)
os macOS Sonoma 14.4
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2024-03-08
pandoc 3.1.12.2 @ /opt/homebrew/bin/pandoc
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
arrow * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.0)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.1)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1)
htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.1)
httpuv 1.6.14 2024-01-26 [1] CRAN (R 4.3.1)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
pkgbuild 1.4.3 2023-12-10 [1] CRAN (R 4.3.1)
pkgload 1.3.4 2024-01-16 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.0)
purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.1)
remotes 2.4.2.1 2023-07-18 [1] CRAN (R 4.3.0)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.8.0 2023-11-17 [1] CRAN (R 4.3.1)
stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.1)
stringr 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis 2.2.3 2024-02-19 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
[1] /Users/rainerkrug/R/library/aarch64-apple-darwin20/4.3
[2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
─────────────
arrow::write_dataset(
  data,
  path = arrow_dir,
  partitioning = "publication_year",
  format = "parquet",
  existing_data_behavior = "overwrite"
)
> arrow::open_dataset("./data/corpus/publication_year=1500/part-0.parquet") |> dplyr::group_by(author_abbr)
FileSystemDataset (query)
id: string
author: string
ab: string
doi: string
topics: string
author_abbr: string
* Grouped by author_abbr
See $.data for the source Arrow object
Warning message:
Invalid metadata$r
>
The Parquet file can be downloaded from: https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0
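In case it helps with triage, here is a minimal sketch of how one might check whether a given file carries an `r` entry in its schema metadata, which is what the warning message refers to. The helper name `has_r_metadata` is hypothetical (it is not part of arrow), and the sketch assumes the schema's `$metadata` field returns a named list of key-value pairs.

```r
library(arrow)

# Hypothetical helper (not part of arrow): TRUE if the file's schema
# metadata contains an "r" entry, FALSE otherwise.
has_r_metadata <- function(path) {
  # Assumption: `$schema$metadata` is a named list (or NULL when empty).
  md <- open_dataset(path)$schema$metadata
  "r" %in% names(md)
}
```

With the file from this report it could be called as `has_r_metadata("./data/corpus/publication_year=1500/part-0.parquet")`. This only shows whether the entry exists; it does not tell why arrow considers it invalid.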
Component(s)
R
Thanks @rkrug, I was able to reproduce the error with the data you provided. I'll have a look soon and report back.