pins-r
pins-r copied to clipboard
support manifest-files for `board_url()`
fix #627
This is just a placeholder for now, as I sketch out some ideas.
The current signature for board_url() is:
board_url <- function(urls, cache = NULL, use_cache_on_failure = is_interactive())
where urls is a named character vector of URLs ...
I know the proposal was to add a manifest argument - having made some sketches, it seems to me that this would be a complementary argument to urls, i.e. you would use one or the other. That being the case, would you consider overloading the urls argument?
- unnamed character scalar, i.e. a single URL: this would be the URL to the manifest file.
- named character vector, i.e. what's there now: no change in behavior (but no support for versions)
- named list where elements are unnamed character vectors: this would be supplying the manifest directly, including support for versions.
I am also thinking to provide a first version of a manifest-writing function. I can provide one for board_folder(), but I don't know enough (anything, really) about the other back ends to write for any of them.
I could be getting way ahead of myself here, but I had another look at pins-python's board_urls():
def board_urls(path: str, pin_paths: dict, cache=DEFAULT, allow_pickle_read=None):
This seems closer to being able to support a manifest: path could point to a url directory with a pins.txt manifest, and pin_paths could be left unspecified.
I don't know how you might value R <=> Python correspondence vs. backwards compatibility, but I suspect you have to support backwards compatibility.
There does appear to be an escape hatch, though - but it might be a cheap trick. Given that the boards are named a little bit differently: board_url() vs. board_urls(), could there be a new R board: board_urls(), which would support a manifest and versioning, and hew more closely to its Python counterpart?
The idea of a new board_urls() is tempting but I know we didn't end up with a difference there on purpose 😆 so it might be better in the long run to make a Python board_url() and gradually deprecate the existing Python function. @machow thoughts?
Now that I am looking at this more closely, I think that adding a manifest argument runs into the same problem of dependencies between arguments that I brought up in #649 so if we decide to keep board_url() only, I am in favor of:
- unnamed character scalar, i.e. a single URL: this would be the URL to the manifest file.
- named character vector, i.e. what's there now: no change in behavior (but no support for versions)
- named list where elements are unnamed character vectors: this would be supplying the manifest directly, including support for versions.
Super!
I will rework things along these lines, and keep an eye open for ideas from @machow
so it might be better in the long run to make a Python board_url() and gradually deprecate the existing Python function
Yesss -- sorry, the name difference between R and python pins is one I should have nipped in the bud! 😅
I've got a first draft of the "reading" part of the work - a board_url() can take an URL that points to a manifest file, and support versions:
library("pins")
# looks for pins.txt
url <- "https://raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/"
url_board <- board_url(url)
pin_list(url_board)
#> [1] "mtcars-csv" "mtcars-json"
pin_versions(url_board, "mtcars-json")
#> # A tibble: 2 × 3
#> version created hash
#> <chr> <dttm> <chr>
#> 1 20220811T155803Z-c2702 2022-08-11 10:58:03 c2702
#> 2 20220811T155805Z-c2702 2022-08-11 10:58:05 c2702
pin_meta(url_board, "mtcars-json")
#> List of 11
#> $ file : chr "mtcars-json.json"
#> $ file_size : 'fs_bytes' int 4.05K
#> $ pin_hash : chr "c2702fb156ba5c38"
#> $ type : chr "json"
#> $ title : chr "mtcars-json: a pinned 32 x 11 data frame"
#> $ description: NULL
#> $ created : POSIXct[1:1], format: "2022-08-11 10:58:05"
#> $ api_version: num 1
#> $ user : list()
#> $ name : chr "mtcars-json"
#> $ local :List of 4
#> ..$ dir : 'fs_path' chr "~/Library/Caches/pins/url/2bc8541a31222efa41742167e6878a1e"
#> ..$ url : chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155805Z-c2702/"
#> ..$ version : NULL
#> ..$ file_url: chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155805Z-c2"| __truncated__
pin_meta(url_board, "mtcars-json", version = "20220811T155803Z-c2702")
#> List of 11
#> $ file : chr "mtcars-json.json"
#> $ file_size : 'fs_bytes' int 4.05K
#> $ pin_hash : chr "c2702fb156ba5c38"
#> $ type : chr "json"
#> $ title : chr "mtcars-json: a pinned 32 x 11 data frame"
#> $ description: NULL
#> $ created : POSIXct[1:1], format: "2022-08-11 10:58:03"
#> $ api_version: num 1
#> $ user : list()
#> $ name : chr "mtcars-json"
#> $ local :List of 4
#> ..$ dir : 'fs_path' chr "~/Library/Caches/pins/url/cfcb2a81d2d44dc0e81e85175589fb52"
#> ..$ url : chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155803Z-c2702/"
#> ..$ version : NULL
#> ..$ file_url: chr "https:/raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-json/20220811T155803Z-c2"| __truncated__
pin_read(url_board, "mtcars-json") |> head()
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Created on 2022-09-15 by the reprex package (v2.0.1)
I wanted to give you a chance to react - there's still a lot to do in terms of documentation, testing, error messages, and also to propose a manifest-writing helper function.
Thanks!
I have added some error-handling:
library("pins")
github_raw <- function(x) paste0("https://raw.githubusercontent.com/", x)
# failed request
board_url("https://not_real_url.posit.co")
#> Error in `board_url()`:
#> ! Error requesting manifest-file from URL
#> <https://not_real_url.posit.co>:
#> Received HTTP code 502 from proxy after CONNECT
# file not found
board_url(github_raw("rstudio/pins-r/master/tests/testthat/pin-rds/"))
#> Error in `board_url()`:
#> ! Error requesting manifest-file from URL
#> <https://raw.githubusercontent.com/rstudio/pins-r/master/tests/testthat/pin-rds/pins.txt>:
#> Not Found (HTTP 404).
# file not parsable YAML
board_url(github_raw("ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-csv/20220811T155805Z-48c73/mtcars-csv.csv"))
#> Error in `board_url()`:
#> ! Error parsing manifest-file at URL
#> <https://raw.githubusercontent.com/ijlyttle/pinsManifest/main/tests/testthat/pins/mtcars-csv/20220811T155805Z-48c73/mtcars-csv.csv>:
#> Parser error: did not find expected <document start> at line 1, column 6
#> ℹ Manifest file must be text and parsable as YAML.
# not supported format
board_url(3)
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` is a "numeric".
# unnamed list
board_url(list("a", "b"))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a list:
#> ℹ - named: FALSE
#> ℹ - all values character: TRUE
# list values not all character
board_url(list(a = "a", b = 2))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a list:
#> ℹ - named: TRUE
#> ℹ - all values character: FALSE
# unnamed character vector
board_url(c("a", "b"))
#> Error in `board_url()`:
#> ! `urls` must resolve to either:
#> • unnamed character scalar, i.e. a URL
#> • named character vector
#> • named list, where all elements are character scalars or vectors
#> ℹ `urls` resolves to a character vector, but is unnamed.
Created on 2022-09-18 by the reprex package (v2.0.1)
proposed function pin_manifest(), which can write mainfest files for board_folder() - if accepted, could expand to board_s3(), ...
library("pins")
board <- board_temp()
pin_write(board, mtcars, "mtcars-csv", type = "csv")
#> Creating new version '20220919T194404Z-48c73'
#> Writing to pin 'mtcars-csv'
pin_write(board, mtcars, "mtcars-json", type = "json")
#> Creating new version '20220919T194404Z-c2702'
#> Writing to pin 'mtcars-json'
pin_manifest(board)
#> Manifest file written to root-folder of board, as 'pins.txt'.
fs::path(board$path, "pins.txt") %>% readLines() %>% cat(sep = "\n")
#> mtcars-csv:
#> - mtcars-csv/20220919T194404Z-48c73/
#> mtcars-json:
#> - mtcars-json/20220919T194404Z-c2702/
Created on 2022-09-19 by the reprex package (v2.0.1)
Created on 2022-09-19 by the reprex package (v2.0.1)
This is looking good, and I think is headed the right direction! 🙌
I would like to see the pin_manifest() function more like pin_write() in that it is a regular function that calls methods for a specific board internally. Perhaps what is currently in pin_manifest_internal() mostly becomes the manifest creation function, and then we have a generic for pins_manifest_uploader. Maybe something like S3 can reuse the s3_upload_yaml() function we already have. That is more consistent with the current interface and also I think will let us more easily extend this and/or consider other arguments, like some that renv::snapshot() has such as prompt or update.
How confident are you that using a new yaml filename is the right call, vs. making a board-level data.txt, in addition to the pin-level data.txt files? Using the name of the file to decide whether we are looking at a board or a pin seems slightly iffy to me maybe?
FYI the legacy API had support for a manifest, and this led Javier to want to work on a data.txt spec. I don't think we are going to pursue that, but I do think it's good to look over that older work and think about how this problem has come up multiple times (obviously a need!).
I need to think a bit more about the form of the function (your first point) before saying anything, but I did want to offer some rationale for why I proposed pins.txt.
I interpreted the name data.txt as a file to describe data, thus proposing pins.txt as a file to describe pins. I also see that data.txt is a reserved name, I didn't want to conflict with that.
For example, the current behavior of board_url() is to treat a directory that has a data.txt file as the directory of a pin-version:
https://github.com/rstudio/pins-r/blob/5c4beade1887200e72f290be1e0b08f4842daa0a/R/board_url.R#L83-L91
So if I called:
# incorrectly supplying the path to a board, rather than to a pin-version
board_url(c(file = "http://pins.com/path/to/board/"))
If data.txt were the name of the pins-manifest, I'd get some weird behavior when read_meta() tried to read such a file.
If it were named pins.txt, I imagine I'd get a "cannot find data.txt" error, which could be easier to reason about.
Update: Reading more about Javier's data.txt spec - this looks really neat; I'll need some more coffee to wrap my head around this particular question.
I think we were/are on the same page for extending pins_manifest() for other boards; I also think I got too cute with the implementation. Hopefully this is closer to what you have in mind :)
At some point, I'll want to propose tests - but I don't have accounts on these platforms. I imagine the CI does, though.
Leave this alive (mostly dead?) for the time being?
Yeah @ijlyttle let's leave this open and draft for now. 👍
This is all implemented in #661 and #681.
Big thanks to @juliasilge and @machow for being so patient and generous with your talents!
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.