bowerbird
how to add datasource with source_url without filename in url
Hi,
I am trying to create a function similar to bb_zenodo_source:
https://github.com/ropensci/bowerbird/blob/53d7bcb56c75b10c1f42b418bec870bc0eff44df/R/zenodo.R#L23
I would like to download data from the French data.gouv.fr portal.
Unfortunately, the latest data are accessible through URLs of this form, for example: https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
When I run the sync without the extra parameter force_local_filename, the data is downloaded but does not end up in the root folder. With this parameter, it works well.
I tried to write a custom function that wraps the bb_handler_rget function so that it uses the force_local_filename extra parameter, but without success.
Any idea how to achieve this?
library(tidyverse)
library(fs)
library(bowerbird)
id_data_gouv <- "5e1f20058b4c414d3f94460d"
bb_data_gouv_source <- function(id) {
ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
jx <- jsonlite::fromJSON(paste0("https://www.data.gouv.fr/api/1/datasets/", id_data_gouv))
#collection sizes
csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)), error = function(e) NULL)
id <- id
description <- jx$description
folder <- jx$slug
doc_url <-jx$page
postproc <- list()
files <- paste0(fs::path_ext_remove(jx$resources$title), '.', jx$resources$format)
bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
id = id,
description = ne_or(jx$description, "Dataset description"),
##keywords = ne_or(jx$metadata$keywords, NA_character_),
doc_url = jx$page,
citation = paste0("See ", jx$page, " for the correct citation"), ## seems odd that this isn't part of the record
license = ne_or(jx$license, paste0("See ", doc_url, " for license information")),
source_url = jx$resources$latest, ## list all urls. Does this cover datasets with multiple buckets? (Are there such things?)
method = list("bb_data_gouv_handler_get", files = files),
comment = "Source definition created by bb_zenodo_source",
postprocess = postproc,
collection_size = csize)
}
bb_data_gouv_handler_get <- function(config, verbose = TRUE, files, ...) {
cfrow <- bowerbird:::bb_settings_to_cols(config)
this_flags <- list(...)
this_flags <- c(list(url = cfrow$source_url), list(force_local_filename = files),this_flags, list(verbose = verbose))
do.call(bb_rget, this_flags)
}
src <- bb_data_gouv_source(id_data_gouv)
cf <- bb_config(local_file_root = "~")
cf <- bb_add(cf, src )
status <- bb_sync(cf, create_root = TRUE, verbose = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
Created on 2021-10-12 by the reprex package (v2.0.1)
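The "unused argument" error in the output above ("argument inutilisé" in the French locale) is consistent with a handler that forwards everything in `...` to a downloader that does not accept `local_dir_only`. The following is a minimal base-R sketch of that mechanism, not bowerbird's actual code: `fetch_one`, `handler_forward_all` and `handler_intercept` are hypothetical stand-ins, and the assumption that the sync step calls the handler with `local_dir_only = TRUE` is inferred from the error message.

```r
## Hypothetical downloader: no `...` and no `local_dir_only` argument
fetch_one <- function(url, force_local_filename, verbose = FALSE) {
  paste("would save", url, "as", force_local_filename)
}

## Wrapper in the style of the first attempt: everything in `...` is forwarded
handler_forward_all <- function(url, files, verbose = FALSE, ...) {
  args <- c(list(url = url, force_local_filename = files),
            list(...), list(verbose = verbose))
  do.call(fetch_one, args)
}

## Wrapper that names `local_dir_only` in its signature, so it is never forwarded
handler_intercept <- function(url, files, verbose = FALSE,
                              local_dir_only = FALSE, ...) {
  if (local_dir_only) return(dirname(files))
  fetch_one(url = url, force_local_filename = files, verbose = verbose)
}

## The extra flag reaches fetch_one and triggers an "unused argument" error
res_bad <- tryCatch(
  handler_forward_all("https://example.org/r/abc", "data/a.csv",
                      local_dir_only = TRUE),
  error = function(e) e
)

## The flag is handled by the wrapper itself, so no error occurs
res_dir <- handler_intercept("https://example.org/r/abc", "data/a.csv",
                             local_dir_only = TRUE)
res_get <- handler_intercept("https://example.org/r/abc", "data/a.csv")
```

Naming the argument explicitly in the wrapper keeps it out of `...`, which is why a handler written this way can answer the "where would the files go?" call separately from the actual download.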
Hi @jdlom ... I've got to say, you have done well getting that close! This situation doesn't quite fit the scenario that bowerbird was originally designed for, which makes the solution a bit awkward. And even at the best of times the internals of how this all works are pretty obscure. Anyway, I'd suggest this:
bb_data_gouv_source <- function(id) {
ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
api_url <- paste0("https://www.data.gouv.fr/api/1/datasets/", id)
jx <- jsonlite::fromJSON(api_url)
## collection sizes
csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)), error = function(e) NULL)
bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
id = id,
description = ne_or(jx$description, "Dataset description"),
##keywords = ne_or(jx$metadata$keywords, NA_character_),
doc_url = jx$page,
citation = paste0("See ", jx$page, " for the correct citation"), ## seems odd that this isn't part of the record
license = ne_or(jx$license, paste0("See ", jx$page, " for license information")),
source_url = api_url, ## delay the retrieval of the actual file list until sync time
method = list("bb_handler_data_gouv"),
comment = "Source definition created by bb_data_gouv_source",
postprocess = list(),
collection_size = csize)
}
bb_handler_data_gouv <- function(...) {
bb_handler_data_gouv_inner(...)
}
bb_handler_data_gouv_inner <- function(config, verbose = FALSE, local_dir_only = FALSE, ...) {
## retrieve the list of urls and their associated file names
if (local_dir_only) return(bb_handler_rget(config, verbose = verbose, local_dir_only = TRUE, ...))
target_dir <- sub("[\\/]$", "", bb_data_source_dir(config))
if (!dir.exists(target_dir)) {
ok <- dir.create(target_dir, recursive = TRUE)
if (!ok) {
stop(sprintf("Could not create target directory %s: aborting.\n", target_dir))
}
}
settings <- bowerbird:::save_current_settings()
on.exit({ bowerbird:::restore_settings(settings) })
setwd(target_dir)
ds <- bb_data_sources(config)
jx <- jsonlite::fromJSON(ds$source_url)
files <- paste0(fs::path_ext_remove(jx$resources$title), '.', jx$resources$format)
## make the file names safe - e.g. they cannot have "/" in them
files <- fs::path_sanitize(files)
urls <- jx$resources$latest
all_ok <- TRUE
msg <- c()
downloads <- tibble(url = character(), file = character(), was_downloaded = logical())
for (i in seq_along(urls)) {
dummy <- config
temp <- bb_data_sources(dummy)
temp$source_url[[1]] <- urls[i]
bb_data_sources(dummy) <- temp
## pass to the rget handler
## we could do it directly here with GET calls, but simpler to use the rget handler functionality
this <- bb_handler_rget(dummy, verbose = verbose, level = 0, use_url_directory = FALSE, force_local_filename = files[i])
all_ok <- all_ok && this$ok
if (nrow(this$files[[1]])>0) {
this$files[[1]]$file <- file.path(target_dir, this$files[[1]]$file)
downloads <- rbind(downloads, this$files[[1]])
}
if (nzchar(this$message)) msg <- c(msg, this$message)
}
if (length(msg) < 1) msg <- ""
tibble(ok = all_ok, files = list(downloads), message = msg)
}
And then:
> src <- bb_data_gouv_source("5e1f20058b4c414d3f94460d")
> res <- bb_get(src, local_file_root = "/tmp", verbose = TRUE)
Wed Oct 13 17:04:46 2021
Synchronizing dataset: Base nationale sur les intercommunalités
Source URL https://www.data.gouv.fr/api/1/datasets/5e1f20058b4c414d3f94460d
--------------------------------------------------------------------------------------------
this dataset path is: /tmp/www.data.gouv.fr/api/1/datasets
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75 ...
|=====================================================================================================================| 100%
Downloading: 59 kB
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3 ...
|=====================================================================================================================| 100%
Downloading: 2.6 kB
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b ...
|=====================================================================================================================| 100%
|=====================================================================================================================| 100%
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2 ...
|=====================================================================================================================| 100%
Downloading: 21 MB
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392 ...
|=====================================================================================================================| 100%
Downloading: 5.6 MB
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f ...
|=====================================================================================================================| 100%
Downloading: 3.5 MB
done.
downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6 ...
|=====================================================================================================================| 100%
Downloading: 2.1 MB
done.
Wed Oct 13 17:06:28 2021 dataset synchronization complete: Base nationale sur les intercommunalités
> res$files
[[1]]
# A tibble: 7 × 3
url
<chr>
1 https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
2 https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3
3 https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b
4 https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
5 https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392
6 https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f
7 https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6
file note
<chr> <chr>
1 /tmp/www.data.gouv.fr/api/1/datasets/competences.csv downloaded
2 /tmp/www.data.gouv.fr/api/1/datasets/codes-competences.csv downloaded
3 /tmp/www.data.gouv.fr/api/1/datasets/categories-competences.csv downloaded
4 /tmp/www.data.gouv.fr/api/1/datasets/Périmètre des EPCI à fiscalité propre - année 2021 (0104).xls downloaded
5 /tmp/www.data.gouv.fr/api/1/datasets/Compétences des groupements - année 2021 (0104).xls downloaded
6 /tmp/www.data.gouv.fr/api/1/datasets/Coordonnées des groupements - année 2021 (0104).xls downloaded
7 /tmp/www.data.gouv.fr/api/1/datasets/Liste des groupements - année 2021 (0104).xls downloaded
I get some warnings about text encodings being invalid in my locale, but it doesn't seem to cause actual problems (and presumably you will not get these warnings in your locale).
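To make the file-naming step in the handler easier to follow, here is a small sketch of it in base R only. It swaps in `tools::file_path_sans_ext` and `gsub` for the `fs::path_ext_remove` and `fs::path_sanitize` calls used above, so the sanitizing is cruder than what fs does. The `resources` table is a hypothetical example shaped like `jx$resources`, and its titles are made up.

```r
## Hypothetical resource table, shaped like jx$resources in the handler above
resources <- data.frame(
  title = c("competences.csv", "Liste des groupements - 2021 (01/04)"),
  format = c("csv", "xls"),
  latest = c(
    "https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75",
    "https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6"
  ),
  stringsAsFactors = FALSE
)

## The last URL component is an identifier, not a usable file name
url_tails <- basename(resources$latest)

## Drop any extension already present in the title, then append the declared format
stems <- tools::file_path_sans_ext(resources$title)
files <- paste0(stems, ".", resources$format)

## Remove path separators so each name stays a single file in the target directory
files <- gsub("[/\\\\]", "", files)
print(files)
```

Each element of `files` would then be passed as `force_local_filename` for the URL at the same position, as in the loop of the handler.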
Awesome! I will take a closer look at your answer.
My day will be even better for it.
Thanks a lot!