
how to add datasource with source_url without filename in url

Open jdlom opened this issue 3 years ago • 2 comments

Hi, I am trying to write a function similar to bb_zenodo_source (https://github.com/ropensci/bowerbird/blob/53d7bcb56c75b10c1f42b418bec870bc0eff44df/R/zenodo.R#L23). I would like to download data from the French data.gouv.fr portal.

Unfortunately, the latest data are only accessible through URLs that do not contain a file name, for example: https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2

When I run the sync without the extra parameter force_local_filename, the data are downloaded but the files do not end up in the root folder. It works great with this parameter.

I tried to write a custom function that hacks the bb_handler_rget function to use the force_local_filename extra parameter, but with no success.

Any idea how to achieve this?

library(tidyverse)
library(fs)
library(bowerbird)

id_data_gouv <- "5e1f20058b4c414d3f94460d"


bb_data_gouv_source <- function(id) {
  ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
  jx <- jsonlite::fromJSON(paste0("https://www.data.gouv.fr/api/1/datasets/", id))
  #collection sizes
  csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)), error = function(e) NULL)
  id <- id
  description <- jx$description
  folder <- jx$slug
  doc_url <-jx$page
  postproc <- list()
  files <- paste0(fs::path_ext_remove(jx$resources$title), '.', jx$resources$format)
  bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
            id = id,
            description = ne_or(jx$description, "Dataset description"),
            ##keywords = ne_or(jx$metadata$keywords, NA_character_),
            doc_url = jx$page,
            citation = paste0("See ", jx$page, " for the correct citation"), ## seems odd that this isn't part of the record
            license = ne_or(jx$license, paste0("See ", doc_url, " for license information")),
            source_url = jx$resources$latest, ## list all urls. Does this cover datasets with multiple buckets? (Are there such things?)
            method = list("bb_data_gouv_handler_get", files = files), 
            comment = "Source definition created by bb_data_gouv_source",
            postprocess = postproc,
            collection_size = csize)
  
  
}

bb_data_gouv_handler_get <- function(config, verbose = TRUE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  this_flags <- list(...)
  this_flags <- c(list(url = cfrow$source_url), list(force_local_filename = files),this_flags, list(verbose = verbose))
  do.call(bb_rget, this_flags)
}


src <- bb_data_gouv_source(id_data_gouv)
cf <- bb_config(local_file_root = "~")
cf <- bb_add(cf, src )
status <- bb_sync(cf, create_root = TRUE, verbose = TRUE)
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE) 
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)

Created on 2021-10-12 by the reprex package (v2.0.1)

jdlom • Oct 12 '21 14:10

Hi @jdlom ... I've got to say, you have done well getting that close! This situation doesn't quite fit the scenario that bowerbird was originally designed for, which makes the solution a bit awkward. And even at the best of times the internals of how this all works are pretty obscure. Anyway, I'd suggest this:

bb_data_gouv_source <- function(id) {
    ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
    api_url <- paste0("https://www.data.gouv.fr/api/1/datasets/", id)
    jx <- jsonlite::fromJSON(api_url)
    ## collection sizes
    csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)), error = function(e) NULL)
    bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
              id = id,
              description = ne_or(jx$description, "Dataset description"),
              ##keywords = ne_or(jx$metadata$keywords, NA_character_),
              doc_url = jx$page,
              citation = paste0("See ", jx$page, " for the correct citation"), ## seems odd that this isn't part of the record
              license = ne_or(jx$license, paste0("See ", jx$page, " for license information")),
              source_url = api_url, ## delay the retrieval of the actual file list until sync time
              method = list("bb_handler_data_gouv"),
              comment = "Source definition created by bb_data_gouv_source",
              postprocess = list(),
              collection_size = csize)
}

bb_handler_data_gouv <- function(...) {
    bb_handler_data_gouv_inner(...)
}

bb_handler_data_gouv_inner <- function(config, verbose = FALSE, local_dir_only = FALSE, ...) {
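    ## if only the local directory is being requested, defer to the rget handler
    if (local_dir_only) return(bb_handler_rget(config, verbose = verbose, local_dir_only = TRUE, ...))
    ## otherwise retrieve the list of urls and their associated file names, then download each one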
    target_dir <- sub("[\\/]$", "", bb_data_source_dir(config))
    if (!dir.exists(target_dir)) {
        ok <- dir.create(target_dir, recursive = TRUE)
        if (!ok) {
            stop(sprintf("Could not create target directory %s: aborting.\n", target_dir))
        }
    }
    settings <- bowerbird:::save_current_settings()
    on.exit({ bowerbird:::restore_settings(settings) })
    setwd(target_dir)
    ds <- bb_data_sources(config)
    jx <- jsonlite::fromJSON(ds$source_url)
    files <- paste0(fs::path_ext_remove(jx$resources$title), '.', jx$resources$format)
    ## make the file names safe - e.g. they cannot have "/" in them
    files <- fs::path_sanitize(files)
    urls <- jx$resources$latest
    all_ok <- TRUE
    msg <- c()
        downloads <- tibble::tibble(url = character(), file = character(), was_downloaded = logical())
    for (i in seq_along(urls)) {
        dummy <- config
        temp <- bb_data_sources(dummy)
        temp$source_url[[1]] <- urls[i]
        bb_data_sources(dummy) <- temp
        ## pass to the rget handler
        ## we could do it directly here with GET calls, but simpler to use the rget handler functionality
        this <- bb_handler_rget(dummy, verbose = verbose, level = 0, use_url_directory = FALSE, force_local_filename = files[i])
        all_ok <- all_ok && this$ok
        if (nrow(this$files[[1]])>0) {
            this$files[[1]]$file <- file.path(target_dir, this$files[[1]]$file)
            downloads <- rbind(downloads, this$files[[1]])
        }
        if (nzchar(this$message)) msg <- c(msg, this$message)
    }
    if (length(msg) < 1) msg <- ""
    tibble::tibble(ok = all_ok, files = list(downloads), message = msg)
}

And then:

> src <- bb_data_gouv_source("5e1f20058b4c414d3f94460d")
> res <- bb_get(src, local_file_root = "/tmp", verbose = TRUE)

Wed Oct 13 17:04:46 2021
Synchronizing dataset: Base nationale sur les intercommunalités
Source URL https://www.data.gouv.fr/api/1/datasets/5e1f20058b4c414d3f94460d
--------------------------------------------------------------------------------------------

 this dataset path is: /tmp/www.data.gouv.fr/api/1/datasets
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75 ...  
  |=====================================================================================================================| 100%
Downloading: 59 kB     
 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3 ...  
  |=====================================================================================================================| 100%
Downloading: 2.6 kB     
 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b ...  
  |=====================================================================================================================| 100%
  |=====================================================================================================================| 100%

 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2 ...  
  |=====================================================================================================================| 100%
Downloading: 21 MB        
 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392 ...  
  |=====================================================================================================================| 100%
Downloading: 5.6 MB       
 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f ...  
  |=====================================================================================================================| 100%
Downloading: 3.5 MB       
 done.
 downloading file 1 of 1: https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6 ...  
  |=====================================================================================================================| 100%
Downloading: 2.1 MB       
 done.

Wed Oct 13 17:06:28 2021 dataset synchronization complete: Base nationale sur les intercommunalités

> res$files
[[1]]
# A tibble: 7 × 3
  url                                                                        
  <chr>                                                                      
1 https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
2 https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3
3 https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b
4 https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
5 https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392
6 https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f
7 https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6
  file                                                                                               note      
  <chr>                                                                                              <chr>     
1 /tmp/www.data.gouv.fr/api/1/datasets/competences.csv                                               downloaded
2 /tmp/www.data.gouv.fr/api/1/datasets/codes-competences.csv                                         downloaded
3 /tmp/www.data.gouv.fr/api/1/datasets/categories-competences.csv                                    downloaded
4 /tmp/www.data.gouv.fr/api/1/datasets/Périmètre des EPCI à fiscalité propre - année 2021 (0104).xls downloaded
5 /tmp/www.data.gouv.fr/api/1/datasets/Compétences des groupements - année 2021 (0104).xls           downloaded
6 /tmp/www.data.gouv.fr/api/1/datasets/Coordonnées des groupements - année 2021 (0104).xls           downloaded
7 /tmp/www.data.gouv.fr/api/1/datasets/Liste des groupements - année 2021 (0104).xls                 downloaded
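
The local file names in that listing are built from the `title` and `format` fields of each resource. Here is a minimal sketch of that step on made-up metadata. It uses base R stand-ins (`tools::file_path_sans_ext` and `gsub`) in place of the `fs` helpers in the handler, so it runs without extra packages. The titles and the list of stripped characters are assumptions for illustration only, and `fs::path_sanitize` covers more cases than this.

```r
## Hypothetical resource metadata, shaped like jx$resources in the handler above
resources <- data.frame(
    title = c("competences.csv", "Liste des groupements 2021 (01/04)"),
    format = c("csv", "xls"),
    stringsAsFactors = FALSE
)

## drop any extension already present in the title, then append the declared format
files <- paste0(tools::file_path_sans_ext(resources$title), ".", resources$format)

## remove characters that are unsafe in file names
## (a crude stand-in for fs::path_sanitize, which does more than this)
files <- gsub("[/:*?\"<>|]", "", files)

print(files)
```

The first title already carries its extension, so it is not doubled. The second has no extension and contains a "/", which is removed before the name is passed as force_local_filename.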

I get some warnings about text encodings being invalid in my locale, but it doesn't seem to cause actual problems (and presumably you will not get these warnings in your locale).
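
As for the original `argument inutilisé` (unused argument) error: judging from the message, the handler was being called with `local_dir_only = TRUE`, and the custom handler forwarded it through `...` to `bb_rget`, which apparently does not accept that argument. Below is a toy illustration in base R. `fake_rget`, `naive_handler` and `careful_handler` are hypothetical stand-ins, not bowerbird functions.

```r
## stand-in for a downloader that does NOT accept `local_dir_only`
fake_rget <- function(url, force_local_filename = NULL, verbose = FALSE) {
    paste("would download", url)
}

## forwards everything in `...`, so an extra local_dir_only argument reaches fake_rget
naive_handler <- function(config, verbose = FALSE, ...) {
    do.call(fake_rget, c(list(url = config$url, verbose = verbose), list(...)))
}

## names local_dir_only explicitly, so it is handled here and never forwarded
careful_handler <- function(config, verbose = FALSE, local_dir_only = FALSE, ...) {
    if (local_dir_only) return(file.path("root", config$dir))
    do.call(fake_rget, c(list(url = config$url, verbose = verbose), list(...)))
}

cfg <- list(url = "https://example.org/r/1234", dir = "example")

## the naive handler fails with an "unused argument" error (wording depends on the locale)
naive_result <- tryCatch(naive_handler(cfg, local_dir_only = TRUE),
                         error = function(e) conditionMessage(e))
print(naive_result)

## the careful handler just reports the directory
careful_result <- careful_handler(cfg, local_dir_only = TRUE)
print(careful_result)
```

This is why the handler above lists `local_dir_only` in its own signature and deals with it before doing any downloading.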

raymondben • Oct 13 '21 06:10

Awesome! I will take a closer look at your answer.

My day will be even more beautiful.

Thanks a lot

jdlom • Oct 13 '21 06:10