
general purpose rate limiting across pkg

Open sckott opened this issue 6 years ago • 19 comments

via request from GBIF (email title "Re: Help on server error")

Except the download request API - though we should use it for some download routes, e.g. for checking status.

sckott avatar Aug 29 '18 00:08 sckott

Can folks help me test this? On the rate-limit branch I've added rate limiting across the package, so regardless of the function used, one should only be able to do 60 requests per minute. You don't have to do anything special - just use the package as usual. You can e.g. check how long things take with system.time or perhaps the microbenchmark pkg or similar.

install like remotes::install_github("ropensci/rgbif@rate-limit")
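
For example, a quick pacing check (a minimal sketch; name_backbone() is just used here as a lightweight call, and the taxon name is arbitrary) - with the limiter active, 10 calls should take roughly 10 seconds:

library(rgbif)
system.time({
  for (i in 1:10) name_backbone(name = "Helianthus annuus")
})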

@damianooldoni @dmcglinn @MattBlissett @jkmccarthy @jwhalennds @poldham @andzandz11

Let me know if you see any potential issues with the internal helper that does the waiting between requests https://github.com/ropensci/rgbif/blob/rate-limit/R/HttpStore.R
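
For intuition, the waiting logic amounts to something like the following sketch (illustrative only, not the actual HttpStore.R code): remember when the last request went out and sleep off any remainder of the one-second interval.

make_limiter <- function(min_interval = 1) {
  last <- Sys.time() - min_interval
  function() {
    # sleep just long enough to allow at most one request per interval
    elapsed <- as.numeric(difftime(Sys.time(), last, units = "secs"))
    if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)
    last <<- Sys.time()
    invisible(NULL)
  }
}

throttle <- make_limiter()
throttle()  # call before each HTTP request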

sckott avatar Sep 13 '18 22:09 sckott

@maelle can you give this a try and see if you find any problems?

sckott avatar Sep 17 '18 14:09 sckott

@sckott I have installed and will try to find something to give this a whirl with.

poldham avatar Sep 17 '18 14:09 poldham

Looks fine, but I only tested this:

# one date for each of the last 240 days
days <- seq(from = Sys.Date() - 240,
            to = Sys.Date(),
            by = 1)

# one occ_search() request per day, recording when the response came back
get_one_day <- function(day){
  date <- format(day, "%Y-%m-%d")
  result <- rgbif::occ_search(eventDate = date,
                              country = "fr",
                              limit = 1)$data
  result$time <- Sys.time()
  result
}

results <- purrr::map_df(days, get_one_day)
unique(results$time)
#>   [1] "2018-09-17 16:54:39 CEST" "2018-09-17 16:54:40 CEST"
#>   [3] "2018-09-17 16:54:41 CEST" "2018-09-17 16:54:42 CEST"
#>   [5] "2018-09-17 16:54:43 CEST" "2018-09-17 16:54:44 CEST"
#>   [7] "2018-09-17 16:54:45 CEST" "2018-09-17 16:54:46 CEST"
#>   [9] "2018-09-17 16:54:47 CEST" "2018-09-17 16:54:48 CEST"
#>  [11] "2018-09-17 16:54:49 CEST" "2018-09-17 16:54:50 CEST"
#>  [13] "2018-09-17 16:54:51 CEST" "2018-09-17 16:54:52 CEST"
#>  [15] "2018-09-17 16:54:53 CEST" "2018-09-17 16:54:54 CEST"
#>  [17] "2018-09-17 16:54:55 CEST" "2018-09-17 16:54:56 CEST"
#>  [19] "2018-09-17 16:54:57 CEST" "2018-09-17 16:54:58 CEST"
#>  [21] "2018-09-17 16:54:59 CEST" "2018-09-17 16:55:00 CEST"
#>  [23] "2018-09-17 16:55:01 CEST" "2018-09-17 16:55:02 CEST"
#>  [25] "2018-09-17 16:55:03 CEST" "2018-09-17 16:55:04 CEST"
#>  [27] "2018-09-17 16:55:05 CEST" "2018-09-17 16:55:06 CEST"
#>  [29] "2018-09-17 16:55:07 CEST" "2018-09-17 16:55:08 CEST"
#>  [31] "2018-09-17 16:55:09 CEST" "2018-09-17 16:55:10 CEST"
#>  [33] "2018-09-17 16:55:11 CEST" "2018-09-17 16:55:12 CEST"
#>  [35] "2018-09-17 16:55:13 CEST" "2018-09-17 16:55:14 CEST"
#>  [37] "2018-09-17 16:55:15 CEST" "2018-09-17 16:55:16 CEST"
#>  [39] "2018-09-17 16:55:17 CEST" "2018-09-17 16:55:18 CEST"
#>  [41] "2018-09-17 16:55:19 CEST" "2018-09-17 16:55:20 CEST"
#>  [43] "2018-09-17 16:55:21 CEST" "2018-09-17 16:55:22 CEST"
#>  [45] "2018-09-17 16:55:23 CEST" "2018-09-17 16:55:24 CEST"
#>  [47] "2018-09-17 16:55:25 CEST" "2018-09-17 16:55:26 CEST"
#>  [49] "2018-09-17 16:55:27 CEST" "2018-09-17 16:55:28 CEST"
#>  [51] "2018-09-17 16:55:29 CEST" "2018-09-17 16:55:30 CEST"
#>  [53] "2018-09-17 16:55:31 CEST" "2018-09-17 16:55:32 CEST"
#>  [55] "2018-09-17 16:55:33 CEST" "2018-09-17 16:55:34 CEST"
#>  [57] "2018-09-17 16:55:35 CEST" "2018-09-17 16:55:36 CEST"
#>  [59] "2018-09-17 16:55:37 CEST" "2018-09-17 16:55:38 CEST"
#>  [61] "2018-09-17 16:55:39 CEST" "2018-09-17 16:55:40 CEST"
#>  [63] "2018-09-17 16:55:41 CEST" "2018-09-17 16:55:42 CEST"
#>  [65] "2018-09-17 16:55:43 CEST" "2018-09-17 16:55:44 CEST"
#>  [67] "2018-09-17 16:55:45 CEST" "2018-09-17 16:55:46 CEST"
#>  [69] "2018-09-17 16:55:47 CEST" "2018-09-17 16:55:48 CEST"
#>  [71] "2018-09-17 16:55:49 CEST" "2018-09-17 16:55:50 CEST"
#>  [73] "2018-09-17 16:55:51 CEST" "2018-09-17 16:55:52 CEST"
#>  [75] "2018-09-17 16:55:53 CEST" "2018-09-17 16:55:54 CEST"
#>  [77] "2018-09-17 16:55:55 CEST" "2018-09-17 16:55:56 CEST"
#>  [79] "2018-09-17 16:55:57 CEST" "2018-09-17 16:55:58 CEST"
#>  [81] "2018-09-17 16:55:59 CEST" "2018-09-17 16:56:00 CEST"
#>  [83] "2018-09-17 16:56:01 CEST" "2018-09-17 16:56:02 CEST"
#>  [85] "2018-09-17 16:56:03 CEST" "2018-09-17 16:56:04 CEST"
#>  [87] "2018-09-17 16:56:05 CEST" "2018-09-17 16:56:06 CEST"
#>  [89] "2018-09-17 16:56:07 CEST" "2018-09-17 16:56:08 CEST"
#>  [91] "2018-09-17 16:56:09 CEST" "2018-09-17 16:56:10 CEST"
#>  [93] "2018-09-17 16:56:11 CEST" "2018-09-17 16:56:12 CEST"
#>  [95] "2018-09-17 16:56:13 CEST" "2018-09-17 16:56:14 CEST"
#>  [97] "2018-09-17 16:56:15 CEST" "2018-09-17 16:56:16 CEST"
#>  [99] "2018-09-17 16:56:17 CEST" "2018-09-17 16:56:18 CEST"
#> [101] "2018-09-17 16:56:19 CEST" "2018-09-17 16:56:20 CEST"
#> [103] "2018-09-17 16:56:21 CEST" "2018-09-17 16:56:22 CEST"
#> [105] "2018-09-17 16:56:23 CEST" "2018-09-17 16:56:24 CEST"
#> [107] "2018-09-17 16:56:25 CEST" "2018-09-17 16:56:26 CEST"
#> [109] "2018-09-17 16:56:27 CEST" "2018-09-17 16:56:28 CEST"
#> [111] "2018-09-17 16:56:29 CEST" "2018-09-17 16:56:30 CEST"
#> [113] "2018-09-17 16:56:31 CEST" "2018-09-17 16:56:32 CEST"
#> [115] "2018-09-17 16:56:33 CEST" "2018-09-17 16:56:34 CEST"
#> [117] "2018-09-17 16:56:35 CEST" "2018-09-17 16:56:36 CEST"
#> [119] "2018-09-17 16:56:37 CEST" "2018-09-17 16:56:38 CEST"
#> [121] "2018-09-17 16:56:39 CEST" "2018-09-17 16:56:40 CEST"
#> [123] "2018-09-17 16:56:41 CEST" "2018-09-17 16:56:42 CEST"
#> [125] "2018-09-17 16:56:43 CEST" "2018-09-17 16:56:44 CEST"
#> [127] "2018-09-17 16:56:45 CEST" "2018-09-17 16:56:46 CEST"
#> [129] "2018-09-17 16:56:47 CEST" "2018-09-17 16:56:48 CEST"
#> [131] "2018-09-17 16:56:49 CEST" "2018-09-17 16:56:50 CEST"
#> [133] "2018-09-17 16:56:51 CEST" "2018-09-17 16:56:52 CEST"
#> [135] "2018-09-17 16:56:53 CEST" "2018-09-17 16:56:54 CEST"
#> [137] "2018-09-17 16:56:55 CEST" "2018-09-17 16:56:56 CEST"
#> [139] "2018-09-17 16:56:57 CEST" "2018-09-17 16:56:58 CEST"
#> [141] "2018-09-17 16:56:59 CEST" "2018-09-17 16:57:00 CEST"
#> [143] "2018-09-17 16:57:01 CEST" "2018-09-17 16:57:02 CEST"
#> [145] "2018-09-17 16:57:03 CEST" "2018-09-17 16:57:04 CEST"
#> [147] "2018-09-17 16:57:05 CEST" "2018-09-17 16:57:06 CEST"
#> [149] "2018-09-17 16:57:07 CEST" "2018-09-17 16:57:08 CEST"
#> [151] "2018-09-17 16:57:09 CEST" "2018-09-17 16:57:10 CEST"
#> [153] "2018-09-17 16:57:11 CEST" "2018-09-17 16:57:12 CEST"
#> [155] "2018-09-17 16:57:13 CEST" "2018-09-17 16:57:14 CEST"
#> [157] "2018-09-17 16:57:15 CEST" "2018-09-17 16:57:16 CEST"
#> [159] "2018-09-17 16:57:17 CEST" "2018-09-17 16:57:18 CEST"
#> [161] "2018-09-17 16:57:19 CEST" "2018-09-17 16:57:20 CEST"
#> [163] "2018-09-17 16:57:21 CEST" "2018-09-17 16:57:22 CEST"
#> [165] "2018-09-17 16:57:23 CEST" "2018-09-17 16:57:24 CEST"
#> [167] "2018-09-17 16:57:25 CEST" "2018-09-17 16:57:26 CEST"
#> [169] "2018-09-17 16:57:27 CEST" "2018-09-17 16:57:28 CEST"
#> [171] "2018-09-17 16:57:29 CEST" "2018-09-17 16:57:30 CEST"
#> [173] "2018-09-17 16:57:31 CEST" "2018-09-17 16:57:32 CEST"
#> [175] "2018-09-17 16:57:33 CEST" "2018-09-17 16:57:34 CEST"
#> [177] "2018-09-17 16:57:35 CEST" "2018-09-17 16:57:36 CEST"
#> [179] "2018-09-17 16:57:37 CEST" "2018-09-17 16:57:38 CEST"
#> [181] "2018-09-17 16:57:39 CEST" "2018-09-17 16:57:40 CEST"
#> [183] "2018-09-17 16:57:41 CEST" "2018-09-17 16:57:42 CEST"
#> [185] "2018-09-17 16:57:43 CEST" "2018-09-17 16:57:44 CEST"
#> [187] "2018-09-17 16:57:45 CEST" "2018-09-17 16:57:46 CEST"
#> [189] "2018-09-17 16:57:47 CEST" "2018-09-17 16:57:48 CEST"
#> [191] "2018-09-17 16:57:49 CEST" "2018-09-17 16:57:50 CEST"
#> [193] "2018-09-17 16:57:51 CEST" "2018-09-17 16:57:52 CEST"
#> [195] "2018-09-17 16:57:53 CEST" "2018-09-17 16:57:55 CEST"
#> [197] "2018-09-17 16:57:55 CEST" "2018-09-17 16:57:56 CEST"
#> [199] "2018-09-17 16:57:57 CEST" "2018-09-17 16:57:58 CEST"
#> [201] "2018-09-17 16:57:59 CEST" "2018-09-17 16:58:00 CEST"
#> [203] "2018-09-17 16:58:01 CEST" "2018-09-17 16:58:02 CEST"
#> [205] "2018-09-17 16:58:03 CEST" "2018-09-17 16:58:04 CEST"
#> [207] "2018-09-17 16:58:05 CEST" "2018-09-17 16:58:06 CEST"
#> [209] "2018-09-17 16:58:07 CEST" "2018-09-17 16:58:08 CEST"
#> [211] "2018-09-17 16:58:09 CEST" "2018-09-17 16:58:10 CEST"
#> [213] "2018-09-17 16:58:11 CEST" "2018-09-17 16:58:12 CEST"
#> [215] "2018-09-17 16:58:13 CEST" "2018-09-17 16:58:14 CEST"
#> [217] "2018-09-17 16:58:15 CEST" "2018-09-17 16:58:16 CEST"
#> [219] "2018-09-17 16:58:17 CEST" "2018-09-17 16:58:18 CEST"
#> [221] "2018-09-17 16:58:19 CEST" "2018-09-17 16:58:20 CEST"
#> [223] "2018-09-17 16:58:21 CEST" "2018-09-17 16:58:22 CEST"
#> [225] "2018-09-17 16:58:23 CEST" "2018-09-17 16:58:24 CEST"
#> [227] "2018-09-17 16:58:25 CEST" "2018-09-17 16:58:26 CEST"
#> [229] "2018-09-17 16:58:27 CEST" "2018-09-17 16:58:28 CEST"
#> [231] "2018-09-17 16:58:29 CEST" "2018-09-17 16:58:30 CEST"
#> [233] "2018-09-17 16:58:31 CEST" "2018-09-17 16:58:32 CEST"
#> [235] "2018-09-17 16:58:33 CEST" "2018-09-17 16:58:34 CEST"
#> [237] "2018-09-17 16:58:35 CEST" "2018-09-17 16:58:36 CEST"
#> [239] "2018-09-17 16:58:37 CEST" "2018-09-17 16:58:38 CEST"
#> [241] "2018-09-17 16:58:39 CEST"

Created on 2018-09-17 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       Europe/Paris                
#>  date     2018-09-17
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                         
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.5.0)                 
#>  backports    1.1.2      2017-12-13 CRAN (R 3.5.0)                 
#>  base       * 3.5.0      2018-04-23 local                          
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.5.0)                 
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.5.0)                 
#>  colorspace   1.4-0      2018-08-14 R-Forge (R 3.5.1)              
#>  compiler     3.5.0      2018-04-23 local                          
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)                 
#>  crul         0.6.0      2018-07-10 CRAN (R 3.5.0)                 
#>  curl         3.2        2018-03-28 CRAN (R 3.5.0)                 
#>  data.table   1.11.4     2018-05-27 CRAN (R 3.5.0)                 
#>  datasets   * 3.5.0      2018-04-23 local                          
#>  devtools     1.13.6     2018-06-27 CRAN (R 3.5.1)                 
#>  digest       0.6.17     2018-09-12 CRAN (R 3.5.1)                 
#>  dplyr        0.7.6      2018-06-29 CRAN (R 3.5.1)                 
#>  evaluate     0.11       2018-07-17 CRAN (R 3.5.1)                 
#>  geoaxe       0.1.0      2016-02-19 CRAN (R 3.5.0)                 
#>  ggplot2      3.0.0      2018-07-03 CRAN (R 3.5.1)                 
#>  glue         1.3.0      2018-07-17 CRAN (R 3.5.0)                 
#>  graphics   * 3.5.0      2018-04-23 local                          
#>  grDevices  * 3.5.0      2018-04-23 local                          
#>  grid         3.5.0      2018-04-23 local                          
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.5.0)                 
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.5.1)                 
#>  httpcode     0.2.0      2016-11-14 CRAN (R 3.5.0)                 
#>  httr         1.3.1      2017-08-20 CRAN (R 3.5.0)                 
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.5.0)                 
#>  knitr        1.20       2018-02-20 CRAN (R 3.5.0)                 
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.5.0)                 
#>  lazyeval     0.2.1      2017-10-29 CRAN (R 3.5.0)                 
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)                 
#>  magrittr     1.5        2014-11-22 CRAN (R 3.5.0)                 
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)                 
#>  methods    * 3.5.0      2018-04-23 local                          
#>  munsell      0.5.0      2018-06-12 CRAN (R 3.5.0)                 
#>  oai          0.2.2      2016-11-24 CRAN (R 3.5.0)                 
#>  pillar       1.3.0      2018-07-14 CRAN (R 3.5.1)                 
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.5.0)                 
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)                 
#>  purrr        0.2.5      2018-05-29 CRAN (R 3.5.0)                 
#>  R6           2.2.2      2017-06-17 CRAN (R 3.5.0)                 
#>  Rcpp         0.12.18    2018-07-23 CRAN (R 3.5.0)                 
#>  rgbif        1.0.2.9421 2018-09-17 Github (ropensci/rgbif@6584a42)
#>  rgeos        0.3-28     2018-06-08 CRAN (R 3.5.1)                 
#>  rlang        0.2.2      2018-08-16 CRAN (R 3.5.1)                 
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.5.0)                 
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                 
#>  scales       1.0.0      2018-08-09 CRAN (R 3.5.1)                 
#>  sp           1.3-1      2018-06-05 CRAN (R 3.5.0)                 
#>  stats      * 3.5.0      2018-04-23 local                          
#>  stringi      1.2.4      2018-07-23 local                          
#>  stringr      1.3.1      2018-05-10 CRAN (R 3.5.0)                 
#>  tibble       1.4.2      2018-01-22 CRAN (R 3.5.0)                 
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.5.0)                 
#>  tools        3.5.0      2018-04-23 local                          
#>  triebeard    0.3.0      2016-08-04 CRAN (R 3.5.0)                 
#>  urltools     1.7.1      2018-08-03 CRAN (R 3.5.1)                 
#>  utils      * 3.5.0      2018-04-23 local                          
#>  whisker      0.3-2      2013-04-28 CRAN (R 3.4.0)                 
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                 
#>  xml2         1.2.0      2018-01-24 CRAN (R 3.5.0)                 
#>  yaml         2.2.0      2018-07-25 CRAN (R 3.5.1)

maelle avatar Sep 17 '18 15:09 maelle

thanks @maelle !

looks like it's working as expected

sckott avatar Sep 17 '18 15:09 sckott

Here are the timings of 10 iterations [each] before the change [sec]:

      min       lq   mean  median       uq     max neval
 4998.093 5016.831 5110.7 5108.02 5189.526 5248.55    10

The timings afterwards are still running after 20 hours, and it seems the first iteration has just finished! Everything gets slowed down from 29.72 requests per second to an average of 1.018 requests per second. So definitely an unacceptable slowdown for me. If this change gets pushed through without an option to skip the limit, the package becomes unusable for me.
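
For reference, timings like these can be collected with the microbenchmark package mentioned earlier; a sketch, assuming the workload is a loop of occ_count() calls (the benchmarked code itself is not shown above):

library(microbenchmark)
microbenchmark(
  for (i in 1:100) rgbif::occ_count(country = "DE"),
  times = 10, unit = "s"
)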

Can you please post the email from GBIF that requested you to make this change? Which official board was responsible for this?

@dnoesgaard @MattBlissett @mdoering

Andreas-Bio avatar Sep 19 '18 11:09 Andreas-Bio

thanks for testing @andzandz11 !

Yes, the goal was to limit to 1 request per second, or 60 requests per minute, as requested by GBIF.

I don't have a sense for whether GBIF is flexible on this or not. Any thoughts @MattBlissett @timrobertson100

sckott avatar Sep 19 '18 17:09 sckott

Thanks @andzandz11 and @sckott

The request to explore this came from me, as there have been a few instances recently where rogue scripts (e.g. infinite loops) have been issuing a lot of requests to GBIF.org. When it comes to the occurrence APIs of GBIF, it makes little sense to issue a lot of deep-paging requests when a single download call can bring any filtered result set far more efficiently, and with DOI-based citation. I asked Scott to explore options to rate limit in the client, as we also explore dynamic throttling based on IP to safeguard the services.

It would be helpful to understand what query patterns require you to hit GBIF occurrence search services so often from a single R application. Normally we'd recommend the download service for that. Can you elaborate on your use case, please?

timrobertson100 avatar Sep 19 '18 18:09 timrobertson100

Sorry for the late reply. Typically I need to retrieve extension information such as distribution, description and species profile for thousands of taxa in several species checklists (no occurrences involved!). For example, retrieving distributions for more than 2600 taxa takes 40 minutes via the rate-limit branch instead of 4 minutes using the master branch. I agree with @timrobertson100 about the correct use of asynchronous downloads for occurrences. Maybe setting a rate limit only for occurrences and not for checklist-related functions would be an option?
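
For context, that checklist workflow is roughly the following sketch (the taxon keys are placeholders) - one species-API request per taxon, which is why a 60-requests-per-minute cap multiplies the runtime roughly tenfold:

taxon_keys <- c(2435099, 5231190)  # placeholder GBIF taxon keys
dists <- lapply(taxon_keys, function(k) {
  # one request to the species API per taxon
  rgbif::name_usage(key = k, data = "distributions")
})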

damianooldoni avatar Sep 19 '18 20:09 damianooldoni

I am regularly building barcode reference databases from scratch using an R script (data from GenBank). I rebuild these databases from time to time to fix errors or to incorporate new sequences that have been published on GenBank, and in the same script I call the GBIF backbone to get the species key, which I then use to count occurrences in multiple countries via count_facet to score presence/absence. I have roughly 74,000 species in the database, and apart from downloading tens of GB of .csv files I see no other way than to loop over it in R. The script is fully automated and working really well: it is fast, always up to date, has a small memory footprint, uses very little bandwidth and leaves no trash behind (R is really bad at getting data out of RAM, and it will slow my machine down considerably during the runtime). Just downloading almost worldwide occurrence data for plants would make my RAM explode; I can't do that. It is also a lot of overkill, because I just need the occurrence counts per country. If somebody knows how to download a table containing all plant species vs. country occurrence counts, please let me know.

My solution would be to introduce the request throttling but exempt API-key users at the same time. People who go through the effort of requesting an API key would still be able to work undisturbed, with the additional benefit that individual API keys could be locked if they are misused. Being able to spontaneously try out new scripts without having to design them around some kind of request limit is so valuable (in a time = salary sense) to my project that I would rather buy an API key with a high limit than have my development time slowed down.

I am no fan of IP throttling: if you start throttling IPs from some German universities, people will be very upset, I guess. It is also intransparent and frustrating, because the first tests of your script will run very fast, and then the full run will miraculously need the whole day.
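
In rgbif terms, the per-species step of that pipeline is roughly the following (a sketch; the species name is illustrative):

# resolve a name against the GBIF backbone, then count occurrences by country
key <- rgbif::name_backbone(name = "Quercus robur")$usageKey
rgbif::count_facet(key, by = "country", countries = 5)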

Andreas-Bio avatar Sep 20 '18 00:09 Andreas-Bio

Thank you @damianooldoni and @andzandz11 for taking the time to clarify your use cases. It is great to hear that you find the services useful, and please be assured that our objective is to ensure quality of service and not to negatively affect real usage.

Based on the feedback, I propose that this not be included, @sckott, and that GBIF consider alternatives - in particular, that we should only activate defensive throttling when we observe issues (e.g. a DDoS), which is not the norm. Thank you for exploring this though - and sorry to waste your time.

Off topic to this thread: @andzandz11 - we are going to be expanding output formats from GBIF in the coming weeks/months. The first will be species lists derived from occurrence search, which is already in test. Would it be of any interest to have a service that allows a list of species to be POSTed, and for the response to be a matrix of "species, country, count", for example? If you could help specify any formats that would be immediately useful to you, please let us know ([email protected]).

timrobertson100 avatar Sep 20 '18 08:09 timrobertson100

@timrobertson100 allowing a species list (a list of species IDs) to be POSTed as a parameter for an occurrence search/download would certainly cater to our main use case for the TrIAS project!

peterdesmet avatar Sep 20 '18 09:09 peterdesmet

@timrobertson100 I would also like to support the POST method. I frequently end up with a few thousand species of interest (for national reports, for example) and want to retrieve the occurrence data only for those species using the IDs. At present that would involve making individual calls (e.g. for 4,000 species) or combining them into a query which will run for a while and then fail. My workaround has been to use bounding boxes on the website etc., but that involves too much guesswork and a lot of unnecessary data (e.g. I recently did the whole of South East Asia to get at marine species with occurrences in the ASEAN region). So I think a POST method would be a great help to those of us working with species data at the level of thousands. On rate limiting, I can recognise the need for it in some circumstances, but if it can be avoided that really would be much better.
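
For what it's worth, newer rgbif versions can express this as a single download request over many keys via pred_in(); a sketch - the keys are placeholders, GBIF credentials must be configured, and very long key lists may still be rejected by the API:

rgbif::occ_download(
  rgbif::pred_in("taxonKey", c(2435099, 5231190, 2476674)),
  rgbif::pred("country", "FR")
)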

poldham avatar Sep 20 '18 09:09 poldham

It would also be very helpful if you could specify which fields you want returned. For example, I have 80,000 species and I just need the "country" data, but right now, using the website download function, I have to get the whole dataset, which is 99 GB and too big to be handled properly by R. Even with the POST method, the worldwide dataset being returned would be too large.

Andreas-Bio avatar Sep 20 '18 11:09 Andreas-Bio

Thank you all - very useful.

Would it be of any interest to allow a user to post a SQL statement for an asynchronous download?

It would be for the more experienced user and would take a few minutes to return, and we'd probably need to sanitise and offer only a subset of SQL (single table, aggregations, groupings etc.), but we could allow e.g.:

-- species richness by 10-degree latitudinal band (pseudo-SQL follows, for example)
SELECT FLOOR(decimalLatitude/10) AS latitudeBand,
       COUNT(DISTINCT species) AS speciesCount
FROM occurrence
WHERE genusKey=... AND...
GROUP BY latitudeBand
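
If that were exposed through rgbif, usage might look like the sketch below - to be clear, the function name occ_download_sql() and its signature are assumptions here, not an interface described in this thread:

rgbif::occ_download_sql(
  "SELECT countryCode, COUNT(*) FROM occurrence GROUP BY countryCode"
)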

CC @mattblissett @GBIF for info as we consider options

timrobertson100 avatar Sep 20 '18 13:09 timrobertson100

@timrobertson100 that would be really nice! Since you can request aggregated data via SQL, I assume the downloads would be of another type than the current GBIF occurrence downloads?

peterdesmet avatar Sep 20 '18 14:09 peterdesmet

@timrobertson100 no worries, not a waste of time. had fun writing it

sckott avatar Sep 20 '18 16:09 sckott

@peterdesmet

Yes, a SQL download would be a new service. I have wondered about it several times, and I have seen a few instances recently where I think it might be an enabling service.

timrobertson100 avatar Sep 20 '18 17:09 timrobertson100

I'll leave this issue open and leave the work on the branch (rate-limit) in case we need to roll it in later. Thanks all!

sckott avatar Sep 25 '18 19:09 sckott

@jhnwllr fun coincidence you're closing this now as I just added throttling/rate limiting to another package! :smile_cat:

maelle avatar Sep 14 '23 12:09 maelle

@maelle I closed it because I thought the issue had sort of become out of date. I am not sure any rate limiting is needed at all. I have been abusing the GBIF API for years and it's fine.

jhnwllr avatar Sep 14 '23 13:09 jhnwllr

oh yeah, it makes sense! I was just reacting to the coincidence of topics, not judging the decision. :smiley:

maelle avatar Sep 14 '23 13:09 maelle