Troubleshoot 429 errors for functions relying on ATTAINS geospatial web services
Discuss the prevalence of this error with the ATTAINS team and determine whether adding a pause between requests in the impacted functions would be a viable solution. If so, make this change in the functions (see the sketch below).
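If a simple pause is viable, a minimal sketch might look like this (the loop, URLs, and delay value are all assumptions for illustration, not current EPATADA code):

urls <- c("https://example.gov/query1", "https://example.gov/query2")  # placeholder queries
results <- lapply(urls, function(u) {
  resp <- httr2::req_perform(httr2::request(u))
  Sys.sleep(0.5)  # assumed half-second pause between requests to stay under the rate limit
  httr2::resp_body_string(resp)
})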
Is there any advantage to using the arcgislayers package (https://github.com/R-ArcGIS/arcgislayers)?
From what I can tell right now, the loop just keeps going until an empty response comes back (so at least one more request than it needs). It might be worth getting a count first and then issuing exactly that many requests. The record limit I'm seeing is 10k rather than 1k; I thought the default was 2k. We could also read the limit from the service first in case it ever changes.
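For reference, a hedged httr2 sketch of the count-first approach (the layer URL and where clause are placeholders, not the actual fetchATTAINS query):

library(httr2)

layer_url <- "https://gispub.epa.gov/arcgis/rest/services/OW/ATTAINS_Assessment/MapServer/0"  # placeholder layer

# Read the paging limit from the service metadata instead of hard-coding it
meta <- request(layer_url) |>
  req_url_query(f = "json") |>
  req_perform() |>
  resp_body_json()
page_size <- meta$maxRecordCount

# Get the total feature count for the query up front
count_resp <- request(paste0(layer_url, "/query")) |>
  req_url_query(where = "1=1", returnCountOnly = "true", f = "json") |>
  req_perform() |>
  resp_body_json()
total <- count_resp$count

# Issue exactly as many paged requests as the count requires
offsets <- seq(0, max(total - 1, 0), by = page_size)
pages <- lapply(offsets, function(off) {
  request(paste0(layer_url, "/query")) |>
    req_url_query(where = "1=1", outFields = "*", f = "geojson",
                  resultOffset = off, resultRecordCount = page_size) |>
    req_perform() |>
    resp_body_string()
})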
The query is relatively static/simple (just a long list of params). There are definitely pros to using a package designed for hitting Esri services; we just have to make sure it meets your needs and doesn't broaden dependencies too much.
Looks like the arcgislayers package is already being used in fetchNHD()...
@mhweber - I used to use nhdplusTools or something like that to get NHD, but I feel like there have been changes since then - is there a good package for getting NHD HR? (I see flowlines, waterbodies, catchments)
Looks like httr, not httr2, is being used - is that consistent throughout TADA?
Ah, good catch. USGS dataRetrieval switched to httr2 recently. I will update this so we are consistently using httr2 as well.
Looks like httr2 includes built-in support for features like rate-limiting (req_throttle()) if this ends up having to go that route.
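A minimal sketch of that route (the ATTAINS domains endpoint here is an assumed stand-in for whichever geospatial service ends up needing it):

library(httr2)

req <- request("https://attains.epa.gov/attains-public/api/domains") |>
  req_throttle(rate = 30 / 60) |>  # cap at roughly 30 requests per minute
  req_retry(max_tries = 3)         # 429/503 responses are retried by default

resp <- req_perform(req)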
I used httr2 for altUSEPA/rExpertQuery (some functions are used in EPATADA), so it would be great to have both packages be consistent.
@jbousquin There is an NHDPlus HR service available, but I think nhdplusTools is still just using a downloader for NHDPlusHR and only leveraging services for NHDPlus Medium Res through the NLDI (worth asking Dave - I can ask).
@hillarymarler, @jbousquin I switched everything in StreamCatTools over to httr2 recently and am successfully using req_throttle() and req_retry() in my sc_get_data() function in the package. Also, you may already be doing something similar, but I split requests if I exceed a certain number of entries and pass a list of requests to a create_post_request() function using purrr::map_dfr, roughly like the sketch below.
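An illustrative sketch of that splitting pattern (make_post_request() is a stand-in, not the StreamCatTools function, and the endpoint, ids, and chunk size are placeholders):

library(httr2)
library(purrr)

service_url <- "https://example.epa.gov/service"  # placeholder endpoint
ids <- 1:1000                                     # placeholder entries
chunk_size <- 200                                 # assumed split threshold

make_post_request <- function(id_chunk) {
  request(service_url) |>
    req_body_form(comid = paste(id_chunk, collapse = ",")) |>
    req_throttle(rate = 30 / 60) |>
    req_retry(max_tries = 3)
}

# Split the id list into chunks, send one POST per chunk, row-bind the results
chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
results <- map_dfr(chunks, function(ch) {
  make_post_request(ch) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE) |>
    as.data.frame()
})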
Utilities::TADA_CheckColumns() seems like it could be leveraged for the spatial cols checks in these functions - everyone OK with me updating those (just going to add an arg for a non-default message so the default is current behavior)? Wanted to check before expanding the scope of the branch.
Heads up: I was getting a consistent 500 Internal Server Error from dataRetrieval::getWebServiceData(baseURL) when trying to use the fetchATTAINS() example data:
tada_data <- TADA_DataRetrieval(
  startDate = "1990-01-01",
  endDate = "1990-12-30",
  characteristicName = "pH",
  statecode = "NV",
  applyautoclean = TRUE,
  ask = FALSE
)
Grabbed the query info from the last test run to have something local to work through it with:
tada_data <- TADA_DataRetrieval(
  startDate = "2022-06-07",
  endDate = "2022-06-08",
  characteristicName = "pH",
  statecode = "NY",
  applyautoclean = TRUE,
  ask = FALSE
)
The examples in geospatialFunctions.R are \dontrun{} - but it may be worth including them in build tests (not familiar with what testthat does or doesn't do in that respect).
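A minimal testthat sketch for a service-dependent example; the skip helpers keep it from failing CRAN or offline builds (the fetchATTAINS() call shape and expected return class are assumptions, not confirmed):

test_that("fetchATTAINS returns spatial features for a small query", {
  testthat::skip_on_cran()                      # don't hit live services on CRAN
  testthat::skip_if_offline("gispub.epa.gov")   # skip when the service is unreachable
  result <- fetchATTAINS(tada_data)             # hypothetical call shape; tada_data built as in the example above
  testthat::expect_s3_class(result, "sf")       # assumed return class
})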
As I go, I'm realizing the query exit condition is nested, meaning it's rarely just one extra query to get an empty response; rather, the loop exits only when the responses from all layers are empty, and for a large area there is added splitting into clustered bboxes. These presumably produce a lot of very small or empty queries sent in fast succession, causing the error. Working on a batching refactor that should avoid that. I'm not convinced there is a huge efficiency gain from the bbox clustering - has any profiling/performance testing been done on that?
@kathryn-willi Hope you are doing well. Justin is helping us troubleshoot some issues with the ATTAINS queries. I don't recall if any profiling/performance testing had been done on the bbox clustering approach in fetchATTAINS. Do you recall why we went with this approach? Thanks!
Was the bbox clustering related to the long run time and general speed improvement for the large spatial data pull in #589, and to the tribal example data and errors in #583?
Hi team, yes, the bbox clustering approach was used to speed up long run times. It essentially clusters WQP points into groups so that a massive bbox containing all WQP points isn't used (i.e., reducing the likelihood of returning a bunch of ATTAINS features unrelated to the WQP points). There were some speed tests performed and this approach was much faster when the bbox was above a certain size (I believe that size is mentioned in a commented line above the splitting function!)
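For context, an illustrative sketch of that clustering idea (not the actual fetchATTAINS code; wqp_points and the cluster count are assumptions):

library(sf)

# wqp_points: an sf POINT object of WQP monitoring locations (assumed input)
coords <- st_coordinates(wqp_points)
groups <- stats::kmeans(coords, centers = 5)$cluster  # assumed cluster count

# One bbox per cluster instead of a single massive bbox around all points,
# so each ATTAINS spatial query covers less unrelated area
bboxes <- lapply(split(seq_len(nrow(coords)), groups), function(idx) {
  st_bbox(wqp_points[idx, ])
})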