soilDB
soil resource inventory functions
Hey developers, I added a couple of functions to get SRI data from the ECOSHARE website https://ecoshare.info/soils/soil-resource-inventory/. I had chatted with @brownag and @dylanbeaudette on Twitter about it and thought I'd give it a go. The idea is to download the ZIP for a given GDB and save it to an object (e.g. `some_name <- get_SRI(gdb = 'Deschutes', layer = 'MapUnits')`). In addition, there is a `get_SRI_layers()` function to get the names of the layers. I added a few simple tests as well. It's pretty simple, but it does let the user get that data with ease in the R environment. Let me know what you think, thanks!
@joshualerickson Overall this looks good, thanks for the PR! Note that the failures of the CI are unrelated to your changes (this is an issue with one of the grids stored on SoilWeb that was recently updated; https://github.com/ncss-tech/soilDB/issues/275)
Two minor things:
- We've been moving away from `download.file()` and using `curl::curl_download()` with a custom handle (with a longer timeout and not requiring SSL verification). This is mostly for the vagaries of our USDA network--downloads timing out or HTTPS errors on R 4.2+. So something like this would be consistent with how we are now doing downloads:
curl::curl_download(paste0('https://ecoshare.info/uploads/soils/soil_resource_inventory/', gdb, '_SoilResourceInventory.gdb.zip'),
destfile = temp,
handle = .soilDB_curl_handle()
)
This pattern might be changed again in the future as there are some options that will be available in R 4.3 for SSL verification, and handles can be specified in base R just as well, but for now we use {curl}.
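For context, a handle like the one referenced above might be constructed roughly as follows; this is a sketch of the general idea, not the actual internal `.soilDB_curl_handle()`, and the option values are assumptions:

```r
# Sketch of a curl handle with a longer timeout and relaxed SSL peer
# verification, similar in spirit to soilDB's internal helper.
.soilDB_curl_handle <- function(timeout = 300) {
  curl::new_handle(
    timeout        = timeout, # generous timeout for slow network downloads
    ssl_verifypeer = FALSE    # tolerate HTTPS certificate issues on R 4.2+
  )
}
```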
- You can use markdown syntax in the documentation if you would prefer (e.g. backticks for code blocks). This is not something we have done retroactively for all docs, but we are moving towards it for easier syntax and readability in the source. No need to change it if it works though!
Sounds good @brownag! Thanks for taking a look! I'll change up the download as well as look into markdown-style docs (interested) and re-commit. This package has helped me a lot with BAER (post-fire stuff) and this would add some help while in Region 6 USDA.
I messed around with this a little bit more today.
I have a couple other small suggestions
- `get_SRI()` could take an argument `destdir = tempdir()` allowing the user to customize the destination directory for the download. If `!missing(destdir)` then you don't need to unlink the ZIP file.
  - To extend this: if `gdb` is a file path (`file.exists(gdb)`) rather than the name of one of the surveys, then it can just load from that directly. This saves having to download the GDB for each layer you want.
- It would also be nice if `layer` was vectorized. If length > 1, return a named list of results (again so you can get multiple layers from a single download).
- `get_SRI()` could take an `overwrite = FALSE` argument so that if the file exists in `destdir` an informative message is given rather than re-downloading it.
- If you could spell out the SRI acronym in the docs header that would be good. Might also be worth mentioning that these are for USFS Region 6 in the header/description. I like all the details you included.
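Taken together, the suggestions above could shape a signature along these lines; this is a rough, hypothetical sketch (the argument handling and the /vsizip/ read are assumptions, not the final implementation):

```r
# Hypothetical sketch combining the suggestions above; not the final
# soilDB implementation. Assumes the ecoshare URL pattern from this thread.
get_SRI <- function(gdb, layer, destdir = tempdir(), overwrite = FALSE) {
  if (file.exists(gdb)) {
    # `gdb` is a path to an existing (zipped) GDB: skip the download
    zipfile <- gdb
  } else {
    zipfile <- file.path(destdir, paste0(gdb, "_SoilResourceInventory.gdb.zip"))
    if (overwrite || !file.exists(zipfile)) {
      curl::curl_download(
        paste0("https://ecoshare.info/uploads/soils/soil_resource_inventory/",
               gdb, "_SoilResourceInventory.gdb.zip"),
        destfile = zipfile
      )
    } else {
      message("Using existing download: ", zipfile)
    }
  }
  # read straight out of the ZIP via GDAL's /vsizip/ virtual file system
  res <- lapply(layer, function(l) {
    sf::read_sf(paste0("/vsizip/", zipfile), layer = l)
  })
  names(res) <- layer
  # vectorized `layer`: named list if length > 1, single object otherwise
  if (length(res) == 1L) res[[1L]] else res
}
```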
And another issue:
- When I run `get_SRI_layers()` I get back the table list, but there are as many warnings. Any idea what causes these? I see that the vsizip and vsimem URLs used internally by `st_layers()` might be an issue? If there is no way around this then we might want to suppress warnings.
  - Related to this: we could probably use the /vsizip/ URLs and skip the use of `unzip()` in `get_SRI()`. Unzipping the downloaded file might be value added but not strictly necessary for loading into memory--especially when it is a tempfile that will get unlinked.
- There are also warnings produced when a layer has no simple feature geometries--since this is "expected" for the non-spatial data, those warnings could be suppressed.
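For reference, the /vsizip/ idea can be expressed as a small URL helper; the pattern below mirrors the ecoshare URL used earlier in the thread, combined with GDAL's /vsicurl/ so nothing is downloaded or unzipped explicitly:

```r
# Build a GDAL virtual-file-system path that reads the File Geodatabase
# directly out of the remote ZIP: /vsizip/ (zip archive) wrapping
# /vsicurl/ (HTTP range requests). No download.file()/unzip() needed.
sri_vsi_url <- function(gdb) {
  paste0("/vsizip//vsicurl/",
         "https://ecoshare.info/uploads/soils/soil_resource_inventory/",
         gdb, "_SoilResourceInventory.gdb.zip")
}

# usage (requires sf and network access):
# sf::st_layers(sri_vsi_url("Deschutes"))
# sf::read_sf(sri_vsi_url("Deschutes"), layer = "MapUnits")
```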
library(soilDB)
get_SRI_layers('Deschutes')
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000009.gdbtable (layer MapUnits): No such
#> file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000014.gdbtable (layer TimberSiteIndex):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000016.gdbtable (layer
#> BedrockCharacteristics): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000018.gdbtable (layer
#> LandtypeCharacteristics): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001d.gdbtable (layer
#> SampleSites_ChemicalAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001e.gdbtable (layer
#> SampleSites_SoilMoistureAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001f.gdbtable (layer MaterialsTesting):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000021.gdbtable (layer
#> SampleSites_MaterialsTesting): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000023.gdbtable (layer
#> SoilMoistureAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000024.gdbtable (layer ChemicalAnalysis):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000028.gdbtable (layer
#> Notes_TimberHarvestMethods): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000029.gdbtable (layer
#> Notes_Engineering): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000046.gdbtable (layer ErosionAndHydro):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000047.gdbtable (layer Engineering): No
#> such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000048.gdbtable (layer Recreation): No
#> such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000049.gdbtable (layer TimberManagement):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000004a.gdbtable (layer RangeAndWildlife):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000004b.gdbtable (layer
#> TimberHarvestMethods): No such file or directory
#> Driver: OpenFileGDB
#> Available layers:
#> layer_name geometry_type features fields
#> 1 MapUnits Multi Polygon 0 0
#> 2 TimberSiteIndex NA 0 0
#> 3 BedrockCharacteristics NA 0 0
#> 4 LandtypeCharacteristics NA 0 0
#> 5 SampleSites_ChemicalAnalysis Point 0 0
#> 6 SampleSites_SoilMoistureAnalysis Point 0 0
#> 7 MaterialsTesting NA 0 0
#> 8 SampleSites_MaterialsTesting Point 0 0
#> 9 SoilMoistureAnalysis NA 0 0
#> 10 ChemicalAnalysis NA 0 0
#> 11 Notes_TimberHarvestMethods NA 0 0
#> 12 Notes_Engineering NA 0 0
#> 13 ErosionAndHydro NA 0 0
#> 14 Engineering NA 0 0
#> 15 Recreation NA 0 0
#> 16 TimberManagement NA 0 0
#> 17 RangeAndWildlife NA 0 0
#> 18 TimberHarvestMethods NA 0 0
#> crs_name
#> 1 NAD_1983_Albers
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 NAD_1983_Albers
#> 6 NAD_1983_Albers
#> 7 <NA>
#> 8 NAD_1983_Albers
#> 9 <NA>
#> 10 <NA>
#> 11 <NA>
#> 12 <NA>
#> 13 <NA>
#> 14 <NA>
#> 15 <NA>
#> 16 <NA>
#> 17 <NA>
#> 18 <NA>
Hey @brownag, thanks for the suggestions -- they are a much cleaner and better way to get this data! I ended up using the /vsizip//vsicurl/ method, which got rid of the error in get_SRI_layers() as well as the need to download the zip. Thus, it's now all in memory :).
Also, added USFS Region 6 in header/description to make it clearer as well as spelling out SRI acronym.
In addition, added the option to ask for multiple layers as an argument. Uses a classic for-loop but could be changed if needed.
Thanks,
Josh
Excellent!
A couple last things before ready for merge:
- Since {sf} is only in Suggests for {soilDB} we need to use `requireNamespace()` at the beginning of each SRI function. Something like:
if (!requireNamespace("sf"))
stop("package `sf` is required", call. = FALSE)
- I would expose the `quiet` argument for `st_read()` in case the user needs less verbose output. The default can be FALSE.
- Anything making remote requests might need to be wrapped in `try()` for safety in case of network failure, so possibly the result could be a try-error rather than sf/data.frame.
- It seems that some non-spatial tables have list columns, so the `st_read()` might need some thinking, or at least error handling as above. For example:
get_SRI("Willamette", "Notes_TimberManagement")
#> Reading layer `Notes_TimberManagement' from data source
#> `/vsizip//vsicurl/https://ecoshare.info/uploads/soils/soil_resource_inventory/Wilamette_SoilResourceInventory.gdb.zip'
#> using driver `OpenFileGDB'
#> Warning: no simple feature geometries present: returning a data.frame or tbl_df
#> Warning in process_cpl_read_ogr(x, quiet, check_ring_dir = check_ring_dir, :
#> list-column(s) present: in case of failure, try read_sf() or as_tibble=TRUE
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 220, 596, 162, 264, 122, 148
- There is an R CMD check NOTE about the `get_SRI()` multi-layer example line being >100 characters. This can be split onto multiple lines. Thanks for including examples with donttest, I am glad to see they run fairly quickly too!
- Please also add the Roxygen tag `@examplesIf curl::has_internet() && requireNamespace("sf")`.
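The try() suggestion above could look roughly like this (a sketch; the function name and messaging are placeholders):

```r
# Sketch: wrap the remote read in try() so a network failure returns a
# catchable condition instead of stopping; `safe_read_layer` is a
# placeholder name, not a soilDB function.
safe_read_layer <- function(dsn, layer, quiet = FALSE) {
  res <- try(sf::read_sf(dsn, layer = layer, quiet = quiet), silent = TRUE)
  if (inherits(res, "try-error")) {
    message("Failed to read layer '", layer, "': ",
            conditionMessage(attr(res, "condition")))
    return(invisible(NULL))
  }
  res
}
```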
A few other thoughts that are not necessary:
- It might be easier for the user if `get_SRI_layers()` returned a normal data.frame rather than a sf_layers object. The sf_layers object looks like a list/data.frame but does not have the latter's properties in all cases.
- It would be nice if the `gdb` argument wasn't case-sensitive, and there was some validation of the names against the list you have included in the docs rather than a failure with the URL.
  - One option could be to define an (internal) function that returns possible GDB names as a character vector that can then be checked with e.g. `match.arg(tolower(gdb), choices = .SRI_gdb_names())`. This winds up hard-coding the names... but might provide more informative errors to the user and allow for a bit more flexibility in input.
- Any idea if there are other SRI-type datasets out there for other FS regions? Just curious as to whether we could further generalize this approach by being able to ID alternate base URLs and filenames to use, and whether that is something to consider.
- The for-loop is fine as these lists do not have that many elements in them. My personal tendency would be to use `lapply()` but tomato tomato :)... You could remove the second `st_read()` call for the length-1 case after the else and just `return(sri[[1]])` when `length(layers) == 1`. Normally when I do something like this I also have a `simplify = TRUE` argument, which allows the result to be returned as a list if `simplify = FALSE`.
- Curious whether you have thought about what `fetchSRI()` might look like in the future. Usually we have the `fetchX()` functions as higher-level wrappers around `get_X()` functions. Most but not all of them return SoilProfileCollections. Here is an idea to think on:
fetchSRI <- function(gdb) {
  lapply(gdb, function(x) {
    suppressWarnings(get_SRI(x, layers = get_SRI_layers(x)$name))
  })
}
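The name-validation idea mentioned above (an internal name list checked with `match.arg()`) could be sketched like this; the helper names and the truncated name list are hypothetical:

```r
# Hypothetical internal helper: canonical GDB names keyed by a cleaned
# (lower-case, space-free) form. Only a few names shown for illustration.
.SRI_gdb_names <- function() {
  c(deschutes      = "Deschutes",
    giffordpinchot = "GiffordPinchot",
    willamette     = "Willamette")
}

# Case- and whitespace-insensitive matching of the user-supplied `gdb`,
# with an informative error (from match.arg) on bad input.
.match_SRI_gdb <- function(gdb) {
  cleaned <- match.arg(gsub("\\s+", "", tolower(gdb)),
                       choices = names(.SRI_gdb_names()))
  unname(.SRI_gdb_names()[cleaned])
}
```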
Hey @brownag great stuff as usual!
I'll go through the suggestions below.
- since {sf} is only in Suggests for {soilDB} we need to use requireNamespace() at the beginning of each SRI function. Something like:
if (!requireNamespace("sf"))
  stop("package `sf` is required", call. = FALSE)
✅ added to both functions
- I would expose the quiet argument for st_read() in case user needs less verbose output. Default can be FALSE.
✅ added as an argument
- Anything making remote requests might need to be wrapped in try() for safety in case network failure, so possibly result could be try-error rather than sf/data.frame
✅ added try() around the function calling the /vsizip//vsicurl/ endpoint. Good call on this one! Now just adds the error to the list.
- It seems that some non-spatial tables have list columns, so the st_read() might need some thinking, or at least error handling as above.
✅ This was fixed by changing the function to read_sf() instead of st_read().
- There is a R CMD check NOTE about the length of the get_SRI() multi-layer example line >100 characters. This can be split onto multiple lines.
✅ Fixed by splitting
- Please also add Roxygen tag @examplesIf curl::has_internet() && requireNamespace("sf")
✅ Added to the Roxygen setup
A few other thoughts that are not necessary:
- It might be easier for the user if get_SRI_layers() returned a normal data.frame rather than a sf_layers object. The sf_layers object looks like a list/data.frame but does not have the latter properties in all cases.
✅ worked by adding list2DF().
- It would be nice if the gdb argument wasn't case-sensitive, and there was some validation of the names against the list you have included in the docs rather than a failure w/ the URL. One option could be to define an (internal) function that returns possible GDB names as a character vector that can then be checked with e.g. match.arg(tolower(gdb), choices = .SRI_gdb_names()). This winds up hard coding the names... but might provide more informative errors to user and allow for a bit more flexibility in input.
✅ Took a stab at this. Seems to be good for both upper and lower case as well as upper and lower case with spaces as 'matching' args, e.g. all good (GiffordPinchot, giffordpinchot, Gifford Pinchot, gifford pinchot).
- Any idea if there are other SRI type datasets out there for other FS regions? Just curious as to whether we could further generalize this approach by being able to ID alternate base URL and filenames to use and whether that is something to consider.
❓ Not sure on this one... I can check with some others and see if there is something similar in other regions.
- The for-loop is fine as these lists do not have that many elements in them. My personal tendency would be to use lapply() but tomato tomato :)... could remove the second st_read() call for the length 1 case after the else and just return(sri[[1]]) when length(layers)==1. Normally when I do something like this I also have a simplify=TRUE argument required to return which allows it to be returned as list if simplify=FALSE.
✅ I like this. I added simplify to the arguments and now does the above suggestion.
- Curious whether you have thought about what fetchSRI() might look like in the future. Usually we have the fetchX() functions as higher level wrappers around get_X() functions. Most but not all of them return SoilProfileCollections. Here is an idea to think on:

fetchSRI <- function(gdb) {
  lapply(gdb, function(x) {
    suppressWarnings(get_SRI(x, layers = get_SRI_layers(x)$name))
  })
}
🤔 Not sure I've thought about this but I think I like it :). I'll leave it up to you.
Thanks for all your insight and support in all this!
Thanks for the contribution. Just back from travel, I'll test out this week. What sort of work would these data support? I've never used them and some context would help me evaluate / think of ways to describe in the documentation.
Thanks for the updates Josh.
The only remaining issue is that `list2DF()` in `get_SRI_layers()` only works with "columns" that have the same number of elements.
Something like this might work more generally--not sure if there is a cleaner/more canonical way to do it:
as.data.frame(sapply(layers, I))
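A toy example of why the `I()` wrapper helps (the `layers` object here is a hypothetical stand-in for the `st_layers()` result, with one atomic and one list element):

```r
# Toy stand-in for the sf st_layers() result: equal-length top-level
# elements, but `geomtype` is a list rather than an atomic vector.
layers <- list(name     = c("MapUnits", "ChemicalAnalysis"),
               geomtype = list("Multi Polygon", NA))

# Wrapping each element in I() protects the list element so the whole
# thing can be turned into an ordinary data.frame.
d <- as.data.frame(sapply(layers, I))
```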
Hey @dylanbeaudette, these data would help support areas where SSURGO was not surveyed, apparently (I'm not a soils scientist). I used this Soil Resource Inventory (SRI) on a recent Burned Area Emergency Response (BAER) assignment within the Hells Canyon National Recreation Area (sweet place btw!) to get hydrologic soil group categories for post-fire flow modeling. I usually get this from SSURGO, thanks to you guys, but in this case it wasn't there... From the @details:
"Due to the fact that many USFS Region 6 Forests do not have NRCS SSURGO surveys (at a scale of 1:24,000, these are the highest-resolution soils data generally available), USFS Region 6 initiated a project in 2012 to bring these legacy SRI soils data into digital databases to facilitate their use in regional planning activities."
This would fill that void; where SSURGO doesn't exist but SRI does.
Thanks,
Josh
Hey @brownag I changed list2DF to use as.data.frame(sapply(layers, I)) instead (I is a cool function; didn't know about it). I also added the fetchSRI() function as a higher level wrapper. I think I like this as it makes it possible to get multiple gdbs if needed. Let me know what you think, thanks!
Thanks Josh. I think that what you and Andrew have put together will be really helpful to our staff and beyond. Thanks again for your contributions.
@joshualerickson This works for me and is basically good to go! Thanks for making all these revisions!
A few more :) tiny things
- You should re-use `.get_SRI_gdb_names()` in `fetchSRI()`; you should be able to vectorize the name-cleaning function by using `match.arg(..., several.ok = TRUE)`.
- If the `lapply()` in `fetchSRI()` is changed to `sapply()` the names of the GDBs will be preserved in the result list object without having to set them manually--on further reflection I think the highest level of the result list should be named, and if `.get_SRI_gdb_names()` is used as above they will be the standard names.
- In `get_SRI()` can we add an `as.data.frame(sri)` so the non-spatial results are a data.frame, not a tibble? This is a read_sf/st_read thing I think. Or just use `as_tibble = FALSE`.
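The name-preservation point is plain base R behavior: with a character input and `USE.NAMES = TRUE` (the default), `sapply()` names the result after the input values, while `lapply()` does not:

```r
gdb <- c("Deschutes", "Willamette")

# lapply() returns an unnamed list...
a <- lapply(gdb, toupper)

# ...while sapply(..., simplify = FALSE) keeps the inputs as names, so
# the result list is keyed by GDB without setting names() manually.
b <- sapply(gdb, toupper, simplify = FALSE)
```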
I have a few last minor documentation requests.
- Put your name under the `@author` Roxygen tag for each of the .Rd files.
- The `fetchSRI()` documentation `@description` tag is misspelled, generating a warning.
- The first time you reference "gdb" in each .Rd, spell out "File Geodatabase", then add the parenthetical "(GDB)", and then use the upper-case acronym for subsequent uses (where you are not referring to the argument `gdb` but rather the file names of the geodatabases).
- To the fetchSRI docs add: @seealso `get_SRI()` `get_SRI_layers()`
- To the get_SRI docs add: @seealso `get_SRI_layers()`
Alright @brownag I think we should be good! But, let me know what you think and we can go from there. Thanks again for working through this with me!
✅ You should re-use .get_SRI_gdb_names() in fetchSRI(); should be able to vectorize the name-cleaning function by using match.arg(..., several.ok=TRUE)
✅ If the lapply() in fetchSRI() is changed to sapply() the names of the GDBs will be preserved in the result list object without having to set them manually--on further reflection I think the highest level of the result list should be named, and if .get_SRI_gdb_names() is used as above they will be the standard names
✅ In get_SRI() can we add an as.data.frame(sri) so the non-spatial results are data.frame not tibble? This is a read_sf/st_read thing I think. Or just use as_tibble=FALSE
✅ Put your name in under @author Roxygen tag for each of the .Rd files
✅ fetchSRI() documentation @description tag is misspelled, generating a warning
✅ First time you reference "gdb" in each .Rd spell out " File Geodatabase" then add parenthetical "(GDB)" and then use the upper-case acronym for subsequent use (where you are not referring to the argument gdb but rather the file names of the geodatabases)
✅ added @seealso to each fetchSRI() and get_SRI()
I am very happy with this. Thanks for all the great work @joshualerickson!