
soil resource inventory functions

Open joshualerickson opened this issue 3 years ago • 5 comments

Hey developers, I added a couple of functions to get SRI data from the ECOSHARE website https://ecoshare.info/soils/soil-resource-inventory/. I had chatted with @brownag and @dylanbeaudette about it on Twitter and thought I'd give it a go. The idea is to download the ZIP by GDB name (e.g. some_name <- get_SRI(gdb = 'Deschutes', layer = 'MapUnits')) and save it to an object. In addition, there is a get_SRI_layers() function to get the names of the available layers. I added a few simple tests as well. It's pretty simple, but it lets the user get this data with ease from within the R environment. Let me know what you think, thanks!

joshualerickson avatar Oct 23 '22 01:10 joshualerickson

@joshualerickson Overall this looks good, thanks for the PR! Note that the CI failures are unrelated to your changes (this is an issue with one of the grids stored on SoilWeb that was recently updated; see https://github.com/ncss-tech/soilDB/issues/275).

Two minor things:

  • we've been moving away from download.file() and using curl::curl_download() with a custom handle (with a longer timeout and not requiring SSL verification). This is mostly for the vagaries of our USDA network--downloads timing out or HTTPS errors on R 4.2+. So something like this would be consistent with how we are now doing downloads:
curl::curl_download(paste0('https://ecoshare.info/uploads/soils/soil_resource_inventory/', gdb, '_SoilResourceInventory.gdb.zip'), 
  destfile = temp, 
  handle = .soilDB_curl_handle()
)

This pattern might be changed again in the future as there are some options that will be available in R 4.3 for SSL verification, and handles can be specified in base R just as well, but for now we use {curl}.
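The custom handle referenced above is internal to soilDB; a minimal sketch of what such a handle might look like follows (the function name mirrors the internal helper, but the option values here are illustrative assumptions, not soilDB's actual defaults):

```r
# Sketch of a curl handle with a longer timeout and relaxed SSL
# verification, in the spirit of soilDB's internal .soilDB_curl_handle();
# timeout/SSL values are illustrative only
.example_curl_handle <- function(timeout = 300) {
  curl::new_handle(
    timeout = timeout,    # allow slow downloads on restrictive networks
    ssl_verifyhost = 0,   # skip SSL verification (USDA network quirks)
    ssl_verifypeer = 0
  )
}
```

The handle can then be passed to curl::curl_download() via its handle argument, as in the snippet above.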

  • You can use markdown syntax in the documentation if you would prefer (e.g. backticks for code). This is not something we have done retroactively for all docs, but we are moving towards it for easier syntax and readability in the source. No need to change it if it works though!

brownag avatar Oct 24 '22 16:10 brownag

Sounds good @brownag! Thanks for taking a look! I'll change up the download, look into markdown-style docs (interested!), and re-commit. This package has helped me a lot with BAER (post-fire stuff), and this would add some help while in USDA Region 6.

joshualerickson avatar Oct 24 '22 20:10 joshualerickson

I messed around with this a little bit more today.

I have a couple of other small suggestions:

  • get_SRI() could take an argument destdir = tempdir(), allowing the user to customize the destination directory for the download. If !missing(destdir), then you don't need to unlink the ZIP file.
    • To extend this: if gdb is a file path (file.exists(gdb)) rather than the name of one of the surveys, then it can just be loaded from directly. This saves having to download the GDB for each layer you want.
    • It would also be nice if layer were vectorized. If length > 1, return a named list of results (again, so you can get multiple layers from a single download).
  • get_SRI() could take an overwrite = FALSE argument, so that if the file already exists in destdir an informative message is given rather than re-downloading it.
  • If you could spell out the SRI acronym in the docs header that would be good. Might also be worth mentioning that these are for USFS Region 6 in header/description. I like all the details you included.
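Taken together, the destdir/overwrite/path-input ideas above might look roughly like this (the helper name and argument handling are illustrative only; the URL pattern is the one used in the PR):

```r
# Hypothetical sketch of the suggested download bookkeeping for get_SRI();
# not the actual implementation
.get_SRI_download <- function(gdb, destdir = tempdir(), overwrite = FALSE) {
  if (file.exists(gdb)) {
    # gdb is already a path to a local file geodatabase (or ZIP)
    return(gdb)
  }
  dest <- file.path(destdir, paste0(gdb, "_SoilResourceInventory.gdb.zip"))
  if (file.exists(dest) && !overwrite) {
    message("Using existing file: ", dest)
  } else {
    curl::curl_download(
      paste0("https://ecoshare.info/uploads/soils/soil_resource_inventory/",
             gdb, "_SoilResourceInventory.gdb.zip"),
      destfile = dest
    )
  }
  dest
}
```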

And another issue:

  • When I run get_SRI_layers() I get back the table list, but there are just as many warnings. Any idea what causes these? I see that the vsizip and vsimem URLs used internally by st_layers() might be an issue? If there is no way around this then we might want to suppress the warnings.
  • Related to this: we could probably use the /vsizip/ URLs and skip the use of unzip() in get_SRI(). Unzipping the downloaded file might be value added, but it is not strictly necessary for loading into memory--especially when it is a tempfile that will get unlinked.
  • There are also warnings produced when a layer has no simple feature geometries--since this is "expected" for the non-spatial data, those warnings could be suppressed.
library(soilDB)
get_SRI_layers('Deschutes')
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000009.gdbtable (layer MapUnits): No such
#> file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000014.gdbtable (layer TimberSiteIndex):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000016.gdbtable (layer
#> BedrockCharacteristics): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000018.gdbtable (layer
#> LandtypeCharacteristics): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001d.gdbtable (layer
#> SampleSites_ChemicalAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001e.gdbtable (layer
#> SampleSites_SoilMoistureAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000001f.gdbtable (layer MaterialsTesting):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000021.gdbtable (layer
#> SampleSites_MaterialsTesting): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000023.gdbtable (layer
#> SoilMoistureAnalysis): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000024.gdbtable (layer ChemicalAnalysis):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000028.gdbtable (layer
#> Notes_TimberHarvestMethods): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000029.gdbtable (layer
#> Notes_Engineering): No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000046.gdbtable (layer ErosionAndHydro):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000047.gdbtable (layer Engineering): No
#> such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000048.gdbtable (layer Recreation): No
#> such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a00000049.gdbtable (layer TimberManagement):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000004a.gdbtable (layer RangeAndWildlife):
#> No such file or directory
#> Warning in CPL_get_layers(dsn, options, do_count): GDAL Error 4: Cannot
#> open /vsizip//vsimem/http_1/Deschutes_SoilResourceInventory.gdb.zip/
#> Deschutes_SoilResourceInventory.gdb/a0000004b.gdbtable (layer
#> TimberHarvestMethods): No such file or directory
#> Driver: OpenFileGDB 
#> Available layers:
#>                          layer_name geometry_type features fields
#> 1                          MapUnits Multi Polygon        0      0
#> 2                   TimberSiteIndex            NA        0      0
#> 3            BedrockCharacteristics            NA        0      0
#> 4           LandtypeCharacteristics            NA        0      0
#> 5      SampleSites_ChemicalAnalysis         Point        0      0
#> 6  SampleSites_SoilMoistureAnalysis         Point        0      0
#> 7                  MaterialsTesting            NA        0      0
#> 8      SampleSites_MaterialsTesting         Point        0      0
#> 9              SoilMoistureAnalysis            NA        0      0
#> 10                 ChemicalAnalysis            NA        0      0
#> 11       Notes_TimberHarvestMethods            NA        0      0
#> 12                Notes_Engineering            NA        0      0
#> 13                  ErosionAndHydro            NA        0      0
#> 14                      Engineering            NA        0      0
#> 15                       Recreation            NA        0      0
#> 16                 TimberManagement            NA        0      0
#> 17                 RangeAndWildlife            NA        0      0
#> 18             TimberHarvestMethods            NA        0      0
#>           crs_name
#> 1  NAD_1983_Albers
#> 2             <NA>
#> 3             <NA>
#> 4             <NA>
#> 5  NAD_1983_Albers
#> 6  NAD_1983_Albers
#> 7             <NA>
#> 8  NAD_1983_Albers
#> 9             <NA>
#> 10            <NA>
#> 11            <NA>
#> 12            <NA>
#> 13            <NA>
#> 14            <NA>
#> 15            <NA>
#> 16            <NA>
#> 17            <NA>
#> 18            <NA>
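A minimal sketch of the /vsizip/ idea above--reading directly from the remote ZIP via GDAL virtual file systems, so no local download or unzip is needed (the GDB and layer names are just examples):

```r
# Build a GDAL virtual-filesystem path into the remote ZIP; sf/GDAL can
# then enumerate and read layers without a local copy
gdb <- "Deschutes"
dsn <- paste0("/vsizip//vsicurl/",
              "https://ecoshare.info/uploads/soils/soil_resource_inventory/",
              gdb, "_SoilResourceInventory.gdb.zip")
# sf::st_layers(dsn)                    # list available layers
# sf::read_sf(dsn, layer = "MapUnits")  # read one layer into memory
```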

brownag avatar Oct 25 '22 17:10 brownag

Hey @brownag, thanks for the suggestions--they are a way cleaner and better way to get this data! I ended up using the /vsizip//vsicurl/ method, which got rid of the warnings in get_SRI_layers() as well as the need to download the ZIP. Thus, it's now all in memory :).

Also added USFS Region 6 to the header/description to make it clearer, as well as spelling out the SRI acronym.

In addition, I added the option to ask for multiple layers as an argument. It uses a classic for-loop, but that could be changed if needed.

Thanks,

Josh

joshualerickson avatar Oct 27 '22 01:10 joshualerickson

Excellent!

A couple last things before ready for merge:

  • since {sf} is only in Suggests for {soilDB} we need to use requireNamespace() at the beginning of each SRI function. Something like:
if (!requireNamespace("sf")) 
  stop("package `sf` is required", call. = FALSE)
  • I would expose the quiet argument of st_read() in case the user needs less verbose output. The default can be FALSE.
  • Anything making remote requests might need to be wrapped in try() for safety in case of network failure, so the result could possibly be a try-error rather than an sf/data.frame object.
  • It seems that some non-spatial tables have list columns, so the st_read() might need some thinking, or at least error handling as above. For example:
get_SRI("Willamette", "Notes_TimberManagement")
#> Reading layer `Notes_TimberManagement' from data source 
#>   `/vsizip//vsicurl/https://ecoshare.info/uploads/soils/soil_resource_inventory/Wilamette_SoilResourceInventory.gdb.zip' 
#>   using driver `OpenFileGDB'
#> Warning: no simple feature geometries present: returning a data.frame or tbl_df
#> Warning in process_cpl_read_ogr(x, quiet, check_ring_dir = check_ring_dir, :
#> list-column(s) present: in case of failure, try read_sf() or as_tibble=TRUE
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 220, 596, 162, 264, 122, 148
  • There is an R CMD check NOTE about the get_SRI() multi-layer example line being > 100 characters; this can be split onto multiple lines. Thanks for including examples with donttest--I am glad to see they run fairly quickly too! Please also add the Roxygen tag @examplesIf curl::has_internet() && requireNamespace("sf")
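The requireNamespace()/try() suggestions above could combine into something like this (function body abbreviated; .sri_dsn() is a hypothetical helper that builds the data source path):

```r
# Sketch: guard the Suggests dependency and the remote request, so a
# network failure yields an informative try-error instead of a crash
get_SRI_layers <- function(gdb) {
  if (!requireNamespace("sf"))
    stop("package `sf` is required", call. = FALSE)
  res <- try(sf::st_layers(.sri_dsn(gdb)))
  if (inherits(res, "try-error")) {
    message("Could not read layers for '", gdb, "' (network failure?)")
  }
  res
}
```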

A few other thoughts that are not necessary:

  • It might be easier for the user if get_SRI_layers() returned a normal data.frame rather than a sf_layers object. The sf_layers object looks like a list/data.frame but does not have the latter properties in all cases.

  • It would be nice if the gdb argument wasn't case-sensitive, and there was some validation of the names against the list you have included in the docs rather than a failure w/ the URL.

    • One option could be to define an (internal) function that returns possible GDB names as a character vector that can then be checked with e.g. match.arg(tolower(gdb), choices = .SRI_gdb_names()). This winds up hard coding the names... but might provide more informative errors to user and allow for a bit more flexibility in input.
  • Any idea if there are other SRI type datasets out there for other FS regions? Just curious as to whether we could further generalize this approach by being able to ID alternate base URL and filenames to use and whether that is something to consider.

  • The for-loop is fine, as these lists do not have that many elements in them. My personal tendency would be to use lapply(), but tomato, tomahto :)... You could remove the second st_read() call for the length-1 case after the else and just return(sri[[1]]) when length(layers) == 1. Normally when I do something like this, I also add a simplify = TRUE argument, which allows the result to be returned as a list if simplify = FALSE.

  • Curious whether you have thought about what fetchSRI() might look like in the future. Usually we have the fetchX() functions as higher level wrappers around get_X() functions. Most but not all of them return SoilProfileCollections. Here is an idea to think on:

fetchSRI <- function(gdb) {
  lapply(gdb, function(x) {
    suppressWarnings(get_SRI(x, layers = get_SRI_layers(x)$name))
  })
}
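The match.arg() validation idea above can be sketched in base R (the helper name and the name list are hypothetical; a real version would enumerate all Region 6 GDBs and map the cleaned input back to the casing used in the file names):

```r
# Hypothetical list of valid GDB names, normalized to lower case
.SRI_gdb_names <- function() {
  c("deschutes", "giffordpinchot", "willamette")  # subset for illustration
}

# Normalize user input (lower-case, drop spaces), then validate against
# the known names; several.ok = TRUE vectorizes the check
.clean_SRI_gdb <- function(gdb) {
  x <- gsub(" ", "", tolower(gdb))
  match.arg(x, choices = .SRI_gdb_names(), several.ok = TRUE)
}

.clean_SRI_gdb("Gifford Pinchot")  # "giffordpinchot"
```

An unrecognized name then fails with match.arg()'s informative "'arg' should be one of ..." error rather than a cryptic URL failure.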

brownag avatar Oct 27 '22 05:10 brownag

Hey @brownag great stuff as usual!

I'll go through the suggestions below.

  1. since {sf} is only in Suggests for {soilDB} we need to use requireNamespace() at the beginning of each SRI function. Something like: if (!requireNamespace("sf")) stop("package sf is required", call. = FALSE)

added to both functions

  2. I would expose the quiet argument for st_read() in case user needs less verbose output. Default can be FALSE.

added as an argument

  3. Anything making remote requests might need to be wrapped in try() for safety in case network failure, so possibly result could be try-error rather than sf/data.frame

added try() around the function calling the /vsizip//vsicurl/ endpoint. Good call on this one! Now just adds the error to the list.

  4. It seems that some non-spatial tables have list columns, so the st_read() might need some thinking, or at least error handling as above.

This was fixed by changing the function to read_sf() instead of st_read().

  5. There is a R CMD check NOTE about the length of the get_SRI() multi-layer example line >100 characters. This can be split onto multiple lines.

Fixed by splitting

  6. Please also add Roxygen tag @examplesIf curl::has_internet() && requireNamespace("sf")

Added to the Roxygen setup

A few other thoughts that are not necessary:

  1. It might be easier for the user if get_SRI_layers() returned a normal data.frame rather than a sf_layers object. The sf_layers object looks like a list/data.frame but does not have the latter properties in all cases.

Done by adding list2DF().

  2. It would be nice if the gdb argument wasn't case-sensitive, and there was some validation of the names against the list you have included in the docs rather than a failure w/ the URL.

One option could be to define an (internal) function that returns possible GDB names as a character vector that can then be checked with e.g. match.arg(tolower(gdb), choices = .SRI_gdb_names()). This winds up hard coding the names... but might provide more informative errors to user and allow for a bit more flexibility in input.

Took a stab at this. It now handles both upper and lower case, with or without spaces, as 'matching' args--e.g. GiffordPinchot, giffordpinchot, Gifford Pinchot, and gifford pinchot all work.

  3. Any idea if there are other SRI type datasets out there for other FS regions? Just curious as to whether we could further generalize this approach by being able to ID alternate base URL and filenames to use and whether that is something to consider.

Not sure on this one... I can check with some others and see if there is something similar in other regions.

  4. The for-loop is fine as these lists do not have that many elements in them. My personal tendency would be to use lapply() but tomato tomato :)... could remove the second st_read() call for the length 1 case after the else and just return(sri[[1]]) when length(layers)==1. Normally when I do something like this I also have a simplify=TRUE argument required to return which allows it to be returned as list if simplify=FALSE.

I like this. I added simplify to the arguments and now does the above suggestion.

  5. Curious whether you have thought about what fetchSRI() might look like in the future. Usually we have the fetchX() functions as higher level wrappers around get_X() functions. Most but not all of them return SoilProfileCollections. Here is an idea to think on:

fetchSRI <- function(gdb) { lapply(gdb, function(x) { suppressWarnings(get_SRI(x, layers = get_SRI_layers(x)$name)) }) }

🤔 Not sure I've thought about this but I think I like it :). I'll leave it up to you.

Thanks for all your insight and support in all this!

joshualerickson avatar Oct 29 '22 16:10 joshualerickson

Thanks for the contribution. Just back from travel, I'll test out this week. What sort of work would these data support? I've never used them and some context would help me evaluate / think of ways to describe in the documentation.

dylanbeaudette avatar Oct 31 '22 16:10 dylanbeaudette

Thanks for the updates Josh.

The only remaining issue is that list2DF() in get_SRI_layers() only works with "columns" that have the same number of elements.

Something like this might work more generally--not sure if there is a cleaner/more canonical way to do it:

as.data.frame(sapply(layers, I))

brownag avatar Oct 31 '22 16:10 brownag

Hey @dylanbeaudette, these data would help support areas where SSURGO was not surveyed, apparently (I'm not a soil scientist). I used this Soil Resource Inventory (SRI) on a recent Burned Area Emergency Response (BAER) assignment within the Hell's Canyon National Recreation Area (sweet place btw!) to get hydrologic soil group categories for post-fire flow modeling. I usually get this from SSURGO, thanks to you guys, but in this case it wasn't there... From the @details:

"Due to the fact that many USFS Region 6 Forests do not have NRCS SSURGO surveys (at a scale of 1:24,000, these are the highest-resolution soils data generally available), USFS Region 6 initiated a project in 2012 to bring these legacy SRI soils data into digital databases to facilitate their use in regional planning activities."

This would fill that void; where SSURGO doesn't exist but SRI does.

Thanks,

Josh

joshualerickson avatar Oct 31 '22 16:10 joshualerickson

Hey @brownag, I changed list2DF to use as.data.frame(sapply(layers, I)) instead (I is a cool function; didn't know about it). I also added the fetchSRI() function as a higher-level wrapper. I think I like this, as it makes it possible to get multiple GDBs if needed. Let me know what you think, thanks!

joshualerickson avatar Nov 02 '22 13:11 joshualerickson


Thanks Josh. I think that what you and Andrew have put together will be really helpful to our staff and beyond. Thanks again for your contributions.

dylanbeaudette avatar Nov 02 '22 18:11 dylanbeaudette

@joshualerickson This works for me and is basically good to go! Thanks for making all these revisions!

A few more :) tiny things

  • You should re-use .get_SRI_gdb_names() in fetchSRI(); should be able to vectorize the name-cleaning function by using match.arg(..., several.ok=TRUE)
  • If the lapply() in fetchSRI() is changed to sapply() the names of the GDBs will be preserved in the result list object without having to set them manually--on further reflection I think the highest level of the result list should be named, and if .get_SRI_gdb_names() is used as above they will be the standard names
  • In get_SRI() can we add an as.data.frame(sri) so the non-spatial results are data.frame not tibble? This is a read_sf/st_read thing I think. Or just use as_tibble=FALSE

I have a few last minor documentation requests.

  • Put your name in under @author Roxygen tag for each of the .Rd files
  • fetchSRI() documentation @description tag is misspelled, generating a warning
  • The first time you reference "gdb" in each .Rd, spell out "File Geodatabase", then add the parenthetical "(GDB)", and then use the upper-case acronym for subsequent uses (where you are not referring to the argument gdb but rather the file names of the geodatabases)
  • To fetchSRI docs add:
@seealso `get_SRI()` `get_SRI_layers()`
  • To get_SRI docs add:
@seealso `get_SRI_layers()`

brownag avatar Nov 02 '22 20:11 brownag

Alright @brownag I think we should be good! But, let me know what you think and we can go from there. Thanks again for working through this with me!

✅ You should re-use .get_SRI_gdb_names() in fetchSRI(); should be able to vectorize the name-cleaning function by using match.arg(..., several.ok=TRUE)

✅ If the lapply() in fetchSRI() is changed to sapply() the names of the GDBs will be preserved in the result list object without having to set them manually--on further reflection I think the highest level of the result list should be named, and if .get_SRI_gdb_names() is used as above they will be the standard names

✅ In get_SRI() can we add an as.data.frame(sri) so the non-spatial results are data.frame not tibble? This is a read_sf/st_read thing I think. Or just use as_tibble=FALSE

✅ Put your name in under @author Roxygen tag for each of the .Rd files

✅ fetchSRI() documentation @description tag is misspelled, generating a warning

✅ First time you reference "gdb" in each .Rd spell out " File Geodatabase" then add parenthetical "(GDB)" and then use the upper-case acronym for subsequent use (where you are not referring to the argument gdb but rather the file names of the geodatabases)

✅ added @seealso to each fetchSRI() and get_SRI()

joshualerickson avatar Nov 03 '22 01:11 joshualerickson

I am very happy with this. Thanks for all the great work @joshualerickson!

brownag avatar Nov 03 '22 02:11 brownag