
[WIP] gSSURGO Enhancements: SOC Integration, Enhanced XML Parsing, and Ensemble Improvements


Description

This PR significantly improves the extract_soil_gssurgo function for generating soil property ensembles from gSSURGO data. It introduces robust XML parsing, support for soil organic carbon (SOC) calculation, improved uncertainty handling, dynamic depth handling, and fixes for ensemble generation and NetCDF output.

Key Enhancements:

  • Added Fields: Queries chorizon.om_r (organic matter) and chorizon.dbthirdbar_r (bulk density).
  • SOC Calculation:
    • Uses the Van Bemmelen factor (SOC = OM / 1.724).
    • Computes SOC stock:
    horizon_thickness_m * (soc_percent / 100) * bulk_density * 10  
    
  • Simulates SOC ensembles via a gamma distribution, which guarantees positive-only values (a minimal sketch follows this list).
  • Avoids memory leaks and malformed-XML errors via temporary file handling.
  • Extracts mukeys from the raw XML text if standard parsing fails.
  • Automatically extends the depths vector if the observed soil profile exceeds the user-specified layers.
  • Uses pmax/pmin to ensure valid depth indices, avoiding subscript-out-of-bounds errors.
  • Valid size handling: coerces the ensemble size to ≥ 1 to prevent seq_len() errors.
  • Proper mukey_area initialization: creates it from simulated data instead of filtering undefined variables.
  • Uses min(x, nrow(.x)) to avoid invalid row access.
  • Includes SOC in the output by removing the [1:4] subsetting.
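
A minimal sketch of the gamma-based ensemble draw and index clamping described above (all names and numbers are illustrative, not the PR's actual code):

    # Moment-matched gamma draw: shape = mean^2 / var, rate = mean / var
    soc_mean <- 5.2                     # kg/m2 (hypothetical)
    soc_sd   <- 1.1                     # kg/m2 (hypothetical)
    requested_size <- 0                 # e.g. a degenerate user input
    ens_size <- max(1, requested_size)  # coerce to >= 1 so seq_len() is valid
    shape <- (soc_mean / soc_sd)^2
    rate  <- soc_mean / soc_sd^2
    soc_ens <- stats::rgamma(ens_size, shape = shape, rate = rate)

    # Clamp depth indices into the valid range with pmax/pmin
    depths <- c(0.05, 0.15, 0.30, 0.60)          # layer depths (m)
    idx    <- c(0, 2, 7)                         # raw indices, possibly out of range
    idx    <- pmax(1, pmin(idx, length(depths))) # now within 1..length(depths)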

Motivation and Context

Review Time Estimate

  • [ ] Immediately
  • [ ] Within one week
  • [ ] When possible

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [ ] My change requires a change to the documentation.
  • [ ] My name is in the list of CITATION.cff
  • [ ] I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • [ ] I have updated the CHANGELOG.md.
  • [ ] I have updated the documentation accordingly.
  • [ ] I have read the CONTRIBUTING document.
  • [ ] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.

divine7022 · Jun 05 '25 11:06

@divine7022 Thank you for these improvements!!!

Could you please:

  • [x] resolve conflicts in extract_soil_nc.R?
  • [x] Add your name to the function authors and CITATION.cff if you haven't already done so
  • [x] Update Changelog
  • [x] Add tests for new functionality? These don't have to comprehensively cover corner cases; they're mostly aimed at identifying whether any essential features break in the future.

dlebauer · Jun 16 '25 19:06

Alternatively, we also had another option: for more spatially accurate and high-throughput workflows, we can extract MUKEYs and soil properties directly from a local gSSURGO raster (.tif) file (e.g., MapunitRaster_10m.tif), as sketched after this list:

  • This ensures perfect alignment with the official gSSURGO raster grid provided by NRCS.
  • It is computationally efficient for large areas or bulk processing.
  • However, it requires downloading and storing raster files locally, which can significantly increase disk usage.
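
A hypothetical sketch of that local-raster lookup using the terra package (the file path and coordinates are illustrative):

    library(terra)
    # Point of interest (WGS84), reprojected to the raster's CRS before extraction
    r  <- rast("MapunitRaster_10m.tif")          # hypothetical local gSSURGO raster
    pt <- vect(data.frame(lon = -88.0, lat = 40.0),
               geom = c("lon", "lat"), crs = "EPSG:4326")
    mukey <- extract(r, project(pt, crs(r)))     # MUKEY at the point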

I'm currently using a grid-based sampling approach to retrieve MUKEYs and associated soil properties (see the sketch after this list):

  • A grid is generated, centered on the target lat/lon.
  • For each grid cell center, lat/lon coordinates are calculated.
  • The corresponding MUKEY for each location is then retrieved via WFS (Web Feature Service) requests.
  • This method enables spatially flexible sampling and is suitable when local raster data (should we adopt them in the future) are not available or when disk space is limited.
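
A minimal sketch of the grid construction (spacing, grid size, and the WFS helper name are all illustrative):

    lat <- 40.0; lon <- -88.0
    grid_spacing <- 0.001    # degrees (hypothetical)
    n <- 3                   # 3 x 3 grid of cell centers
    offsets <- seq(-(n - 1) / 2, (n - 1) / 2) * grid_spacing
    centers <- expand.grid(lat = lat + offsets, lon = lon + offsets)
    # One WFS request per center; `get_mukey_wfs` is a hypothetical helper
    # standing in for the actual query code.
    # mukeys <- mapply(get_mukey_wfs, centers$lat, centers$lon)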

divine7022 · Jul 12 '25 22:07

Heads up: after investigating the gSSURGO unit specifications, I've confirmed the correct units from https://www.nrcs.usda.gov/sites/default/files/2022-08/SSURGO-Metadata-Tables-and-Columns-Report.pdf

  • soil_depth = "hzdept_r" (cm)
  • soil_depth_bottom = "hzdepb_r" (cm)
  • bulk_density = "dbthirdbar_r" (g/cm3)

Updated SOC calculation

      # Applied within the horizon-level mutate; `horizons` is an illustrative
      # name for the per-horizon data frame.
      horizons <- dplyr::mutate(
        horizons,
        horizon_thickness_cm = soil_depth_bottom - soil_depth,
        # Van Bemmelen factor conversion: OM to SOC
        soc_percent = organic_matter_pct / 1.724,
        # SOC (kg/m2) = SOC (frac) x bulk density (kg/m3) x depth (m)
        soc_kg_m2 =
          PEcAn.utils::ud_convert(horizon_thickness_cm, "cm", "m") *
          (soc_percent / 100) *
          PEcAn.utils::ud_convert(bulk_density, "g cm-3", "kg m-3"),
        soil_organic_carbon_stock = soc_kg_m2
      )
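
As a quick unit sanity check with illustrative numbers (30 cm horizon, 3.45% organic matter, bulk density 1.30 g/cm3):

      om_pct      <- 3.45
      soc_pct     <- om_pct / 1.724                                     # ~2.0 % SOC
      thickness_m <- PEcAn.utils::ud_convert(30, "cm", "m")             # 0.30 m
      bd_kg_m3    <- PEcAn.utils::ud_convert(1.30, "g cm-3", "kg m-3")  # 1300 kg/m3
      soc_kg_m2   <- thickness_m * (soc_pct / 100) * bd_kg_m3           # ~7.8 kg/m2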

divine7022 · Aug 19 '25 12:08

@dlebauer @infotroph Thanks for all your valuable suggestions. I hope all the concerns have been addressed; can you merge this?

divine7022 · Aug 19 '25 12:08

tl;dr: The proposed sampling will not work and needs a change of approach.

I'm sorry that it took me so long to start reviewing and then so long after that to understand the limitations of your proposed method. Now that I've wrapped my head around it, let me lay out the context:

The first thing to know is that this function is secretly misnamed: gSSURGO is "gridded SSURGO", but the existing gSSURGO.Query function does not perform any raster operations at all. It submits (unprojected) point coordinates and a buffer radius (in meters) to the NRCS WFS server, which looks them up in the non-gridded SSURGO soil map and returns all the polygons that intersect the circular buffer. If you knew this already then great; if you were expecting the sampling process to operate on a raster map then this is an important difference. Notice especially that circular buffers with radius grid_spacing placed grid_spacing apart (as you're doing here) produce heavily overlapping queries (two circles of radius r overlap whenever their centers are less than 2r apart, and here adjacent centers are only r apart), so small changes in grid placement can make a given feature be counted at anywhere between 1 and 4 grid points.

The second thing to know is that I broke the existing code three years ago!

The original pre-2022 design was to submit a WFS request returning a GML file containing the full polygons of every map unit intersecting the buffer, then parse those into spatial objects and intersect each of them with the buffer to determine the area each contributes to the sample.

In #2964 I apparently misunderstood all of that (though given my comment, my spidey sense did tingle a little) and thought that all the local spatial operations did nothing but extract the mukeys, so I deleted all of them and changed the WFS query to return the mukeys directly instead of GML. At some point after that was merged, I noticed that this left mukey_area undefined, felt embarrassed but was glad that it would break loudly rather than return incorrect results, and put it on my list to fix eventually.

I'm really, really glad you've stepped up to fix it! All my comments here are trying to be sure the fix returns valid results.

I see three options to get to valid results:

  • Minimal change from the current state of the PR: reduce the query radius to grid_spacing / 2 or less, so that queries don't overlap, and possibly increase the number of grid points. My gut says you need a lot more than 9 points to get a reliable estimate of variability, but I don't like the prospect of needing >9x as many queries to do something the WFS server can do in one request.
  • Closest to the original intention: undo a bunch of my 2022 deletions; go back to requesting a GML result containing the full polygons; process them locally to determine relative area. Note that this will be more complicated than just reverting my commit, because Hamze's code used the rgdal package, which is now retired.
  • Biggest change but biggest long-term improvement, and should be a different PR: delegate all the SSURGO details to the soilDB R package, which is very good and is written and maintained by folks who know SSURGO inside-out (a rough sketch follows this list).
    • I think the core query would be something like soilDB::mukey.wcs(list(aoi = c(xmin, ymin, xmax, ymax), crs = "EPSG:4326")), which returns a raster of mukeys (one per pixel) that can then be aggregated the same way as your grid-sampled results.
    • Property retrieval for each desired layer would be something like soilDB::get_SDA_property(mukeys = keys, property = c("om_r", "dbthirdbar_r", ...), top_depth = layer_top, bottom_depth = layer_bottom, method = "Weighted Average", ...). If we don't like their aggregation methods, we could set method = "None" and process the individual component horizons as desired.
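
A rough, untested sketch of that soilDB workflow (the bounding box, layer depths, and variable names are assumptions):

    library(soilDB)
    library(terra)

    # Raster of map unit keys over a small WGS84 bounding box
    aoi  <- list(aoi = c(-88.01, 39.99, -87.99, 40.01), crs = "EPSG:4326")
    mu   <- mukey.wcs(aoi, db = "gSSURGO")
    keys <- unique(na.omit(as.vector(values(mu))))

    # Area-weighted properties for a hypothetical 0-30 cm layer
    props <- get_SDA_property(
      property     = c("om_r", "dbthirdbar_r"),
      method       = "Weighted Average",
      mukeys       = keys,
      top_depth    = 0,
      bottom_depth = 30
    )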

infotroph · Aug 26 '25 17:08