pygeoapi split_catchment does not return same geometry as get_basin
What happened?
I utilized your tutorial to split a small watershed at the location of the gage. The geometry results of split_catchment do not match those of the NLDI get_basin. I pulled the NHD flowlines for high and medium resolution. It looks like the difference in geometry is associated with the difference in flowline resolution. Is it safe to assume that split_catchment does not use high resolution flowlines/DEM data for delineation?
What did you expect to happen?
I expected to be able to split the basin at the gage to get upstream watershed geometry that matches the NLDI get_basin geometry.
I was able to use split_catchment successfully on two other small watersheds with no discrepancies. It's possible I found an edge case with this specific gage? I wouldn't expect the difference to matter much on larger watersheds but it could potentially on small watersheds.
Minimal Complete Verifiable Example
from pygeohydro import NWIS
from pynhd import NLDI
import geopandas as gpd
import shapely.geometry as sgeom
import pynhd
import matplotlib.pyplot as plt
station_id = 'USGS-06896500'
site = NWIS().get_info({'site': station_id.split('-')[-1]})
basin = NLDI().get_basins(station_id, split_catchment=False, simplified=False)
gdf = gpd.GeoDataFrame(
{
"upstream": [
False,
]
},
geometry = [sgeom.Point((site.geometry.iloc[0].coords[0]))],
crs=site.crs,
)
split = pynhd.pygeoapi(gdf, 'split_catchment')
nhd_hr = pynhd.NHDPlusHR("flowline")
nhd_mr = pynhd.NHD('flowline_mr')
flw_mr = nhd_mr.bygeom(basin.geometry.iloc[0], basin.crs)
flw_hr = nhd_hr.bygeom(basin.geometry.iloc[0], basin.crs)
ax = split.plot(figsize=(8, 8), facecolor="lightgrey", edgecolor="black", alpha=0.8) #, label='pygeoapi split_catchment')
site.plot(ax=ax, color='r')
basin.boundary.plot(ax=ax, color='r', label='get_basin similified=False')
plt.title(f'{station_id} get_basin vs split_catchment comparison')
flw_hr.plot(ax=ax, color='C0', label='flowline hr')
flw_mr.plot(ax=ax, color='C2', label='flowline mr')
plt.legend(loc='upper left')
MVCE confirmation
- [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue.
- [x] Complete example — the example is self-contained, including all data and the text of any traceback.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
Anything else we need to know?
No response
Environment
SYS INFO
commit: None python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 186 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: ('English_United States', '1252')
PACKAGE VERSION
async-retriever 0.19.1 pygeoogc 0.19.0 pygeoutils 0.19.0 py3dep 0.19.0 pynhd 0.19.0 pygridmet N/A pydaymet N/A hydrosignatures 0.19.0 pynldas2 N/A pygeohydro 0.19.0 aiohttp 3.11.11 aiohttp-client-cache 0.12.4 aiosqlite 0.20.0 cytoolz 1.0.1 ujson 5.10.0 defusedxml 0.7.1 joblib 1.4.2 multidict 6.1.0 owslib 0.32.1 pyproj 3.7.0 requests 2.32.3 requests-cache 1.2.1 shapely 2.0.6 url-normalize 1.4.3 urllib3 2.3.0 yarl 1.18.3 geopandas 1.0.1 netcdf4 1.7.2 numpy 2.2.2 rasterio 1.4.3 rioxarray 0.18.2 scipy 1.15.1 xarray 2025.1.1 click 8.1.8 networkx 3.4.2 pyarrow 19.0.0 folium 0.19.4 h5netcdf 1.4.1 matplotlib 3.10.0 pandas 2.2.2 numba N/A py7zr N/A pyogrio 0.10.0
Is it safe to assume that split_catchment does not use high resolution flowlines/DEM data for delineation?
Right, the backbone network of NLDI is currently NHD MR. They are working on adding support for HR too. Currently, there aren't many web services that use the HR version since the network itself is still a work in progress and missing some important elements such as catchments.
Also, the source code for the split catchment functionality that PyGeoAPI uses is available here. You can install and work with it locally!
EDIT: I will look into the discrepancy that you mentioned and let you know if I find anything interesting.
I have encountered an error with the example code above that I did not encounter when I originally posted. It seems the split_catchment is not working correctly for this gage's location data. I am able to use split_catchment just fine using the same code with 2 other gages.
This figure is what was produced with the code in my original post.
However, when I run the code today (and remove some of the additional plotting) this is what I get. The basin is "split" but only into a small box that intersects the gage location data.
The split catchment is really a work around I'm using for small watersheds whose extent is more impacted by the imprecision of the NLDI delineated catchments. I did apply your answer from my other question and used gagesii which DOES give the watershed as delineated from the actual gage location. However, some of the gages I am interested in (like the one in this example) are not in the gagesii database.
I looked at the USGS page for my current example gage and another that I have noticed are not represented accurately with the watershed spatial data that I pull. In the meta data section, the watershed area given DOES match the area delineated from the specific gage location. So USGS has this data and presumably (?) the accompanying geospatial data. Where is the proper place to get that information programmatically? I don't see it in the Information web service. gagesii seems to be the right place for spatial data for gages that meet their criteria. split_catchment is an acceptable work around, but I don't know why the behavior has changed in my example.
The problem with split catchment is that it can be difficult to make it work at scale without tweaking parameters. Anders Hopkins, who is the developer of nldi-flowlines, has been doing a fantastic job of addressing similar issues. I sent him around 200 cases that did not work correctly, and he was able to fix most of them. Regarding the station area that NWIS provides, you can use pygeohydro.NWIS. They do not have the geometry, but they contain drainage area estimates. Just keep in mind that GagesII basins are derived from NHD v1 whereas NLDI is derived from NHD v2. So, there's going to be differences between what you get from WaterData and NLDI, regardless of the split issue.
That said, here's the code I normally use for catchment splitting that uses PyWBT package.
from pynhd import NLDI
from pygeohydro import NWIS
import pywbt
import seamless_3dep as s3dep
from tempfile import TemporaryDirectory
import geopandas as gpd
import shapely
id = "06896500"
basin = NLDI().get_basins(f"USGS-{id}", simplified=False)
geom = basin.to_crs(4326).union_all()
station = NWIS().get_info({"site": id})
dem_paths = s3dep.get_dem(geom.bounds, "data")
dem = s3dep.tiffs_to_da(dem_paths, geom.bounds)
with TemporaryDirectory(dir=".") as tmpdir:
dem_name = f"{tmpdir}/dem.tif"
dem.rio.to_raster(dem_name)
outlet_name = f"{tmpdir}/outlet.shp"
station[["geometry"]].to_file(outlet_name)
wbt_args = {
"BreachDepressions": [f"-i={dem_name}", "--fill_pits", "-o=dem_corr.tif"],
"D8Pointer": ["-i=dem_corr.tif", "-o=fdir.tif"],
"Watershed": [
"--d8_pntr=fdir.tif",
f"--pour_pts={outlet_name}",
"-o=watersheds.tif",
],
"RasterToVectorPolygons": [
"-i=watersheds.tif",
"-o=watersheds.shp",
],
}
save_dir = "data"
pywbt.whitebox_tools(
tmpdir, wbt_args, files_to_save=("watersheds.shp",), save_dir=save_dir
)
split = gpd.read_file("data/watersheds.shp")
ax = basin.plot(color="none", edgecolor="black", lw=2)
split.plot(ax=ax, color="none", edgecolor="blue", lw=2)
station.plot(ax=ax, color="red", markersize=50)
ax.set_axis_off()
Great thank you for the information. I have come across WBT before but have never used it.
Regarding getting area from pygeohydro.NWIS, I had not turned on the nhd_info flag so thanks for pointing that out. But my question remains. I accessed area that way, but it still does not match the area that I find on the site page. For example, the drainage area for this gage as shown on it's site web page, is less than half the area that I get through pygeohydro.NWIS. As I understand it, the issue is related to how the NLDI indexing works, but it doesn't explain where USGS is getting the area they are displaying on the site web page. I don't think that is related to NHD v1 vs v2 is it? I'm not stressing over it with these other workarounds, but it's still unclear to me where the "real" watershed size above the gage that is being displayed on the website actually lives.
As an inelegant solution, I tried moving the point location by small amounts and retrying split_catchment. For whatever reason, I got it to work again after moving the point 1 m..
offset_m = 1
x_m = sites[sites.site_no == gage].geometry.to_crs(5070).iloc[0].x + 0
y_m = sites[sites.site_no == gage].geometry.to_crs(5070).iloc[0].y - offset_m
new_pts_df = gpd.GeoDataFrame({'geometry': sgeom.Point((x_m, y_m))}, index=['test'], crs='5070')
new_pts_df = new_pts_df.to_crs(crs)
###
gdf = gpd.GeoDataFrame(
{
"upstream": [
True,
]
},
# geometry = [sgeom.Point((sites[sites.site_no == gage].geometry.iloc[0].coords[0]))],
geometry = [sgeom.Point((new_pts_df.geometry.iloc[0].x, new_pts_df.geometry.iloc[0].y))],
crs=crs,
)
Right, I didn't explicitly answer your question. The issue with drainage area is one of those fundamental issues in hydrology. There are so many edge cases both computationally and theoretically that you cannot confidentially judge and define what's the "correct" value. So, discrepancies are bound to exist.
For your specific question, the main reason that splitting is very difficult to generalize is the "snapping" part. When you specify a point, that point has to be moved (snapped) to a streamflow. This snapping has to be done based on the drainage network (flow direction to be more specific) derived for the area. This snapping should also follow the natural overland flow direction, so you have to do a tracing operation to find the nearest downstream point on the drainage network from the point of interest, which can be very tricky. If you look at the nldi-flowlines source, there's a complex logic for this exact purpose that has some hard coded (arbitrary) values to generalize this to some extent. In the code that I shared, the main reason that it worked better than nldi-flowlines is that I am using the 10-m resolution DEM whereas the split_catchment web service uses the flow direction dataset (FDR) generated for the NHD V2 which is from 30-m DEM. With FDR derived from 10-m DEM, it's easier to snap (which WBT does under-the-hood) to the drainage network. In your workaround, by moving the point, you actually helped with the snapping part which made it work. For this reason, I often prefer to rely on the 10-m derived FDR using WBT, especially when I am dealing with many points that manual editing of points' location is not an option.
@cheginit Thanks for taking the time to answer, provide additional insight, and share your code. I also hadn't noticed your release of seamless-3dep
Sure! Right, I recently released four new packages. In addition to Seamless3DEP and PyWBT that I already mentioned, I've released TinyRetriever and CatSmoothing.