dataRetrieval icon indicating copy to clipboard operation
dataRetrieval copied to clipboard

findNLDI returns error when API return empty json

Open mps9506 opened this issue 2 years ago • 20 comments

Hi,

This isn't fully a bug.

Currently findNLDI returns a somewhat cryptic message and errors when making a query that returns nothing (valid query but nothing is found).

For example I know the following will return nothing: https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2

{"type":"FeatureCollection"}

Here is what findNLDI returns when it encounters the same thing:

findNLDI(wqp = "TCEQMAIN-10016",
    nav = "UM",
    find = "nwissite",
    distance_km = 2,
    no_sf = FALSE)

Error: Cannot open "{"type":"FeatureCollection"}"; Check connection parameters.
In addition: Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 1: Invalid FeatureCollection object. Missing 'features' member.
2: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 1: Layer schema generation failed.
3: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 4: Failed to read GeoJSON data

I think for a user expecting a response, the current error and warning messages are hard to troubleshoot. The previous iteration of navigate_nldi in nhdPlusTools returned an empty tibble in a similar scenario: https://github.com/USGS-R/nhdplusTools/issues/97. Having a message indicating that no features were returned and returning a NULL or empty object/list might be beneficial for users.

As always, thanks for this awesome package!!! Happy to make a pull request if there is a way you prefer this handled.

mps9506 avatar Jul 06 '22 21:07 mps9506

Possibly related, or at least could be resolved in the same place in the code https://github.com/internetofwater/nldi-services/issues/315

ksonda avatar Jul 08 '22 13:07 ksonda

@dblodgett-usgs or @mikejohnson51 can one of you take a look?

ldecicco-USGS avatar Jul 08 '22 13:07 ldecicco-USGS

@ldecicco-USGS, @mps9506

Here is a working option that I can send as a PR:

library(dataRetrieval)
findNLDI(wqp = "TCEQMAIN-10016",
              nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = FALSE)

#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-08 by the reprex package (v2.0.1)

The "design" choice I used in this NLDI client was that if data didn't exist it would not be returned. So in the above case, since no upstream gages were found in 2km, there is no nwis data. This is different than the nhdplusTools design in which an empty tibble is returned of all requested data. For my use cases, the existing option is ideal, but I can see many cases where it could raise confusion. I am not wed to one or the other and changing the output to an empty tibble would be easy, as would eliciting a message that no data was found.

What do you think?

mikejohnson51 avatar Jul 08 '22 15:07 mikejohnson51

@mikejohnson51 Thanks! I think this approach plus a message makes sense.

mps9506 avatar Jul 08 '22 16:07 mps9506

Awesome - now we're looking at something like this:

library(dataRetrieval)
xx = findNLDI(wqp = "TCEQMAIN-10016",
              nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = FALSE)
#> No data found for: nwissite?distance=2

xx
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-08 by the reprex package (v2.0.1)

mikejohnson51 avatar Jul 08 '22 17:07 mikejohnson51

The response of {"type":"FeatureCollection"} is an issue with how the API is assembling the response. I would expect it to return something like this for an empty result.

{
    "type": "FeatureCollection",
    "features": []
}

I would expect findNLDI to handle it normally from there. I will create an issue in the NLDI repository to correct this.

EthanGrahn avatar Jul 08 '22 18:07 EthanGrahn

@mps9506 and @EthanGrahn , do you all want to try installing like this:

remotes::install_github("USGS-R/dataRetrieval")

and let us know if you think we can close this issue?

ldecicco-USGS avatar Jul 08 '22 22:07 ldecicco-USGS

I've updated the NLDI to output my previous example for empty results. This seems to introduce a different set of errors. It may be an issue with my tests because I don't typically work with R. Can anybody else confirm this? From what I can tell, findNLDI checks for an empty response (i.e. an empty string) to determine whether no features were found. It should, instead, be accounting for empty GeoJSON.

EthanGrahn avatar Jul 11 '22 19:07 EthanGrahn

I updated to the development version and received the following error:

> findNLDI(wqp = "TCEQMAIN-10016",
+          nav = "UM",
+          find = "nwissite",
+          distance_km = 2,
+          no_sf = FALSE)
Error in if (sf::st_geometry_type(tmp)[1] == "POINT") { : 
  missing value where TRUE/FALSE needed

As @EthanGrahn notes, I believe this is because read_sf() now returns an empty sf object:

> df <- readLines("https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2")
> sf::read_sf(df)
Simple feature collection with 0 features and 0 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 1
# … with 1 variable: geometry <GEOMETRY [°]>

> sf::read_sf(df) |> sf::st_coordinates()
     [,1] [,2]

sessionInfo:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dataRetrieval_2.7.11.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3       compiler_4.2.1     pillar_1.7.0       class_7.3-20       remotes_2.4.2     
 [6] tools_4.2.1        jsonlite_1.8.0     lifecycle_1.0.1    tibble_3.1.7       pkgconfig_2.0.3   
[11] rlang_1.0.3        DBI_1.1.3          cli_3.3.0          rstudioapi_0.13    curl_4.3.2        
[16] e1071_1.7-11       dplyr_1.0.9        httr_1.4.3         xml2_1.3.3         generics_0.1.3    
[21] vctrs_0.4.1        classInt_0.4-7     grid_4.2.1         tidyselect_1.1.2   glue_1.6.2        
[26] sf_1.0-7           R6_2.5.1           fansi_1.0.3        purrr_0.3.4        magrittr_2.0.3    
[31] ellipsis_0.3.2     units_0.8-0        assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.2        

mps9506 avatar Jul 11 '22 19:07 mps9506

This raises another question on my end. Would it be better for the end user to receive this empty collection, or be returned a 404 instead?

As a comparison, searching for an invalid source or ID will return a 404 with an error message. https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/fake-id?f=json

EthanGrahn avatar Jul 11 '22 20:07 EthanGrahn

@mps9506 I'll need to dig into why you get that error, on my end I get the desired output and dont see any glaring differences in the package versions:

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
                     nav = "UM",
                     find = "nwissite",
                     distance_km = 2,
                     no_sf = FALSE)
#> No data found for: nwissite?f=json&distance=2
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)


sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dataRetrieval_2.7.11.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3       pillar_1.7.0       compiler_4.1.0     highr_0.9         
#>  [5] class_7.3-20       R.methodsS3_1.8.2  R.utils_2.12.0     tools_4.1.0       
#>  [9] digest_0.6.29      jsonlite_1.8.0     evaluate_0.15      lifecycle_1.0.1   
#> [13] tibble_3.1.7       R.cache_0.15.0     pkgconfig_2.0.3    rlang_1.0.3       
#> [17] reprex_2.0.1       DBI_1.1.3          cli_3.3.0          rstudioapi_0.13   
#> [21] curl_4.3.2         yaml_2.3.5         xfun_0.31          fastmap_1.1.0     
#> [25] e1071_1.7-11       dplyr_1.0.9        withr_2.5.0        styler_1.7.0      
#> [29] stringr_1.4.0      httr_1.4.3         knitr_1.39         xml2_1.3.3        
#> [33] generics_0.1.3     fs_1.5.2           vctrs_0.4.1        tidyselect_1.1.2  
#> [37] grid_4.1.0         classInt_0.4-7     glue_1.6.2         sf_1.0-7          
#> [41] R6_2.5.1           fansi_1.0.3        rmarkdown_2.14     purrr_0.3.4       
#> [45] magrittr_2.0.3     units_0.8-0        ellipsis_0.3.2     htmltools_0.5.2   
#> [49] assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.2         proxy_0.4-27      
#> [53] stringi_1.7.6      crayon_1.5.1       R.oo_1.25.0

Created on 2022-07-11 by the reprex package (v2.0.1)

mikejohnson51 avatar Jul 11 '22 20:07 mikejohnson51

I was just testing this and I think I can reproduce both results. Initially I didn't have sf installed on my computer (I recently updated to R 4.2). Without sf, I get @mikejohnson51 's output:

> findNLDI(wqp = "TCEQMAIN-10016",
+          nav = "UM",
+          find = "nwissite",
+          distance_km = 2,
+          no_sf = FALSE)
            sourceName     identifier    comid                     name
1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
          X       Y
1 -101.3427 35.7422

When I installed sf, ran the same thing:

> xx = findNLDI(wqp = "TCEQMAIN-10016",
+               nav = "UM",
+               find = "nwissite",
+               distance_km = 2,
+               no_sf = FALSE)
 Error in if (sf::st_geometry_type(tmp)[1] == "POINT") { : 
missing value where TRUE/FALSE needed

ldecicco-USGS avatar Jul 13 '22 14:07 ldecicco-USGS

I have this on todays todo list. I have sf installed though so I think I'm missing something else 😭

Here is a reprex using when sf is installed locally:

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
                     nav = "UM",
                     find = "nwissite",
                     distance_km = 2,
                     no_sf = TRUE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE)
#> No data found for: nwissite?f=json&distance=2
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName        identifier comid name      X     Y            geometry
#>   <chr>             <chr>      <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Po… TCEQMAIN-… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)

Created on 2022-07-13 by the reprex package (v2.0.1)

Removing sf, running TRUE/FALSE:

library(dataRetrieval)
remove.packages('sf')
#> Removing package from '/Library/Frameworks/R.framework/Versions/4.2/Resources/library'
#> (as 'lib' is unspecified)
#> Error in find.package(pkgs, lib): there is no package called 'sf'
findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
              find = "nwissite",
              distance_km = 2,
              no_sf = TRUE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE)
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

Created on 2022-07-13 by the reprex package (v2.0.1)

Source of message:

https://github.com/USGS-R/dataRetrieval/blob/7182605f9f45574b768cd21bca376f5678688da6/R/findNLDI.R#L118

mikejohnson51 avatar Jul 13 '22 14:07 mikejohnson51

@mps9506 and @EthanGrahn can you try installing again and see if it looks better?

remotes::install_github("USGS-R/dataRetrieval")

ldecicco-USGS avatar Jul 14 '22 21:07 ldecicco-USGS

I think this works well. I hesitate to ask since I'm not doing any of the work here. But should findNLDI always return a list or is a sf/tbl ok?

Valid result:

findNLDI(wqp = "TCEQMAIN-10032",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = TRUE) |> 
  str()

# List of 2
#  $ origin     :'data.frame':	1 obs. of  6 variables:
#   ..$ sourceName: chr "Water Quality Portal"
#   ..$ identifier: chr "TCEQMAIN-10032"
#   ..$ comid     : chr "19972126"
#   ..$ name      : chr "CANADIAN RIVER AT US 60-83"
#   ..$ X         : num -100
#   ..$ Y         : num 35.9
#  $ UM_nwissite:'data.frame':	1 obs. of  8 variables:
#   ..$ sourceName: chr "NWIS Surface Water Sites"
#   ..$ identifier: chr "USGS-07228000"
#   ..$ comid     : chr "19972126"
#   ..$ measure   : num 79.5
#   ..$ reachcode : chr "11090106000027"
#   ..$ name      : chr "Canadian Rv nr Canadian, TX"
#   ..$ X         : num -100
#   ..$ Y         : num 35.9

compared to:

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = FALSE) |> 
  str()

# sf [1 × 7] (S3: sf/tbl_df/tbl/data.frame)
#  $ sourceName: chr "Water Quality Portal"
#  $ identifier: chr "TCEQMAIN-10016"
#  $ comid     : chr "19970652"
#  $ name      : chr "DIXON CREEK NE OF BORGER"
#  $ X         : num -101
#  $ Y         : num 35.7
#  $ geometry  :sfc_POINT of length 1; first list element:  'XY' num [1:2] -101.3 35.7
#  - attr(*, "sf_column")= chr "geometry"
#  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA
#   ..- attr(*, "names")= chr [1:6] "sourceName" "identifier" "comid" "name" ...
# Warning message:
# No data returned for: https://labs.waterdata.usgs.gov/api/nldi/linked-data/wqp/TCEQMAIN-10016/navigation/UM/nwissite?f=json&distance=2 

mps9506 avatar Jul 14 '22 21:07 mps9506

I think it probably should -- I've only ever worked with no_sf TRUE or FALSE in isolation though so never really thought about the difference. Good catch. What do you think @ldecicco-USGS ?

dblodgett-usgs avatar Jul 14 '22 23:07 dblodgett-usgs

This issue here is that in findNLDI single element returns are given as an object while multi element returns are given as a list. The default find option is flowline. Flowlines (and any non-point objects) can only be parsed when sf is active. So when sf is off, the retuned object is just the origin (single element). When its on you get the origin and upper mainstem (multi element).

mikejohnson51 avatar Jul 14 '22 23:07 mikejohnson51

Right -- but why not return a list with one element when you just have that origin to return?

dblodgett-usgs avatar Jul 14 '22 23:07 dblodgett-usgs

🤷 personal preference haha. I don't like the extra structure when its not needed. I'm happy to change it though if there consensus it makes more sense.

mikejohnson51 avatar Jul 15 '22 01:07 mikejohnson51

Sorry I didn't really voice an opinion on this one. I think @dblodgett-usgs 's suggestion of the list of one makes sense because then scripts won't break if you have one or many returns. @mikejohnson51 does that sound OK? (and more importantly, can you make that change?)

ldecicco-USGS avatar Oct 13 '22 18:10 ldecicco-USGS

@dblodgett-usgs @mikejohnson51 I'm hoping to do an update to dataRetrieval in the next couple of weeks. Should we switch this to a list of 1 element so that the user doesn't can assume a consistent type of return no matter what the size?

ldecicco-USGS avatar Oct 31 '22 17:10 ldecicco-USGS

I think that is the right approach, yeah.

dblodgett-usgs avatar Nov 01 '22 02:11 dblodgett-usgs

PR incoming 😃

library(dataRetrieval)

findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 20,
         no_sf = FALSE, 
         warn = FALSE)
#> $origin
#> Simple feature collection with 1 feature and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.3427 ymin: 35.7422 xmax: -101.3427 ymax: 35.7422
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 7
#>   sourceName           ident…¹ comid name      X     Y            geometry
#>   <chr>                <chr>   <chr> <chr> <dbl> <dbl>         <POINT [°]>
#> 1 Water Quality Portal TCEQMA… 1997… DIXO… -101.  35.7 (-101.3427 35.7422)
#> # … with abbreviated variable name ¹​identifier
#> 
#> $UM_nwissite
#> Simple feature collection with 1 feature and 8 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -101.351 ymin: 35.66476 xmax: -101.351 ymax: 35.66476
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 9
#>   sourceName               identifier    comid measure reach…¹ name      X     Y
#>   <chr>                    <chr>         <chr>   <dbl> <chr>   <chr> <dbl> <dbl>
#> 1 NWIS Surface Water Sites USGS-07227920 1997…    56.0 110901… Dixo… -101.  35.7
#> # … with 1 more variable: geometry <POINT [°]>, and abbreviated variable name
#> #   ¹​reachcode


findNLDI(wqp = "TCEQMAIN-10016",
         nav = "UM",
         find = "nwissite",
         distance_km = 2,
         no_sf = TRUE,
         warn = FALSE)
#> $origin
#>             sourceName     identifier    comid                     name
#> 1 Water Quality Portal TCEQMAIN-10016 19970652 DIXON CREEK NE OF BORGER
#>           X       Y
#> 1 -101.3427 35.7422

Created on 2022-11-01 by the reprex package (v2.0.1)

mikejohnson51 avatar Nov 01 '22 16:11 mikejohnson51