ows4R
WFS paging and parallelization support
Hi @eblondel ,
I have been trying out ows4R to query biological occurrence data from EMODnet-Biology.
In this example below, I requested:
- Dataset: The CPR survey (https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=216)
- Geographical: North Sea (https://marineregions.org/gazetteer.php?p=details&id=2350)
- Taxon: Calanus finmarchicus (https://www.marinespecies.org/aphia.php?p=taxdetails&id=104464)
I got a WFS request using the EMODnet-Biology download toolbox (at the end of the selection, you can copy the WFS request in "Get webservice url")
The good news is that viewParams via vendor params works like a charm! (Although I have to watch out for the encoding: https://github.com/lifewatch/eurobis/issues/15#issuecomment-1081925137)
However, I am having trouble with the paging and parallel options. After some debugging, I think the issue might be that ows4R relies on a parameter named numberMatched when using resultType = "hits", at: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L240
This parameter is not being returned by geo.vliz.be (it should happen around: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L291)
Could you have a look and see what is happening?
Thanks a lot!
# Example get CPR dataset, North Sea and Calanus finmarchicus
library(ows4R)
library(parallel)
# URL as provided by download toolbox
url_download_toolbox <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aeurobis-obisenv_basic&resultType=results&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&propertyName=datasetid%2Cdatecollected%2Cdecimallatitude%2Cdecimallongitude%2Ccoordinateuncertaintyinmeters%2Cscientificname%2Caphiaid%2Cscientificnameaccepted&outputFormat=csv"
URLdecode(url_download_toolbox)
#> [1] "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal:eurobis-obisenv_basic&resultType=results&viewParams=where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464&propertyName=datasetid,datecollected,decimallatitude,decimallongitude,coordinateuncertaintyinmeters,scientificname,aphiaid,scientificnameaccepted&outputFormat=csv"
# Only params
params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"
URLdecode(params)
#> [1] "where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464"
# Create wfs client and find feature
wfs <- WFSClient$
new("https://geo.vliz.be/geoserver/Dataportal/wfs", "1.1.0", logger = "INFO")$
getCapabilities()$
findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")
#> [ows4R][INFO] OWSGetCapabilities - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&request=GetCapabilities
# Create cluster
cl <- makeCluster(detectCores() - 1)
# Perform tests: around 20K rows
system.time(feature_only_viewparams <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSDescribeFeatureType - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&request=DescribeFeatureType
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#> user system elapsed
#> 0.990 0.100 3.712
system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resulttype=hits&request=GetFeature
#> Error in seq.default(from = 0, to = numberMatched, by = paging_length): 'to' must be of length 1
#> Timing stopped at: 0.09 0.001 0.678
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results",
parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#> user system elapsed
#> 0.986 0.088 3.429
# Debugging pagination
nft <- wfs$getFeatures(viewParams = params, resultType="hits")
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=hits&request=GetFeature
names(nft)
#> [1] "numberOfFeatures" "timeStamp"
"numberMatched" %in% names(nft)
#> [1] FALSE
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] httr_1.4.2 reprex_2.0.1 ows4R_0.2-1 keyring_1.3.0 geometa_0.6-6
#>
#> loaded via a namespace (and not attached):
#> [1] tinytex_0.35 tidyselect_1.1.1 xfun_0.28 purrr_0.3.4
#> [5] sf_0.9-4 lattice_0.20-41 vctrs_0.3.8 generics_0.1.0
#> [9] htmltools_0.5.0 yaml_2.2.1 utf8_1.2.2 XML_3.99-0.3
#> [13] rlang_0.4.11 e1071_1.7-3 pillar_1.6.3 glue_1.4.2
#> [17] withr_2.4.2 DBI_1.1.1 bit64_4.0.5 sp_1.4-6
#> [21] lifecycle_1.0.1 evaluate_0.14 knitr_1.29 tzdb_0.1.2
#> [25] callr_3.7.0 ps_1.6.0 curl_4.3 class_7.3-17
#> [29] fansi_0.5.0 highr_0.8 Rcpp_1.0.7 readr_2.0.2
#> [33] KernSmooth_2.23-17 openssl_1.4.2 classInt_0.4-3 vroom_1.5.5
#> [37] jsonlite_1.7.0 bit_4.0.4 fs_1.5.0 hms_1.1.1
#> [41] askpass_1.1 digest_0.6.25 processx_3.5.2 dplyr_1.0.7
#> [45] grid_3.6.3 rgdal_1.5-12 cli_3.0.1 tools_3.6.3
#> [49] magrittr_2.0.1 tibble_3.1.5 crayon_1.4.1 pkgconfig_2.0.3
#> [53] ellipsis_0.3.2 assertthat_0.2.1 rmarkdown_2.11 rstudioapi_0.13
#> [57] R6_2.5.1 units_0.6-7 compiler_3.6.3
Created on 2022-03-29 by the reprex package (v2.0.1)
This issue partly follows up #29
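For readers following the debugging above: the WFS 1.1.0 hits response exposes `numberOfFeatures`, while the WFS 2.0.0 response shown later in this thread carries `numberMatched`. A missing `numberMatched` would plausibly explain the `'to' must be of length 1` error, since `seq()` would then receive an empty value. Below is a minimal sketch of a version-tolerant hit count, using only base R. The function name `hit_count` and the regex-based parsing are illustrative assumptions, not how ows4R does it.

```r
# Read the total hit count from the text of a resultType=hits response.
# Assumption: xml_text is a single string holding the whole response.
# WFS 2.0.0 reports the count as numberMatched, WFS 1.1.0 as numberOfFeatures.
hit_count <- function(xml_text) {
  for (attr_name in c("numberMatched", "numberOfFeatures")) {
    pattern <- paste0(attr_name, "=\"([0-9]+)\"")
    m <- regmatches(xml_text, regexec(pattern, xml_text))[[1]]
    # m holds the full match and the captured digits when the attribute exists
    if (length(m) == 2) return(as.numeric(m[2]))
  }
  stop("no numberMatched or numberOfFeatures attribute in response")
}

# Shortened 2.0.0-style response (attributes taken from this thread)
hits_v2 <- '<wfs:FeatureCollection numberMatched="408603" numberReturned="0"/>'
# Shortened 1.1.0-style response (the count 123 is made up for illustration)
hits_v1 <- '<wfs:FeatureCollection numberOfFeatures="123" timeStamp="x"/>'

hit_count(hits_v2)
hit_count(hits_v1)
```

Failing loudly when neither attribute is present gives a clearer message than letting an empty value reach `seq()`.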
@salvafern make sure to use WFS version 2.0.0; AFAIK pagination in WFS is only supported in WFS 2.0, and I see you used 1.1.0.
Try with setting version 2.0.0 like this:
wfs <- WFSClient$
new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
getCapabilities()$
findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")
params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"
#with pagination
system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))
I just tested the pagination and it worked.
Indeed now it works, thanks a lot! I was using v1.1.0 to copy what the download toolbox did, but I guess there's no harm in using v2.0.0
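As a rough illustration of what paging amounts to: once the hit count is known, the request can be split into pages using the standard WFS 2.0.0 `startIndex` and `count` parameters. The sketch below is base R only; the helper names `page_offsets` and `page_query` are hypothetical, and it stops at `number_matched - 1` to avoid a trailing empty page, which may differ from what ows4R does internally.

```r
# Compute the zero-based start offset of each page.
page_offsets <- function(number_matched, paging_length) {
  stopifnot(length(number_matched) == 1, is.finite(number_matched),
            paging_length > 0)
  if (number_matched == 0) return(numeric(0))
  seq(from = 0, to = number_matched - 1, by = paging_length)
}

# Build the paging part of a WFS 2.0.0 GetFeature query string.
page_query <- function(offset, paging_length) {
  sprintf("startIndex=%d&count=%d",
          as.integer(offset), as.integer(paging_length))
}

# Hit count taken from the 2.0.0 response quoted later in this thread
offsets <- page_offsets(408603, 10000)
length(offsets)                 # 41 pages
page_query(offsets[2], 10000)   # "startIndex=10000&count=10000"
```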
I have now also tried the parallel options:
Using parallelization and pagination together
Probably I'm doing something wrong. I expected that multiple requests would be done for each chunk, but I just ran into an error.
library(ows4R)
library(parallel)
wfs <- WFSClient$
new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
getCapabilities()$
findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")
# Querying dataset: https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=8020
# ~500K rows
params <- "where%3Adatasetid+IN+%288020%29"
# With pagination and parallelization
cl <- makeCluster(detectCores() - 1)
cl
#> socket cluster with 15 nodes on host ‘localhost’
debug(wfs$getFeatures)
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results",
paging = TRUE, paging_length = 10000,
parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> No layers in datasource.
#> Timing stopped at: 0.023 0 11.45
Via debug(), I can see that at some point a request of type 'hits' is read with sf::st_read(), which of course fails. This happens at https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L328
The response in destfile looks like:
<?xml version="1.0" encoding="UTF-8"?>
<wfs:FeatureCollection
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fes="http://www.opengis.net/fes/2.0"
xmlns:wfs="http://www.opengis.net/wfs/2.0"
xmlns:gml="http://www.opengis.net/gml/3.2"
xmlns:ows="http://www.opengis.net/ows/1.1"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" numberMatched="408603" numberReturned="0" timeStamp="2022-03-31T07:57:57.251Z" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd"/>
Using only parallelization
I tried comparing no parallelization vs. parallelization with mclapply and parLapply, but I'm not seeing any improvement in performance. Perhaps it needs pagination as well?
# Neither pagination nor parallelization
system.time(feature <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature
#> user system elapsed
#> 26.718 2.080 67.476
# Parallelization parLapply
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results",
parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature
#> user system elapsed
#> 27.457 2.477 65.883
# Parallelization mclapply
system.time(feature_parallel2 <- wfs$getFeatures(viewParams = params, resultType="results",
parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature
#> user system elapsed
#> 26.226 2.274 63.895
Many thanks again for the help! Let me know if there is anything I can do.
Yes, it sounds like there are issues with the parallelization; I will have a look asap.
If you want to use the cluster approach, you can use this handler: parallel::parLapply, which works with a cluster. mclapply apparently can't work because I didn't allow specifying the extra args needed for this handler.
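To make the handler difference concrete: `parallel::parLapply` takes the cluster as its first argument, whereas `parallel::mclapply` forks processes and takes `mc.cores` rather than a cluster. The sketch below dispatches a list of page offsets to either handler. `run_pages` and `fake_fetch` are hypothetical names for illustration and do not reflect the ows4R internals. It also suggests why the timings without paging barely differ: a single request gives a handler nothing to split.

```r
library(parallel)

# fetch_page is a hypothetical function(offset) that downloads one page.
run_pages <- function(offsets, fetch_page, cl = NULL, mc.cores = 1L) {
  if (!is.null(cl)) {
    # Socket cluster: parLapply expects the cluster as its first argument
    parLapply(cl, offsets, fetch_page)
  } else if (mc.cores > 1L) {
    # Forking: mclapply takes mc.cores, not a cluster (unsupported on Windows)
    mclapply(offsets, fetch_page, mc.cores = mc.cores)
  } else {
    # Sequential fallback
    lapply(offsets, fetch_page)
  }
}

# Stand-in for a real page download: returns a one-row data frame
fake_fetch <- function(offset) data.frame(startIndex = offset)

pages <- run_pages(c(0, 10000, 20000), fake_fetch)
result <- do.call(rbind, pages)
```

With a socket cluster created by `makeCluster()`, the same call would become `run_pages(offsets, fake_fetch, cl = cl)`; any packages the fetch function needs would have to be loaded on the workers first.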
I got the same error :(
feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results",
paging = TRUE, paging_length = 10000,
parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl)
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> No layers in datasource.
@salvafern I haven't forgotten this. I have started working on it, but I'm still looking into the best way to fix the parallel handlers.