WFS paging and parallelization support

Open salvafern opened this issue 2 years ago • 7 comments

Hi @eblondel ,

I have been trying out ows4R to query biological occurrence data from EMODnet-Biology.

In this example below, I requested:

  • Dataset: The CPR survey (https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=216)
  • Geographical: North Sea (https://marineregions.org/gazetteer.php?p=details&id=2350)
  • Taxon: Calanus finmarchicus (https://www.marinespecies.org/aphia.php?p=taxdetails&id=104464)

I obtained a WFS request from the EMODnet-Biology download toolbox (at the end of the selection, you can copy the WFS request under "Get webservice url").

The good news is that viewParams passed as vendor params work like a charm! (Although I have to watch out for the encoding: https://github.com/lifewatch/eurobis/issues/15#issuecomment-1081925137)
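For reference, the encoded viewParams string used in the example further down can be reproduced from a readable query. This is a minimal sketch in base R, assuming the server expects spaces written as "+"; encode_view_params is a hypothetical helper, and the readable query is simply the decoded form of that string with "+" read as a space.

```r
# Hypothetical helper: percent-encode a viewParams value, writing spaces as "+".
encode_view_params <- function(x) {
  # reserved = TRUE also encodes ":", ";", "(", ")", "[", "]" and "&"
  encoded <- utils::URLencode(x, reserved = TRUE)
  # URLencode() writes spaces as %20; switch them to "+" (assumption about the server)
  gsub("%20", "+", encoded, fixed = TRUE)
}

raw_params <- "where:((up.geoobjectsids && ARRAY[2350])) AND datasetid IN (216);context:0100;aphiaid:104464"
params <- encode_view_params(raw_params)
print(params)
```

Encoding the value once, before it is handed to the client, avoids the double-encoding pitfalls discussed in the linked comment.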

However, I am having trouble with the paging and parallel options. After some debugging, I think the issue might be that ows4R relies on an attribute named numberMatched when requesting resultType = "hits" at: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L240

And this attribute is not being returned by geo.vliz.be (this should happen around: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L291)

Could you have a look and see what is happening?

Thanks a lot!

# Example: get the CPR dataset, North Sea, Calanus finmarchicus
library(ows4R)    # WFSClient
library(parallel) # makeCluster(), detectCores()


# URL as provided by download toolbox
url_download_toolbox <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aeurobis-obisenv_basic&resultType=results&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&propertyName=datasetid%2Cdatecollected%2Cdecimallatitude%2Cdecimallongitude%2Ccoordinateuncertaintyinmeters%2Cscientificname%2Caphiaid%2Cscientificnameaccepted&outputFormat=csv"
#> [1] "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal:eurobis-obisenv_basic&resultType=results&viewParams=where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464&propertyName=datasetid,datecollected,decimallatitude,decimallongitude,coordinateuncertaintyinmeters,scientificname,aphiaid,scientificnameaccepted&outputFormat=csv"

# Only params
params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"
#> [1] "where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464"

# Create wfs client and find feature
wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "1.1.0", logger = "INFO")$
#> [ows4R][INFO] OWSGetCapabilities - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&request=GetCapabilities

# Create cluster
cl <- makeCluster(detectCores() - 1)

# Perform tests: around 20K rows
system.time(feature_only_viewparams <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSDescribeFeatureType - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&request=DescribeFeatureType 
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.990   0.100   3.712

system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resulttype=hits&request=GetFeature
#> Error in seq.default(from = 0, to = numberMatched, by = paging_length): 'to' must be of length 1
#> Timing stopped at: 0.09 0.001 0.678

system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.986   0.088   3.429

# Debugging pagination
nft <- wfs$getFeatures(viewParams = params, resultType="hits")
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=hits&request=GetFeature
#> [1] "numberOfFeatures" "timeStamp"

"numberMatched" %in% names(nft)
#> [1] FALSE
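
The two attribute names come from the two protocol versions: a WFS 1.1.0 hits response carries numberOfFeatures, while a WFS 2.0.0 one carries numberMatched. A version-tolerant lookup could be sketched as follows; get_hit_count is a hypothetical helper, not part of ows4R, and the count in the first example is made up for illustration.

```r
# Hypothetical helper: read the feature count from the attributes of a
# resultType=hits response, whichever WFS version produced it.
get_hit_count <- function(attrs) {
  # WFS 2.0.0 reports numberMatched; WFS 1.1.0 reports numberOfFeatures
  for (key in c("numberMatched", "numberOfFeatures")) {
    if (key %in% names(attrs)) {
      return(as.integer(attrs[[key]]))
    }
  }
  stop("no feature count attribute found in hits response")
}

# Illustrative attribute sets (the 1.1.0 count is invented)
hits_110 <- c(numberOfFeatures = "20000", timeStamp = "2022-03-29T15:03:00Z")
hits_200 <- c(numberMatched = "408603", numberReturned = "0")

get_hit_count(hits_110)
get_hit_count(hits_200)
```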

This issue partly follows up #29

salvafern avatar Mar 29 '22 15:03 salvafern

@salvafern make sure to use WFS version 2.0.0; AFAIK pagination is only supported in WFS 2.0, and I see you used 1.1.0.

eblondel avatar Mar 30 '22 14:03 eblondel

Try setting the version to 2.0.0, like this:

wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$

params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"

# With pagination
system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))

I just tested the pagination and it worked.
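
The error message from the 1.1.0 attempt shows the page offsets being built with seq(from = 0, to = numberMatched, by = paging_length), which fails when numberMatched is missing. A rough sketch of that step in base R follows. page_offsets and page_query are hypothetical names, startIndex and count are the standard WFS 2.0.0 paging parameters, the feature count is made up, and the upper bound here is numberMatched - 1 (a deliberate deviation, to avoid a trailing empty page).

```r
# Hypothetical helper: zero-based start offsets, one per page.
page_offsets <- function(number_matched, paging_length) {
  # A missing count (NULL) is what triggers "'to' must be of length 1" in seq()
  if (length(number_matched) != 1L) {
    stop("hits response did not provide a single feature count")
  }
  if (number_matched == 0) return(integer(0))
  as.integer(seq(from = 0, to = number_matched - 1, by = paging_length))
}

# Hypothetical helper: the paging part of each GetFeature query string.
page_query <- function(offsets, paging_length) {
  sprintf("startIndex=%d&count=%d", offsets, as.integer(paging_length))
}

offsets <- page_offsets(2500, 1000)  # illustrative count
print(offsets)
print(page_query(offsets, 1000))
```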

eblondel avatar Mar 30 '22 19:03 eblondel

Indeed now it works, thanks a lot! I was using v1.1.0 to copy what the download toolbox did, but I guess there's no harm in using v2.0.0

I have now also tried the parallel options:

Using parallelization and pagination together

Probably I'm doing something wrong. I expected that multiple requests would be done for each chunk, but I just ran into an error.


wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$

# Querying dataset: https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=8020
# ~500K rows
params <- "where%3Adatasetid+IN+%288020%29"

# With pagination and parallelization
cl <- makeCluster(detectCores() - 1)
#> socket cluster with 15 nodes on host ‘localhost’

system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                paging = TRUE, paging_length = 10000,
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.
#> Timing stopped at: 0.023 0 11.45

Via debug() I can see that at some point the response to a request of type 'hits' is read with sf::st_read(), which of course fails. This happens at https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L328

The response in destfile looks like

<?xml version="1.0" encoding="UTF-8"?>
<wfs:FeatureCollection xmlns:wfs="http://www.opengis.net/wfs/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" numberMatched="408603" numberReturned="0" timeStamp="2022-03-31T07:57:57.251Z" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd"/>
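
A response like this has no feature members, so there is nothing for sf::st_read() to open. One way to guard against it, sketched in base R under the assumption that the response text is available as a string and comes from WFS 2.0.0: read the numberReturned attribute and skip the read when it is 0. The helper names are hypothetical and the sample XML is abbreviated.

```r
# Hypothetical helper: value of an attribute in the XML text, or NA if absent.
xml_attr_value <- function(xml_text, name) {
  pattern <- sprintf('[[:space:]]%s="([^"]*)"', name)
  m <- regmatches(xml_text, regexec(pattern, xml_text))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical helper: TRUE only when the response announces at least one feature.
has_features <- function(xml_text) {
  returned <- xml_attr_value(xml_text, "numberReturned")
  !is.na(returned) && as.integer(returned) > 0L
}

# Abbreviated sample of a hits response
hits_xml <- '<wfs:FeatureCollection xmlns:wfs="http://www.opengis.net/wfs/2.0" numberMatched="408603" numberReturned="0" timeStamp="2022-03-31T07:57:57.251Z"/>'

xml_attr_value(hits_xml, "numberMatched")
has_features(hits_xml)
```

A results response that happens to contain zero features is treated the same way, which seems reasonable since it has no layer to read either.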

Using only parallelization

I compared a run without parallelization against parallelized runs with mclapply and parLapply, but I'm not seeing any improvement in performance. Perhaps it needs pagination as well?
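
A likely explanation, offered as an assumption rather than a confirmed description of ows4R: a parallel handler can only spread work across several requests, and without paging there is a single request. The sketch below uses a stand-in fetch_page function (hypothetical, no network access) to show that the handler receives one task per page.

```r
# Stand-in for a paged GetFeature call: returns the row ids of one page.
fetch_page <- function(offset, page_size, total) {
  last <- min(offset + page_size, total)
  data.frame(id = seq(offset + 1, last))
}

# Apply any lapply-like handler over the page offsets and bind the pages.
fetch_all <- function(total, page_size, handler = lapply) {
  offsets <- seq(0, total - 1, by = page_size)
  pages <- handler(offsets, fetch_page, page_size = page_size, total = total)
  list(n_tasks = length(offsets), data = do.call(rbind, pages))
}

paged <- fetch_all(total = 25, page_size = 10)    # several tasks to distribute
unpaged <- fetch_all(total = 25, page_size = 25)  # one task, nothing to distribute

print(c(paged = paged$n_tasks, unpaged = unpaged$n_tasks))
print(nrow(paged$data))
```

With a cluster, the handler argument could be a wrapper such as function(X, FUN, ...) parallel::parLapply(cl, X, FUN, ...).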

# Neither pagination nor parallelization
system.time(feature <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.718   2.080  67.476

# Parallelization parLapply
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 27.457   2.477  65.883

# Parallelization mclapply
system.time(feature_parallel2 <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                 parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.226   2.274  63.895 

Many thanks again for the help! Let me know if there is anything I can do.

salvafern avatar Mar 31 '22 08:03 salvafern

Yes, it sounds like there are issues with the parallelization. I will have a look asap.

eblondel avatar Mar 31 '22 09:03 eblondel

If you want to use the cluster approach, you can use the handler parallel::parLapply, which works with a cluster. mclapply apparently can't work because I didn't allow specifying the extra args this handler needs.
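
The difference comes down to the two call signatures in the parallel package: parLapply() takes the cluster as its first argument, while mclapply() takes no cluster and is tuned through arguments such as mc.cores. A small adapter can hide that difference; as_page_handler is a hypothetical name, and this is only a sketch, not how ows4R handles it.

```r
library(parallel)

# Hypothetical adapter: returns a function with an lapply-like signature
# (X, FUN, ...) regardless of which backend is used.
as_page_handler <- function(cl = NULL, cores = 2L) {
  if (!is.null(cl)) {
    # Cluster-based backend: the cluster object goes first
    function(X, FUN, ...) parLapply(cl, X, FUN, ...)
  } else {
    # Fork-based backend: no cluster; forking is unavailable on Windows,
    # so fall back to a single core there
    if (.Platform$OS.type == "windows") cores <- 1L
    function(X, FUN, ...) mclapply(X, FUN, ..., mc.cores = cores)
  }
}

handler <- as_page_handler()
squares <- handler(1:4, function(x) x^2)
print(unlist(squares))
```

Passing a cluster created with makeCluster() as cl would select the parLapply() branch instead.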

eblondel avatar Mar 31 '22 13:03 eblondel

I got the same error :(

feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                    paging = TRUE, paging_length = 10000,
                                    parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl)
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.

salvafern avatar Mar 31 '22 14:03 salvafern

@salvafern I haven't forgotten this. I started working on it, but I'm still looking into the best way to fix the parallel handlers.

eblondel avatar Apr 08 '22 15:04 eblondel