Slow GetMap requests waiting on catalog updates
GeoServer Cloud is being used with a data directory containing dangling references, e.g., layer groups pointing to layers that no longer exist (an erroneous situation that base GeoServer is tolerant of). In particular, the test is using a file-based data directory.
Now, GetMap requests in this situation can take seconds longer than on a base GeoServer, following this pattern:
- The GetMap request is workspace qualified, e.g., "geoserver/ws/wms?...." and has several layers and several styles, with style names being simple names referring to styles in the workspace, e.g. `styles=myStyle`, and the style exists in GeoServer as `ws:myStyle`. In other words, the style exists where it should be.
- The WMSWorkspaceQualifier callback tries to qualify the style, turning it into "ws:myStyle" and passing it down to the catalog to verify it's there
- On the way to the facade, the LocalWorkspaceCatalog intercepts the request and turns it into catalog.getStyleByName(WorkspaceInfo, "ws:myStyle"), hence basically doubly-qualifying the reference
- Inside the eventually consistent catalog facade, the lookup is performed and returns nothing. The code checks whether the catalog facade is consistent, finds there are pending operations, and performs a few retries after 10, 25 and 50 ms (85 ms total; see the sketch after this list). In the end, the qualified lookup fails and the un-qualified one remains (which works, because LocalWorkspaceCatalog will eventually qualify it later)
- With 10-ish styles, this results in an 850 ms delay on responses
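To make the arithmetic concrete, here is a minimal, self-contained sketch of the retry-on-null pattern described above. The class and method names are made up for illustration (this is not the actual EventuallyConsistentCatalogFacade code); the 10/25/50 ms delays are the ones observed above.

```java
// Hypothetical sketch of the retry-on-null behavior described above.
// Names are invented; only the delays (10, 25, 50 ms) come from the observed pattern.
import java.util.List;
import java.util.function.Supplier;

public class RetryOnNullSketch {

    // 10 + 25 + 50 = 85 ms spent per lookup that never succeeds
    private static final List<Long> DELAYS_MS = List.of(10L, 25L, 50L);

    static <T> T retryOnNull(Supplier<T> query, Supplier<Boolean> isConverged) {
        T result = query.get();
        if (result != null || isConverged.get()) {
            return result; // found it, or there are no pending operations to wait for
        }
        for (long delay : DELAYS_MS) {
            sleep(delay);
            result = query.get();
            if (result != null) return result;
        }
        return null; // gave up: with a dangling reference this is always the outcome
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 10 styles that never resolve -> roughly 10 * 85 ms = 850 ms of added latency
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            retryOnNull(() -> null, () -> false);
        }
        System.out.println("wasted ~" + (System.currentTimeMillis() - start) + " ms");
    }
}
```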
Now, why does the facade think it's not consistent? Because it's full of pending operations that cannot be run, due to the layer groups pointing at non-existing layers. This cannot be resolved.
I found that the data dir loader populates a facade successfully, and it's the synch operation that generates all the pending operations. Thinking out loud: when reading a data directory from disk, shouldn't dangling references be pruned before the synch? In this particular case, the layer references will just never be there.
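To illustrate the pruning idea, here is a hedged sketch using simplified stand-in types (not the real GeoServer catalog API): after parsing the data directory but before the synch, layer-group entries whose layers were never found are dropped and logged.

```java
// Sketch only: ParsedCatalog and LayerGroup are made-up stand-ins for whatever
// the data dir loader produces before handing objects to the synch step.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DanglingReferencePruner {

    record LayerGroup(String name, List<String> layerIds) {}

    record ParsedCatalog(Set<String> layerIds, List<LayerGroup> layerGroups) {}

    /** Returns a copy of the parsed catalog where layer groups only reference layers that exist. */
    static ParsedCatalog prune(ParsedCatalog parsed) {
        List<LayerGroup> cleaned = parsed.layerGroups().stream()
                .map(lg -> {
                    List<String> existing = lg.layerIds().stream()
                            .filter(parsed.layerIds()::contains)
                            .collect(Collectors.toList());
                    int dropped = lg.layerIds().size() - existing.size();
                    if (dropped > 0) {
                        // these references will never resolve: log them instead of queuing them forever
                        System.err.printf("layer group %s: dropped %d dangling layer reference(s)%n",
                                lg.name(), dropped);
                    }
                    return new LayerGroup(lg.name(), existing);
                })
                .collect(Collectors.toList());
        return new ParsedCatalog(parsed.layerIds(), cleaned);
    }

    public static void main(String[] args) {
        ParsedCatalog parsed = new ParsedCatalog(
                Set.of("roads", "rivers"),
                List.of(new LayerGroup("base", List.of("roads", "rivers", "deleted-layer"))));
        System.out.println(prune(parsed).layerGroups());
    }
}
```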
> Now, why does the facade think it's not consistent? Because it's full of pending operations that cannot be run, due to the layer groups pointing at non-existing layers. This cannot be resolved.
Good, we can probably shut off the consistency check (the queuing of pending dependent objects) during catalog load then. But it would have to apply only to the objects being loaded, in case a remote event comes in at the same time while the pod is being launched (e.g. when scaling out).
Just thinking out loud though, I would need to work on a test case to reproduce it.
Ha, replied before even getting to the end of your comment, where you were "thinking out loud" too.
Perhaps an easy way is to use a dedicated synch method? If memory serves me right (not 100% sure), the code currently synchs by executing loops of add operations, as found in RepositoryCatalogFacadeImpl#synchTo, each one becoming a pending operation.
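To make the "dedicated synch method" idea a bit more concrete, here is a rough, hypothetical sketch with simplified types (not the actual RepositoryCatalogFacadeImpl): a bulk-load path that trusts the locally parsed objects and bypasses the pending-operation queue, which would stay in place for remote events only.

```java
// Purely illustrative: CatalogObject and Facade are stand-ins, and the methods
// below only model the difference between the two code paths discussed above.
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class BulkSynchSketch {

    /** Stand-in for a catalog object parsed from the data directory. */
    record CatalogObject(String id, boolean dependenciesPresent) {}

    /** Stand-in for the target facade being populated. */
    interface Facade {
        void add(CatalogObject info);
    }

    // operations deferred because their dependencies are not (yet) in the facade
    private final Queue<CatalogObject> pending = new ConcurrentLinkedQueue<>();

    /** Roughly the behavior described above: an add with missing dependencies gets queued. */
    void addWithConsistencyCheck(Facade target, CatalogObject info) {
        if (info.dependenciesPresent()) {
            target.add(info);
        } else {
            pending.add(info); // with a dangling reference this entry never drains
        }
    }

    /** Hypothetical dedicated synch: the local load is trusted, nothing is queued. */
    void bulkSynch(Facade target, List<CatalogObject> loadedFromDataDirectory) {
        loadedFromDataDirectory.forEach(target::add);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        BulkSynchSketch sketch = new BulkSynchSketch();
        Facade facade = info -> System.out.println("added " + info.id());
        sketch.addWithConsistencyCheck(facade, new CatalogObject("lg-with-dangling-ref", false));
        sketch.bulkSynch(facade, List.of(new CatalogObject("roads", true)));
        System.out.println("pending after load: " + sketch.pendingCount());
    }
}
```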
@aaime meanwhile, as a workaround, since you mentioned your use case is a static catalog, you can try disabling the functionality by setting `geoserver.backend.data-directory.eventual-consistency.enabled=false` (or the equivalent env variable `GEOSERVER_BACKEND_DATA_DIRECTORY_EVENTUAL_CONSISTENCY_ENABLED=false`).
If you're testing with the `compose/datadir` script, editing `catalog-datadir.yml` like this would do:
```diff
diff --git a/compose/catalog-datadir.yml b/compose/catalog-datadir.yml
index 3e64737ebe..9f2a754e11 100644
--- a/compose/catalog-datadir.yml
+++ b/compose/catalog-datadir.yml
@@ -24,6 +24,7 @@ x-gs-dependencies: &gs-dependencies
 x-geoserver-env: &geoserver_environment
   SPRING_PROFILES_ACTIVE: "${GEOSERVER_DEFAULT_PROFILES},datadir"
+  GEOSERVER_BACKEND_DATA_DIRECTORY_EVENTUAL_CONSISTENCY_ENABLED: false
 
 services:
   init-datadir:
```
here's the externalized configuration: https://github.com/geoserver/geoserver-cloud-config/blob/master/geoserver.yml#L78-L80
When using the development compose files (e.g. `cd compose && ./datadir up -d`), you can first edit `./config/geoserver.yml` to change the config. Since with docker compose the services fetch their config from the config server, that's all you need. Just remember to run `git submodule update --init` if you didn't clone with `--recurse-submodules`.
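For reference, a minimal sketch of what the entry in that file could look like, assuming the YAML nesting simply mirrors the dotted property name above (the actual layout of `geoserver.yml` may differ, see the linked lines):

```yaml
# Hypothetical excerpt for compose/config/geoserver.yml; keys assumed from the
# dotted property name geoserver.backend.data-directory.eventual-consistency.enabled
geoserver:
  backend:
    data-directory:
      eventual-consistency:
        enabled: false
```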
Thanks for that. I have two more bits to discuss.
Pending queue is never cleaned up
In terms of a fix for the eventually consistent catalog... as far as I can see, the code reacts to new catalog events and grabs the opportunity to resolve some of the pending operations. This also happens when the catalog is queried, e.g.:
```
at org.geoserver.cloud.catalog.backend.datadir.EventualConsistencyEnforcer.tryResolvePending(EventualConsistencyEnforcer.java:130)
at org.geoserver.cloud.catalog.backend.datadir.EventualConsistencyEnforcer.resolvePending(EventualConsistencyEnforcer.java:119)
at org.geoserver.cloud.catalog.backend.datadir.EventualConsistencyEnforcer.forceResolve(EventualConsistencyEnforcer.java:110)
at org.geoserver.cloud.catalog.backend.datadir.EventuallyConsistentCatalogFacade.doRetry(EventuallyConsistentCatalogFacade.java:345)
at org.geoserver.cloud.catalog.backend.datadir.EventuallyConsistentCatalogFacade.retry(EventuallyConsistentCatalogFacade.java:323)
at org.geoserver.cloud.catalog.backend.datadir.EventuallyConsistentCatalogFacade.retryOnNull(EventuallyConsistentCatalogFacade.java:312)
at org.geoserver.cloud.catalog.backend.datadir.EventuallyConsistentCatalogFacade.getStyleByName(EventuallyConsistentCatalogFacade.java:306)
at org.geoserver.catalog.plugin.IsolatedCatalogFacade.getStyleByName(IsolatedCatalogFacade.java:424)
```
Now... this code is waiting for Godot: in a data directory that has dangling references (not that uncommon in long-lived data dirs), the target reference will just never be found. As the code is written, it seems willing to wait up to 85 ms for something missing to show up, and yet the list of pending catalog actions has no such limit. IMHO the code should be adapted so that if a certain operation has been in the queue for more than a maximum time, the references it's waiting for are assumed to never show up, period. What about recording each operation's creation time, and then cleaning up the ones that are too old, marking them as lost updates in the logs?
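A hedged sketch of that proposal (hypothetical names, not the existing EventualConsistencyEnforcer): each deferred operation records when it was queued, and the resolve pass evicts anything older than a configurable maximum age, logging it as a lost update.

```java
// Illustrative only: the class, its fields, and the Duration-based policy are
// assumptions; the real enforcer may organize its pending operations differently.
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ExpiringPendingQueue {

    record PendingOp(String description, Runnable operation, Instant queuedAt) {}

    private final Queue<PendingOp> pending = new ConcurrentLinkedQueue<>();
    private final Duration maxAge;

    ExpiringPendingQueue(Duration maxAge) {
        this.maxAge = maxAge;
    }

    void defer(String description, Runnable operation) {
        pending.add(new PendingOp(description, operation, Instant.now()));
    }

    /** Called whenever new events arrive: expire what is too old, retry what can run. */
    void resolveOrExpire() {
        Instant cutoff = Instant.now().minus(maxAge);
        for (Iterator<PendingOp> it = pending.iterator(); it.hasNext(); ) {
            PendingOp op = it.next();
            if (op.queuedAt().isBefore(cutoff)) {
                // the reference this operation waits for is never going to show up
                System.err.println("Discarding lost update (older than " + maxAge + "): " + op.description());
                it.remove();
            }
            // else: op.operation().run() could be attempted here if its dependencies are now satisfied
        }
    }

    boolean isConverged() {
        return pending.isEmpty(); // an empty queue means reads no longer need to retry
    }

    public static void main(String[] args) {
        ExpiringPendingQueue queue = new ExpiringPendingQueue(Duration.ofSeconds(10));
        queue.defer("add layer group 'base'", () -> {});
        queue.resolveOrExpire(); // too young to expire, stays pending
        System.out.println("converged: " + queue.isConverged());
    }
}
```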
What about pgconfig?
I don't see an eventually consistent behavior in pgconfig, did I miss it? I'm guessing it works on a different principle: if something is missing it will just be queried again, and at worst it may be using a stale in-memory cache and return information that's not up to date?