Review and hardening of harvesting engine
Introduction
The harvesting engine was recently introduced in the master branch and will be delivered with GN 4.x. This service is in charge of harvesting resources from remote services, based on several configuration options driving its scheduling, the ingestion logic, and the management of updates.
It relies on several Celery mechanisms to orchestrate and perform the tasks. It also supports the implementation of custom harvesters, beyond the ones shipped with GeoNode (WMS, GeoNode, ArcGIS Server).
It's also the engine behind the Remote Services. With the introduction of the harvesting engine, these have become a simplified interface on top of WMS, GeoNode, and ArcGIS harvester instances.
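For illustration, here is a minimal sketch of what a custom harvester worker could look like. The base class path, the overridden method names and the return types are assumptions modelled on the built-in WMS/GeoNode/ArcGIS harvesters, so they may not match the actual GeoNode API exactly:

```python
# Minimal sketch of a custom harvester worker. The base class path, the
# overridden method names and the return types below are assumptions modelled
# on the built-in WMS/GeoNode/ArcGIS harvesters; check the actual GeoNode API
# (and the remaining abstract methods, omitted here) before reusing this.
import requests

from geonode.harvesting.harvesters.base import BaseHarvesterWorker  # assumed path


class MyCatalogueHarvester(BaseHarvesterWorker):
    """Harvests records from a hypothetical remote catalogue."""

    def check_availability(self, timeout_seconds: int = 5) -> bool:
        # Probe the remote endpoint before scheduling any harvesting job.
        # `self.remote_url` is assumed to be set by the base class constructor.
        try:
            response = requests.get(self.remote_url, timeout=timeout_seconds)
            return response.status_code == 200
        except requests.RequestException:
            return False

    def get_num_available_resources(self) -> int:
        # Hypothetical endpoint returning {"count": <int>}.
        response = requests.get(f"{self.remote_url}/records/count", timeout=10)
        response.raise_for_status()
        return int(response.json()["count"])

    def list_resources(self, offset: int = 0) -> list:
        # Hypothetical endpoint returning a page of harvestable records; the real
        # base class likely expects its own resource descriptor objects instead.
        response = requests.get(
            f"{self.remote_url}/records", params={"offset": offset}, timeout=30
        )
        response.raise_for_status()
        return response.json()["records"]
```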
Issues
The GeoNode master demo instance has been used extensively to test the harvesters, by configuring several Remote Services and some additional harvesters (through the Django admin). This stress testing revealed some fragilities in the management and execution of harvester jobs. In particular, we noticed problems with:
- some jobs for the retrieval of harvestable resources get stuck
- some jobs for the retrieval of updates to the configured harvestable resources get stuck
- the state of the harvesters is not always fully restored after forcing ongoing or scheduled jobs to end
Analysis and hardening
We want to investigate whether any of the following factors can lead to failures in the management of harvesting jobs, and to what extent:
- restarts of GeoNode and/or Celery Docker services (e.g. during redeploys)
- network latencies/errors between GeoNode, RabbitMQ and Celery workers
- number of concurrently scheduled jobs
As a result of the analysis, we want to implement any useful hardening to mitigate the reported problems and improve the reliability of the Harvesting engine.
@giohappy Besides the examples in the docs, do you have more harvester sources that I can test/try?
On Friday (1st April) our https://atlas.thuenen.de GeoNode instance will be publicly available. It runs 3.3.x.
@italogsfernandes
- https://risk.spc.int/ (GeoNode harvester)
- http://ihp-wins.unesco.org (GeoNode, but I'm not 100% sure this instance runs a version compatible with the 3.3.x expected by the harvester)
- https://www.geonode-gfdrrlab.org/geoserver/ows (WMS)
- http://ihp-wins.unesco.org/geoserver/ows (WMS)
For the three cases to be analyzed:
- restarts of GeoNode and/or Celery Docker services (e.g. during redeploys):
- The job gets stuck, and after restarting Celery it is not started again, so we need to "re-scan" the harvest source. If I remember correctly, Celery raises an exception when a termination process is started; this exception can be used to handle some cases (see the sketch after this list). Also, when the Celery worker starts, an automatic job that restarts the stopped ones could be added.
- network latencies/errors between GeoNode, RabbitMQ and Celery workers
- This was not verified.
- number of concurrently scheduled jobs
- I didn't find any problem here.
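As a rough sketch of the termination idea from the first bullet above, something like the following could flag in-flight jobs when Celery begins shutting down, so they can be found and re-run after the restart. `worker_shutting_down` is a real Celery signal, but `HarvestingJob` and its status constants are hypothetical placeholders, not GeoNode's actual harvesting models:

```python
# Sketch: flag in-flight harvesting jobs when the Celery worker begins shutting
# down (e.g. during a docker-compose redeploy), so they can be detected and
# re-run afterwards. `worker_shutting_down` is a real Celery signal, but
# `HarvestingJob`, its manager and the status constants are hypothetical
# placeholders, not GeoNode's actual harvesting models.
import logging

from celery.signals import worker_shutting_down

logger = logging.getLogger(__name__)


@worker_shutting_down.connect
def flag_interrupted_jobs(sig=None, how=None, exitcode=None, **kwargs):
    """Record that the worker is going away so stuck jobs can be spotted later."""
    from myapp.models import HarvestingJob  # hypothetical model

    updated = HarvestingJob.objects.filter(
        status=HarvestingJob.STATUS_RUNNING
    ).update(status=HarvestingJob.STATUS_INTERRUPTED)
    logger.warning(
        "Celery worker shutting down (%s, exit code %s): flagged %d running "
        "harvesting job(s) as interrupted",
        how,
        exitcode,
        updated,
    )
```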
As we suspected, the restarts are the main issue here. @italogsfernandes in theory a self-healing solution would be needed, letting Celery restore the status of running harvester jobs, but that would be quite complex work. For the moment the simplest solution would be to clean up the status of any "running" job after a restart. Scheduled harvesters will restart their jobs by themselves. Unscheduled jobs (e.g. the ones created when harvesting resources for a Remote Service) will have to be re-run manually.
What's your opinion?
I would like to also hear an opinion from @ricardogsilva on this.
@giohappy I agree with you: the simple solution of manually cleaning up the status of any job that happened to be running when a restart was performed is likely the way to go for now.
Implementing some sort of self-healing would be nice, but if it is meant to recover from a hard restart of Celery, I think the implementation would be a bit complex. Maybe we can simply have some sort of post-restart script that automatically resets the statuses back to READY and re-launches jobs immediately? Despite bringing this up, I'm not 100% sure whether this would be a good approach. Sometimes requiring human intervention after a restart/reboot is not a bad idea - I guess we can discuss this further and try to come up with a better solution.
For the moment the proposal is to just do a clean-up of the jobs when Celery restarts. Unscheduled jobs will have to be re-run manually.
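One possible shape for that clean-up, as a hedged sketch: hook Celery's `worker_ready` signal so that any job left in a "running" state by a dead worker is reset when the worker comes back up. Again, `HarvestingJob` and the status constants are hypothetical placeholders; an equivalent Django management command run after each deploy would work just as well:

```python
# Sketch: reset jobs left in a "running" state by a previous (dead or restarted)
# worker as soon as the Celery worker comes back up. `worker_ready` is a real
# Celery signal; `HarvestingJob` and the status constants are hypothetical
# placeholders for GeoNode's actual harvesting models.
import logging

from celery.signals import worker_ready

logger = logging.getLogger(__name__)


@worker_ready.connect
def reset_stale_harvesting_jobs(sender=None, **kwargs):
    """Clean up jobs orphaned by a restart so harvesters start from a known state."""
    from myapp.models import HarvestingJob  # hypothetical model

    count = HarvestingJob.objects.filter(
        status=HarvestingJob.STATUS_RUNNING
    ).update(status=HarvestingJob.STATUS_READY)
    if count:
        logger.warning(
            "Reset %d stale harvesting job(s) to READY after worker restart", count
        )
```

Note that with multiple Celery workers this hook would also reset jobs another, still-alive worker is processing, so tying the clean-up to container startup or to an explicit management command may be safer.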
@ricardogsilva @afabiani the proposal from @italogsfernandes is to adopt the following for celery restarts:
celery --workdir /usr/src/geonode --app geonode.celery_app purge -f
If you think it would work, let's create a PR for it.