dataverse-kubernetes

Expose Solr readiness/liveness, performance metrics and more to K8s/Prometheus

Open poikilotherm opened this issue 5 years ago • 2 comments

This is related to #82 and #85.

A very thorough article with plenty of usable material: https://lucidworks.com/post/running-solr-on-kubernetes-part-1/

poikilotherm, Sep 05 '19 12:09

Documenting special handling of Solr in 729a0033 here.

Without any readinessProbe or livenessProbe, the old Solr container is killed immediately after a new one is started to replace it (e.g. during an update). The few seconds between terminating the old container and core loading on the new instance were sufficient to release the IndexWriter lock (/data/index/write.lock).

Now that we have probes, the old container is not killed until the new one is ready to serve requests (checked by getting system info). (We cannot ping the core, as that would block due to the lock.) The time between "ready" and termination of the old container is in turn too long for the lock to be released in time for core loading on the new instance.

Here the livenessProbe kicks in: its check fails, and because the failure threshold is set to 1, the container is restarted. By the time the restart has happened, the old container has had enough time to release the lock.
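
For illustration only, here is a rough sketch of how the two probes described above might look, written as plain Python dicts dumped to JSON (kubectl accepts JSON manifests as well as YAML). This is not the content of commit 729a0033. The core name, the timing values and the use of a core ping for the livenessProbe are all assumptions; only "system info for readiness" and "failure threshold of 1 for liveness" come from the description above.

```python
import json

SOLR_PORT = 8983       # Solr's default port; adjust if the image differs
CORE = "collection1"   # hypothetical core name, not taken from this issue


def http_probe(path, **settings):
    """Build a Kubernetes HTTP GET probe as a plain dict."""
    return {"httpGet": {"path": path, "port": SOLR_PORT}, **settings}


# Readiness: query system info only, because pinging the core
# would block while the old container still holds write.lock.
readiness_probe = http_probe("/solr/admin/info/system", periodSeconds=10)

# Liveness: ASSUMED to ping the core. With failureThreshold=1 the
# container is restarted on the first failed check.
liveness_probe = http_probe(
    f"/solr/{CORE}/admin/ping",
    initialDelaySeconds=30,  # assumed value
    periodSeconds=10,        # assumed value
    failureThreshold=1,
)

# Fragment to merge into the Solr container spec of the Deployment.
container_fragment = {
    "readinessProbe": readiness_probe,
    "livenessProbe": liveness_probe,
}

if __name__ == "__main__":
    print(json.dumps(container_fragment, indent=2))
```

Keeping the readiness check on a core-independent endpoint is what lets the new pod report "ready" while the lock is still held; the strict liveness threshold then forces the restart that gives the lock time to clear.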

I dunno if a core RELOAD would be OK, too. I wanted to make sure that it reaches a workable state again. This might come back to bite us someday... Maybe switch to SolrCloud anyway.

poikilotherm, Oct 17 '19 16:10

For the Prometheus exporter: https://lucene.apache.org/solr/guide/7_6/monitoring-solr-with-prometheus-and-grafana.html

Not sure if this should run in a sidecar or be a separate deployment.

poikilotherm, Oct 17 '19 19:10