manticoresearch
searchd closed when manually issuing 'JOIN CLUSTER' following a pod failure.
Describe the bug I was unable to manually issue 'JOIN CLUSTER' following a pod failure.
I THINK it might have been memory related. See https://forum.manticoresearch.com/t/when-a-node-joins-a-cluster-what-happens-if-alrady-local-indexes/1155 for context. When rejoining the cluster, the worker pod still had local indexes. Possibly the replication needed too much memory to resync the files on joining.
So searchd was killed for using too much memory, rather than actually crashing.
Once I deleted the local indexes, I was able to join the cluster successfully.
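For anyone hitting the same state, the recovery sequence I used was roughly the following sketch (the table name `gridprefix` is from my setup; the cluster name and donor address are placeholders):

```sql
-- On the failed node, after searchd has restarted outside the cluster:
DROP TABLE gridprefix;   -- remove the stale local copy (repeat for other tables)

-- Then rejoin via the donor node (cluster name and host:port are placeholders):
JOIN CLUSTER mycluster AT 'manticorert1:9312';
```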
To Reproduce
Describe the environment:
- Manticore Search version: 5.0.0 b4cb7da02@220518 release
- OS version: Manticore Search Helm Chart Version: 5.0.0.2
Messages from log files: https://staging.data.geograph.org.uk/facets/manticorert2.2022-08-24.log.filtered.txt This is the entirety of the searchd.log I was able to recover (complicated because the pod puts the query_log into the same stream, so I had to filter out queries - we have multi-line queries, so tricky!)
The first KILL is known. That is when I inserted too much data, the rt_mem_limit exceeded the resources.limit for the worker pod.
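If the pod's memory limit is tight, one mitigation (an assumption on my part, not verified against this exact failure) would be to cap the RT table's RAM chunk so growth stays under the cgroup limit, for example:

```sql
-- Cap the RAM chunk for the RT table well below the pod's resources.limit.
-- Table name is from my setup; the 128M value is just an illustrative figure.
ALTER TABLE gridprefix rt_mem_limit='128M';
```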
The second KILL is when I tried to get the worker pod to rejoin the cluster manually.
The 'drop gridprefix' syntax error is when searchd has come back just after the second KILL. It was me attempting to delete the local indexes to retry joining the cluster.
I don't know why the logs end there. I get nothing after that.
Additional context
And here is the log from the 'donor node', the node I tried to connect the failed node to: https://staging.data.geograph.org.uk/facets/manticorert1.2022-08-24.log.filtered.txt
You can see the [WARNING: '10.72.45.210:9312': agent closed connection ] - i.e. that's the agent I was trying to rejoin. But it crashed (well, was terminated!) during the join process.
Nothing about the second (successful!) join attempt.
@barryhunter it might make sense to run the searchd with --logreplication
and reproduce the issue.
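For reference, a direct invocation would look something like this (paths are examples; inside the helm chart searchd is started via supervisord, so this only works where you control the command line):

```shell
# Start searchd with verbose replication logging enabled.
searchd --config /etc/manticoresearch/manticore.conf --logreplication
```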
Hmm, not sure how to add command-line switches with the helm chart.
We actually use Flux Helm Controller, so we can't edit the chart sources (just override values.yaml), and therefore can't modify supervisord.conf. https://gist.github.com/barryhunter/a1375c3238bc26eb25645f2e68161d8e
Hmm, I wonder if I can try to replicate it on play.manticoresearch.com :) It's a bit much at the moment to get my own cluster going so I could use helm directly.
Best if that can be reproduced without k8s at all. It's not a big deal to run 3 instances on 3 different ports on the same server and play with them, including hard-killing one of the instances to emulate the OOM kill caused by cgroups (resources.limit) etc. If the issue can be reproduced this way it will be much easier to debug and fix.
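A minimal sketch of that local setup (config file names, ports, and the cluster name are all placeholders; each config would need its own listen ports, data_dir, and pid_file):

```shell
# Three searchd instances on one host, each with its own config/ports.
searchd --config node1.conf   # e.g. listen = 127.0.0.1:9306 / 9312
searchd --config node2.conf   # e.g. listen = 127.0.0.1:9307 / 9313
searchd --config node3.conf   # e.g. listen = 127.0.0.1:9308 / 9314

# Hard-kill one instance to emulate the OOM kill (no graceful shutdown):
kill -9 "$(cat node2.pid)"

# Restart it, then attempt the manual rejoin, e.g. via the mysql client:
# mysql -h127.0.0.1 -P9307 -e "JOIN CLUSTER mycluster AT '127.0.0.1:9312'"
```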
Ok, good point. Yes, I should be able to manage that.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Feel free to re-open the issue in case it becomes actual.