searchd closed when manually issuing 'JOIN CLUSTER' following a pod failure.

Open barryhunter opened this issue 2 years ago • 5 comments

Describe the bug Was unable to manually issue 'JOIN CLUSTER' following a pod failure.

I THINK it might have been memory related. See https://forum.manticoresearch.com/t/when-a-node-joins-a-cluster-what-happens-if-alrady-local-indexes/1155 for context. When rejoining the cluster, the worker pod still had local indexes. Possibly the replication needed too much memory to resync the files when JOINing.

So searchd was killed for using too much memory, rather than actually crashing.

Once I deleted the local indexes, was able to join the cluster successfully.

To Reproduce

Describe the environment:

  • Manticore Search version: 5.0.0 b4cb7da02@220518 release
  • OS version: Manticore Search Helm Chart Version: 5.0.0.2

Messages from log files: https://staging.data.geograph.org.uk/facets/manticorert2.2022-08-24.log.filtered.txt This is the entirety of the searchd.log that I was able to recover (complicated, as the pod puts query_log into the same stream, so I had to filter out queries. We have multi-line queries, so tricky!)

The first KILL is known. That is when I inserted too much data, the rt_mem_limit exceeded the resources.limit for the worker pod.

The second KILL is when I tried to get the worker pod to rejoin the cluster manually.

The 'drop gridprefix' syntax error is when searchd has come back just after the second KILL. It was me attempting to delete the local indexes to retry joining the cluster.

I don't know why the logs end there. I get nothing after that.

Additional context Add any other context about the problem here. In case you've faced a crash what indextool --check returns.

barryhunter avatar Aug 25 '22 10:08 barryhunter

And here is the log from the 'donor node', the node I tried to connect the failed node to: https://staging.data.geograph.org.uk/facets/manticorert1.2022-08-24.log.filtered.txt

You can see the [WARNING: '10.72.45.210:9312': agent closed connection ] message, i.e. that is the agent I was trying to rejoin, but it crashed (well, was terminated!) during the join process.

Nothing about the second (successful!) join attempt.

barryhunter avatar Aug 25 '22 10:08 barryhunter

@barryhunter it might make sense to run the searchd with --logreplication and reproduce the issue.

sanikolaev avatar Aug 26 '22 08:08 sanikolaev

Hmm, not sure how to add command line switches with the helm chart.

We actually use Flux Helm Controller, so can't actually edit the chart sources (just override values.yaml), so can't modify supervisord.conf. https://gist.github.com/barryhunter/a1375c3238bc26eb25645f2e68161d8e

Hmm, wonder if can try to replicate it on play.manticoresearch.com :) Bit much at the moment to try getting my own cluster going so could use helm directly.

barryhunter avatar Aug 26 '22 10:08 barryhunter

Best if that can be reproduced without k8s at all. It's not a big deal to run 3 instances on 3 different ports on the same server and play with them, including hard-killing one of the instances to emulate an OOM kill caused by cgroups (resources.limit), etc. If the issue can be reproduced this way, it will be much easier to debug and fix it.
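
As a rough starting point for such a local setup, here is a minimal sketch that generates one searchd config per instance, each with its own ports and data directory. Everything specific in it is an assumption for illustration, not something taken from this issue: the helper name `render_searchd_config`, the `/tmp/manticore-repro` base directory, the 100-port spacing between instances, and the replication port ranges. The `mysql41` listener name is used because the reported version is 5.0.0.

```python
from textwrap import dedent


def render_searchd_config(node: int, base_dir: str = "/tmp/manticore-repro") -> str:
    """Return a minimal searchd config for local instance `node` (0, 1, 2, ...).

    Hypothetical helper: ports, paths and spacing are illustrative assumptions.
    """
    offset = node * 100  # give each instance its own, non-overlapping port block
    node_dir = f"{base_dir}/node{node}"
    return dedent(f"""\
        searchd {{
            listen = 127.0.0.1:{9312 + offset}
            listen = 127.0.0.1:{9306 + offset}:mysql41
            listen = 127.0.0.1:{9360 + offset}-{9370 + offset}:replication
            data_dir = {node_dir}/data
            log = {node_dir}/searchd.log
            pid_file = {node_dir}/searchd.pid
            server_id = {node + 1}
        }}
        """)


if __name__ == "__main__":
    # Print the three configs; save each one to its own file before starting searchd.
    for node in range(3):
        print(f"# ---- node {node} ----")
        print(render_searchd_config(node))
```

Each instance could then be started with its own config and `--logreplication`, the cluster created on the first instance and joined from the others with `JOIN CLUSTER`, and one instance terminated with SIGKILL while it is joining, which is roughly what the kernel OOM killer does to a process that exceeds its cgroup limit. The directories in `data_dir` need to exist before searchd starts.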

sanikolaev avatar Aug 26 '22 11:08 sanikolaev

Ok good point. Yes should be able to manage that.

barryhunter avatar Aug 26 '22 11:08 barryhunter

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Feel free to re-open the issue in case it becomes actual.

stale[bot] avatar Oct 01 '22 07:10 stale[bot]