manticoresearch
FATAL: cluster cmd: 5, error: query timed out
Describe the bug
After a node goes offline and comes back online, it starts rejoining the cluster.
In the end, the state changes to destroyed.
From the logs:
mc1 | FATAL: 'cluster1' cluster [192.168.1.58:37960], cmd: 5, error: '192.168.1.56:9312': query timed out
To Reproduce
I don't know how I got into this state. It repeats every time I bring the node back online: the join takes a long time, then the state becomes destroyed.
Describe the environment:
- docker manticoresearch/manticore:6.2.12
- alpine linux
Messages from log files: Messages from searchd.log and query.log (if applicable).
mc1 | DEBUG: preread table 'shard1_chat_message' in 389.159 sec
mc1 | DEBUG: prereading table 'shard5_chat_message'
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1555'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1555)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1558'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1558)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1540'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1540)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1574'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1574)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1575'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1575)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1576'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1576)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1579'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1579)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1580'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1580)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1566'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1566)
mc1 | DEBUG: Preread successfully finished
mc1 | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1582'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1582)
mc1 | FATAL: 'cluster1' cluster [192.168.1.58:37960], cmd: 5, error: '192.168.1.56:9312': query timed out
mc1 | DEBUG: Detached::RemoveThread called for 82
mc1 | DEBUG: Terminated thread 82, 'cluster1_repl_2'
mc1 | DEBUG: Preread successfully finished
The join query keeps returning:
ERROR 2013 (HY000) at line 1: Lost connection to server during query
Is there a way to make a node in the "destroyed" state rejoin without a restart?
The join query returns:
ERROR 1064 (42000) at line 1: cluster 'cluster3' already exists
After a restart, all clusters take a very long time to rejoin.
You need to restart the nodes with --logreplication to enable logging of cluster-related events on all nodes, then upload searchd.log from all nodes so we can investigate the issue.
Thanks for the tip.
The cluster is already set up; I will add --logreplication next time.
I cannot restart it now, because every node takes about 1 hour to join after it starts. I use 6 instances and 6 clusters. Here is the layout:
cluster0 = instance0 instance1 instance2
cluster1 = instance1 instance2 instance3
cluster2 = instance2 instance3 instance4
cluster3 = instance3 instance4 instance5
cluster4 = instance4 instance5 instance0
cluster5 = instance5 instance0 instance1
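The layout above follows a ring pattern: cluster i contains instances i, i+1 and i+2, wrapping around at 6. As a minimal sketch (the function name `ring_layout` is hypothetical and not part of Manticore), the same mapping can be generated like this:

```python
def ring_layout(n_instances: int = 6, replicas: int = 3) -> dict[str, list[str]]:
    """Build the cluster -> instances mapping described in the layout above."""
    return {
        # cluster i holds instance i and the next (replicas - 1) instances, modulo n
        f"cluster{i}": [f"instance{(i + k) % n_instances}" for k in range(replicas)]
        for i in range(n_instances)
    }


layout = ring_layout()
for name, members in layout.items():
    print(name, "=", " ".join(members))
```

With this layout every instance is a member of 3 clusters, so a restarted instance has to rejoin 3 clusters.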
I have 2 questions:
- When I run the join, every instance needs 1 hour or more to sync. Each instance holds 84 GB of data (3 local tables per node, 28 GB each) on SSD. Is this normal?
- When I run the join, it sometimes returns a query timeout. This is a private network, and I also run minio on the same instances (it uses its own disks, so IO should be isolated). Is there a way to speed up the join process and avoid the query timeout?
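To put the first question in numbers, here is a rough back-of-the-envelope calculation of the effective sync rate. It assumes the 84 GB figure means GiB and that the sync takes exactly 1 hour; the helper name `effective_rate_mib_s` is made up for this illustration.

```python
def effective_rate_mib_s(size_gib: float, hours: float) -> float:
    """Average transfer rate in MiB/s for syncing size_gib in the given time."""
    return size_gib * 1024 / (hours * 3600)


# 3 local tables of 28 GB each, synced in about 1 hour (figures from the report)
rate = effective_rate_mib_s(3 * 28, 1.0)
print(f"{rate:.1f} MiB/s")  # prints "23.9 MiB/s"
```

If the sync takes longer than 1 hour, the effective rate is correspondingly lower.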
The join is limited first by the IO of the disk that reads the SHA1 of the index chunks, then by the disk IO and network bandwidth needed to read the missing index chunks and transfer them over the network.
You could copy all indexes from the data_dir of the donor instance into the data_dir of the new instance to speed up the SST phase of the join.
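The idea of hashing chunks first and transferring only the missing ones can be illustrated with a short sketch. This is not Manticore's actual implementation; the function names are hypothetical, and it simply compares SHA1 digests of the files in two directories to show why pre-copied files would not need to be sent again.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large chunks are not loaded into memory at once."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def hash_dir(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA1 digest."""
    return {
        str(p.relative_to(data_dir)): sha1_of_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def files_to_transfer(donor: dict[str, str], joiner: dict[str, str]) -> list[str]:
    """Files the joiner lacks, or holds with different content than the donor."""
    return sorted(name for name, h in donor.items() if joiner.get(name) != h)
```

In this model, every file still has to be read once for hashing, but only the files returned by `files_to_transfer` would cross the network.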
Could you also post your config so we can check which timeouts you have? Could you also show the output of this SphinxQL statement:
SHOW SETTINGS;
I use the default config, with buddy_path set to empty.
+ mysql -h mcc -P 9306 -e 'SHOW SETTINGS;'
+--------------------------+-------------------------------------+
| Setting_name | Value |
+--------------------------+-------------------------------------+
| configuration_file | /etc/manticoresearch/manticore.conf |
| worker_pid | 1 |
| searchd.data_dir | /var/lib/manticore |
| searchd.listen | 9306:mysql41 |
| searchd.listen | /var/run/mysqld/mysqld.sock:mysql41 |
| searchd.listen | 192.168.1.55:9312 |
| searchd.listen | 9308:http |
| searchd.log | /var/log/manticore/searchd.log |
| searchd.max_packet_size | 128M |
| searchd.pid_file | /var/run/manticore/searchd.pid |
| searchd.query_log_format | sphinxql |
| searchd.buddy_path | |
| searchd.binlog_path | /var/lib/manticore/binlog |
| common.plugin_dir | /usr/local/lib/manticore |
+--------------------------+-------------------------------------+
I will add --logreplication next time.
Did you have a chance to do it?
This issue is outdated. I'm closing it. Feel free to reopen if you can provide more info on the matter.