
FATAL: cluster cmd: 5, error: query timed out

Open calvin2021y opened this issue 2 years ago • 8 comments

Describe the bug: After one node went offline and came back online, it started joining the cluster.

In the end, the cluster state changed to destroyed.

From the logs:

mc1  | FATAL: 'cluster1' cluster [192.168.1.58:37960], cmd: 5, error: '192.168.1.56:9312': query timed out

To Reproduce

I don't know how I got into this problem. It repeats every time I bring the node online again (the join takes a long time, then the state becomes destroyed).

Describe the environment:

  • docker manticoresearch/manticore:6.2.12
  • alpine linux

Messages from log files: Messages from searchd.log and query.log (if applicable).

mc1  | DEBUG: preread table 'shard1_chat_message' in 389.159 sec
mc1  | DEBUG: prereading table 'shard5_chat_message'
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1555'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1555)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1558'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1558)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1540'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1540)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1574'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1574)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1575'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1575)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1576'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1576)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1579'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1579)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1580'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1580)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1566'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1566)
mc1  | DEBUG: Preread successfully finished
mc1  | DEBUG: CSphIndex_VLN::Preread invoked '/var/lib/manticore/shard5_chat_message/shard5_chat_message.1582'(/var/lib/manticore/shard5_chat_message/shard5_chat_message.1582)
mc1  | FATAL: 'cluster1' cluster [192.168.1.58:37960], cmd: 5, error: '192.168.1.56:9312': query timed out
mc1  | DEBUG: Detached::RemoveThread called for 82
mc1  | DEBUG: Terminated thread 82, 'cluster1_repl_2'
mc1  | DEBUG: Preread successfully finished

calvin2021y avatar Nov 14 '23 14:11 calvin2021y

The join query keeps returning:

ERROR 2013 (HY000) at line 1: Lost connection to server during query

calvin2021y avatar Nov 14 '23 18:11 calvin2021y

Is there a way to make a node in the "destroyed" state rejoin without a restart?

The join query returns:

ERROR 1064 (42000) at line 1: cluster 'cluster3' already exists

After a restart, all clusters take a very long time to rejoin.

calvin2021y avatar Nov 14 '23 18:11 calvin2021y

You need to restart the nodes with --logreplication to enable logging of cluster-related events on all nodes, then upload searchd.log from all nodes so the issue can be investigated.

tomatolog avatar Nov 14 '23 20:11 tomatolog

Thanks for the tip.

The cluster is already set up; I will add --logreplication next time.

I cannot restart it now: every node takes 1 hour to join after it starts. I use 6 instances and 6 clusters. Here is the layout:

cluster0 = instance0 instance1 instance2
cluster1 = instance1 instance2 instance3
cluster2 = instance2 instance3 instance4
cluster3 = instance3 instance4 instance5
cluster4 = instance4 instance5 instance0 
cluster5 = instance5 instance0 instance1 
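
The layout above follows a ring pattern: cluster N holds instances N, N+1, and N+2, wrapping around after instance5. A minimal Python sketch reproduces it and inverts it to show which clusters each instance belongs to. The function names `ring_layout` and `memberships` are hypothetical, chosen only for this illustration; nothing here talks to Manticore.

```python
from collections import defaultdict

def ring_layout(instances=6, replicas=3):
    """Map cluster index -> list of instance indexes, as in the layout above."""
    return {
        c: [(c + k) % instances for k in range(replicas)]
        for c in range(instances)
    }

def memberships(layout):
    """Invert the layout: instance index -> sorted clusters it belongs to."""
    by_instance = defaultdict(list)
    for cluster, members in layout.items():
        for inst in members:
            by_instance[inst].append(cluster)
    return {inst: sorted(cs) for inst, cs in by_instance.items()}

layout = ring_layout()
for cluster, members in layout.items():
    print(f"cluster{cluster} =", " ".join(f"instance{m}" for m in members))

for inst, clusters in sorted(memberships(layout).items()):
    print(f"instance{inst} is in", ", ".join(f"cluster{c}" for c in clusters))
```

With this layout every instance is a member of three clusters, so a single restarted instance presumably has three clusters to rejoin.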

I have 2 questions:

  1. When I run the join, every instance needs 1 hour or more to sync. Every instance has 84 GB of data (each local table is 28 GB, and each node has 3 local tables) on an SSD. Is this normal?

  2. When I run the join, it sometimes returns a query timeout. This is a private network, and I also run MinIO on the same instances (each service uses its own disk, so IO should be isolated). Is there a way to speed up the join process and avoid the query timeout?
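
For reference, the figures in the first question translate into an average transfer rate. The sketch below is plain arithmetic under two stated assumptions: "G" means decimal gigabytes (1 GB = 1000 MB), and the join takes exactly one hour, which is the lower bound given above. The function name is hypothetical.

```python
def effective_throughput_mb_s(size_gb, seconds):
    """Average rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / seconds

tables_per_node = 3
table_size_gb = 28                            # per local table, as reported
total_gb = tables_per_node * table_size_gb    # 84 GB per instance
join_seconds = 60 * 60                        # "1 hour or more": lower bound

rate = effective_throughput_mb_s(total_gb, join_seconds)
print(f"{total_gb} GB in {join_seconds} s is about {rate:.1f} MB/s")
```

Because the join took an hour or more, this figure is an upper bound on the average rate actually achieved.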

calvin2021y avatar Nov 15 '23 05:11 calvin2021y

The join is limited first by the IO of the disk that reads the SHA1 of the index chunks, then by disk IO and network bandwidth when the missing index chunks are read and transferred over the network. You could copy all indexes from the data_dir of the donor instance into the data_dir of the new instance to speed up the SST phase of the join.
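
The idea of hashing chunks first and transferring only what differs can be sketched in Python as follows. This is an illustration of the general technique, not Manticore's actual implementation: the function names, the 1 MiB read size, and the per-file comparison are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

def sha1_of_file(path, block_size=1024 * 1024):
    """Stream a file through SHA1 so large chunks are not loaded into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def hash_table_dir(table_dir):
    """Return {relative file name: SHA1 hex digest} for every file in a directory."""
    root = Path(table_dir)
    return {
        str(p.relative_to(root)): sha1_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def files_to_transfer(donor_hashes, joiner_hashes):
    """Files the joiner lacks or holds with different content (hypothetical helper)."""
    return sorted(
        name
        for name, digest in donor_hashes.items()
        if joiner_hashes.get(name) != digest
    )
```

In this model, hashing costs disk reads on both sides even when nothing has to be sent, and files that were copied to the joiner beforehand and still match the donor drop out of the transfer list, which is the effect the pre-copy suggestion relies on.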

tomatolog avatar Nov 15 '23 08:11 tomatolog

Could you also post your config so we can check which timeouts you have set? Could you also show the output of this SphinxQL statement:

SHOW SETTINGS;

tomatolog avatar Nov 16 '23 08:11 tomatolog

I use the default config, with buddy_path = (set to empty):

+ mysql -h mcc -P 9306 -e 'SHOW SETTINGS;'
+--------------------------+-------------------------------------+
| Setting_name             | Value                               |
+--------------------------+-------------------------------------+
| configuration_file       | /etc/manticoresearch/manticore.conf |
| worker_pid               | 1                                   |
| searchd.data_dir         | /var/lib/manticore                  |
| searchd.listen           | 9306:mysql41                        |
| searchd.listen           | /var/run/mysqld/mysqld.sock:mysql41 |
| searchd.listen           | 192.168.1.55:9312                   |
| searchd.listen           | 9308:http                           |
| searchd.log              | /var/log/manticore/searchd.log      |
| searchd.max_packet_size  | 128M                                |
| searchd.pid_file         | /var/run/manticore/searchd.pid      |
| searchd.query_log_format | sphinxql                            |
| searchd.buddy_path       |                                     |
| searchd.binlog_path      | /var/lib/manticore/binlog           |
| common.plugin_dir        | /usr/local/lib/manticore            |
+--------------------------+-------------------------------------+

calvin2021y avatar Nov 20 '23 05:11 calvin2021y

I will add --logreplication next time.

Did you have a chance to do it?

sanikolaev avatar Nov 28 '23 08:11 sanikolaev

This issue is outdated. I'm closing it. Feel free to reopen if you can provide more info on the matter.

sanikolaev avatar Feb 15 '24 08:02 sanikolaev