Replication issue while sharding with Buddy
When we use 5 nodes and create a sharded table with rf=3 and shards=60, we get a race-condition crash at a random point in time.
Here are the logs I captured from the actual case.
To reproduce it, we can run it with CLT:
clt/clt test -d -t test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
I also added some cases to test replication but failed to reproduce the issue there. Maybe the actual logs will give more information about why the cluster becomes non-primary.
I used a bash script to set it up manually:
#!/usr/bin/env bash
# stop any running searchd instances and wipe the run/data/log dirs for all 5 nodes
pgrep -f searchd | xargs kill -9
sleep 2
rm -fr /var/{run,lib,log}/manticore-{1,2,3,4,5}
mkdir -p /var/{run,lib,log}/manticore-{1,2,3,4,5}
# start 5 daemons with replication logging enabled
for i in 1 2 3 4 5; do
searchd -c test/clt-tests/sharding/base/config/searchd-$i.conf --logreplication
done
mysql -h0 -P19306 -e "create cluster c"
for i in 2 3 4 5; do
mysql -h0 -P${i}9306 -e "join cluster c at '127.0.0.1:19312'"
done
After that, when we create a table with sharding, we hit the issue:
create table c:t shards=60 rf=3
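To see which node drops out of the primary component when this happens, a quick per-node status check helps (ports follow the setup script above; this is just a diagnostic sketch, not part of the test):
# check the replication state of cluster 'c' on all 5 nodes
for i in 1 2 3 4 5; do
  echo "--- node $i ---"
  mysql -h0 -P${i}9306 -e "show status like 'cluster_c_status'"
  mysql -h0 -P${i}9306 -e "show status like 'cluster_c_node_state'"
done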
Sample config:
common {
plugin_dir = /usr/local/lib/manticore
lemmatizer_base = /usr/share/manticore/morph/
}
searchd {
listen = 19306:mysql41
listen = 19312
listen = 19308:http
log = /var/log/manticore-1/searchd.log
query_log = /var/log/manticore-1/query.log
pid_file = /var/run/manticore-1/searchd.pid
data_dir = /var/lib/manticore-1
query_log_format = sphinxql
query_log_commands = 1
buddy_path = manticore-executor /workdir/src/main.php
}
Buddy must be used from the fix/sharding branch and installed manually with all Composer packages.
I just discovered that this case is probably also covered by the following test:
clt/clt test -d -t test/clt-tests/sharding/cluster/attach-table-while-joining-cluster.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
https://github.com/manticoresoftware/manticoresearch/actions/runs/6416223732/job/17420158212
P.S. It doesn't fail 100% of the time, but it fails often:
- 0
+ ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
+ 0
––– input –––
for i in 1 2 3; do tail -n 100 /var/log/manticore-${i}/searchd.log; done
––– output –––
+ [Thu Oct 5 08:28:58.167 2023] [565] watchdog: main process 566 forked ok
+ [Thu Oct 5 08:28:58.168 2023] [566] starting daemon version '6.2.13 fc0b860f3@231002 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)' ...
+ [Thu Oct 5 08:28:58.169 2023] [566] listening on all interfaces for mysql, port=19306
+ [Thu Oct 5 08:28:58.169 2023] [566] listening on all interfaces for sphinx and http(s), port=19312
+ [Thu Oct 5 08:28:58.169 2023] [566] listening on all interfaces for sphinx and http(s), port=19308
+ [Thu Oct 5 08:28:58.191 2023] [571] prereading 0 tables
+ [Thu Oct 5 08:28:58.191 2023] [571] preread 0 tables in 0.000 sec
+ [Thu Oct 5 08:28:58.193 2023] [566] accepting connections
+ [Thu Oct 5 08:28:58.207 2023] [577] [BUDDY] started v1.1.1 'manticore-executor /workdir/src/main.php --listen=http://0.0.0.0:19312 --threads=12' at http://127.0.0.1:38375
+ [Thu Oct 5 08:28:58.263 2023] [573] [BUDDY] Loaded plugins:
+ [Thu Oct 5 08:28:58.263 2023] [573] [BUDDY] core: empty-string, backup, emulate-elastic, distributed-insert, insert, select, show, cli-table, plugin, test, insert-mva, create-table
+ [Thu Oct 5 08:28:58.263 2023] [573] [BUDDY] local:
+ [Thu Oct 5 08:28:58.263 2023] [573] [BUDDY] extra:
+ [Thu Oct 5 08:29:02.066 2023] [692] WARNING: 3c9f4089 sending install message failed: Resource temporarily unavailable
+ [Thu Oct 5 08:29:02.566 2023] [693] WARNING: Member 2.0 (node_127.0.0.1_a_658) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
+ [Thu Oct 5 08:28:59.281 2023] [607] watchdog: main process 608 forked ok
+ [Thu Oct 5 08:28:59.282 2023] [608] starting daemon version '6.2.13 fc0b860f3@231002 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)' ...
+ [Thu Oct 5 08:28:59.282 2023] [608] listening on all interfaces for mysql, port=29306
+ [Thu Oct 5 08:28:59.282 2023] [608] listening on all interfaces for sphinx and http(s), port=29312
+ [Thu Oct 5 08:28:59.282 2023] [608] listening on all interfaces for sphinx and http(s), port=29308
+ [Thu Oct 5 08:28:59.304 2023] [613] prereading 0 tables
+ [Thu Oct 5 08:28:59.304 2023] [613] preread 0 tables in 0.000 sec
+ [Thu Oct 5 08:28:59.306 2023] [608] accepting connections
+ [Thu Oct 5 08:28:59.320 2023] [619] [BUDDY] started v1.1.1 'manticore-executor /workdir/src/main.php --listen=http://0.0.0.0:29312 --threads=12' at http://127.0.0.1:36047
+ [Thu Oct 5 08:28:59.377 2023] [611] [BUDDY] Loaded plugins:
+ [Thu Oct 5 08:28:59.377 2023] [611] [BUDDY] core: empty-string, backup, emulate-elastic, distributed-insert, insert, select, show, cli-table, plugin, test, insert-mva, create-table
+ [Thu Oct 5 08:28:59.377 2023] [611] [BUDDY] local:
+ [Thu Oct 5 08:28:59.377 2023] [611] [BUDDY] extra:
+ [Thu Oct 5 08:29:02.574 2023] [710] WARNING: Member 2.0 (node_127.0.0.1_a_658) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
+ [Thu Oct 5 08:29:02.576 2023] [711] WARNING: replication error unknown table 't3', command 3
+ [Thu Oct 5 08:29:02.576 2023] [711] FATAL: Failed to apply trx: source: 3c9f4089-6359-11ee-86a1-c26d8a5020c7 version: 4 local: 0 state: COMMITTING flags: 65 conn_id: 3 trx_id: -1 seqnos (l: 3, g: 3, s: 2, d: 2, ts: 26959116430503837)
+ [Thu Oct 5 08:29:02.576 2023] [711] FATAL: Commit failed. Trx: 0x7febec000c20 (FATAL)
+ at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:apply_trx():516
+ [Thu Oct 5 08:29:02.576 2023] [711] FATAL: Node consistency compromised, aborting...
+ [Thu Oct 5 08:29:07.577 2023] [711] WARNING: abort from cluster 'a'
+ [Thu Oct 5 08:29:07.577 2023] [610] FATAL: 'a' cluster after join error: , nodes '127.0.0.1:19322'
+ [Thu Oct 5 08:29:07.578 2023] [607] watchdog: main process 608 killed dirtily with signal 6, will be restarted
+ [Thu Oct 5 08:29:07.578 2023] [607] watchdog: main process 713 forked ok
+ [Thu Oct 5 08:29:07.579 2023] [713] starting daemon version '6.2.13 fc0b860f3@231002 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)' ...
+ [Thu Oct 5 08:29:07.579 2023] [713] listening on all interfaces for mysql, port=29306
+ [Thu Oct 5 08:29:07.579 2023] [713] listening on all interfaces for sphinx and http(s), port=29312
+ [Thu Oct 5 08:29:07.579 2023] [713] listening on all interfaces for sphinx and http(s), port=29308
+ [Thu Oct 5 08:29:07.601 2023] [714] WARNING: table 't1': disabled at the JSON config
+ [Thu Oct 5 08:29:07.601 2023] [714] WARNING: table 't1': prealloc: failed to open /var/lib/manticore-2/t1/t1.lock: No such file or directory - NOT SERVING
+ [Thu Oct 5 08:29:07.601 2023] [714] WARNING: table 't2': disabled at the JSON config
+ [Thu Oct 5 08:29:07.601 2023] [714] WARNING: table 't2': prealloc: failed to open /var/lib/manticore-2/t2/t2.lock: No such file or directory - NOT SERVING
+ [Thu Oct 5 08:29:07.601 2023] [718] binlog: replaying log /var/lib/manticore-2/binlog/binlog.001
+ [Thu Oct 5 08:29:07.601 2023] [718] binlog: replay stats: 0 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 0 tables
+ [Thu Oct 5 08:29:07.601 2023] [718] binlog: finished replaying /var/lib/manticore-2/binlog/binlog.001; 0.0 MB in 0.000 sec
+ [Thu Oct 5 08:29:07.601 2023] [718] binlog: finished replaying total 1 in 0.000 sec
+ [Thu Oct 5 08:29:07.601 2023] [720] prereading 0 tables
+ [Thu Oct 5 08:29:07.601 2023] [720] preread 0 tables in 0.000 sec
+ [Thu Oct 5 08:29:07.604 2023] [713] accepting connections
+ [Thu Oct 5 08:29:07.618 2023] [715] [BUDDY] started v1.1.1 'manticore-executor /workdir/src/main.php --listen=http://0.0.0.0:29312 --threads=12' at http://127.0.0.1:39859
+ [Thu Oct 5 08:29:07.773 2023] [725] [BUDDY] Loaded plugins:
+ [Thu Oct 5 08:29:07.773 2023] [725] [BUDDY] core: empty-string, backup, emulate-elastic, distributed-insert, insert, select, show, cli-table, plugin, test, insert-mva, create-table
+ [Thu Oct 5 08:29:07.773 2023] [725] [BUDDY] local:
+ [Thu Oct 5 08:29:07.773 2023] [725] [BUDDY] extra:
+ [Thu Oct 5 08:29:00.417 2023] [650] watchdog: main process 651 forked ok
+ [Thu Oct 5 08:29:00.417 2023] [651] starting daemon version '6.2.13 fc0b860f3@231002 dev (columnar 2.2.5 b8be4eb@230928) (secondary 2.2.5 b8be4eb@230928)' ...
+ [Thu Oct 5 08:29:00.417 2023] [651] listening on all interfaces for mysql, port=39306
+ [Thu Oct 5 08:29:00.417 2023] [651] listening on all interfaces for sphinx and http(s), port=39312
+ [Thu Oct 5 08:29:00.417 2023] [651] listening on all interfaces for sphinx and http(s), port=39308
+ [Thu Oct 5 08:29:00.438 2023] [656] prereading 0 tables
+ [Thu Oct 5 08:29:00.438 2023] [656] preread 0 tables in 0.000 sec
+ [Thu Oct 5 08:29:00.441 2023] [651] accepting connections
+ [Thu Oct 5 08:29:00.455 2023] [662] [BUDDY] started v1.1.1 'manticore-executor /workdir/src/main.php --listen=http://0.0.0.0:39312 --threads=12' at http://127.0.0.1:35263
+ [Thu Oct 5 08:29:00.511 2023] [657] [BUDDY] Loaded plugins:
+ [Thu Oct 5 08:29:00.511 2023] [657] [BUDDY] core: empty-string, backup, emulate-elastic, distributed-insert, insert, select, show, cli-table, plugin, test, insert-mva, create-table
+ [Thu Oct 5 08:29:00.511 2023] [657] [BUDDY] local:
+ [Thu Oct 5 08:29:00.511 2023] [657] [BUDDY] extra:
+ [Thu Oct 5 08:29:02.566 2023] [708] WARNING: Member 2.0 (node_127.0.0.1_a_658) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
Blocked by https://github.com/manticoresoftware/dev/issues/416, as that branch has a lot of changes in replication.
Could you check the current master version of the daemon (https://github.com/manticoresoftware/manticoresearch/commit/4083d028e0cd3b3361160a99cfddd842d0e09120)? On my side I see this reply:
stas@dev2:~/bin/rpl$ mysql -h0 -P31106 -e "create table c:t shards=2 rf=3"
ERROR 1064 (42000) at line 1: Waiting timeout exceeded.
In the daemon log I see that something (Buddy?) commits data that gets replicated across the cluster. However, I cannot check the command sequence to figure out what is happening.
I use Buddy from the fix/sharding branch, installed manually with all Composer packages, and use this config for all daemons:
query_log_commands = 1
buddy_path = manticore-executor /home/stas/buddy/src/main.php
I started 3 daemons and issued:
stas@dev2:~/bin/rpl$ mysql -h0 -P31106 -e "create cluster c"
stas@dev2:~/bin/rpl$ mysql -h0 -P31206 -e "join cluster c at '127.0.0.1:31112'"
stas@dev2:~/bin/rpl$ mysql -h0 -P31306 -e "join cluster c at '127.0.0.1:31112'"
Then the create table c:t shards=2 rf=3 query finished with the "Waiting timeout exceeded" error.
I need a command sequence that I can run by hand to check what causes the "Waiting timeout exceeded" error. Alternatively, we could create a ticket so that query_log_commands = 1 also logs HTTP requests along with SphinxQL.
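In the meantime, the HTTP requests that Buddy sends back to the daemon can be captured directly on the loopback interface; this is just a workaround on my side, not a Manticore feature (the port is node 1's HTTP/binary listener from the join commands above):
# dump the HTTP payloads hitting node 1's HTTP/binary listener
tcpdump -i lo -A 'tcp port 31112'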
As discussed, @tomatolog will try to reproduce on dev2 using https://github.com/manticoresoftware/manticoresearch/blob/tests/sharding/test/clt-tests/sharding/cluster/attach-table-while-joining-cluster.rec
I use the bash script gh1485.zip, extracted from attach-table-while-joining-cluster.rec, and do not see any issues. Sometimes the cluster refuses to add indexes with an error saying that a join operation is in progress, but in that case you just need to retry after the join finishes. I do not see any crashes on any node with the current master. Here is the output showing that the cluster is operable and all nodes work as usual:
stas@dev2:~/bin/rpl$ ./gh1485.sh
--------------
delete cluster a
--------------
Query OK, 0 rows affected (5.50 sec)
Bye
--------------
delete cluster a
--------------
ERROR 1064 (42000) at line 1: (null)
Bye
--------------
delete cluster a
--------------
ERROR 1064 (42000) at line 1: (null)
Bye
--------------
create cluster a
--------------
Query OK, 0 rows affected (0.44 sec)
Bye
--------------
drop table if exists t1
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
drop table if exists t2
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
drop table if exists t3
--------------
Query OK, 0 rows affected (0.00 sec)
Bye
--------------
create table t1
--------------
Query OK, 0 rows affected (0.00 sec)
--------------
create table t2
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t3
--------------
Query OK, 0 rows affected (0.00 sec)
Bye
--------------
show tables
--------------
+-------+------+
| Index | Type |
+-------+------+
| t1 | rt |
| t2 | rt |
| t3 | rt |
| t_s0 | rt |
| t_s1 | rt |
| test2 | rt |
+-------+------+
6 rows in set (0.00 sec)
Bye
--------------
alter cluster a add t1
--------------
--------------
join cluster a at '127.0.0.1:31112'
--------------
--------------
alter cluster a add t3
--------------
--------------
alter cluster a add t2
--------------
--------------
join cluster a at '127.0.0.1:31112'
--------------
Query OK, 0 rows affected
Bye
Query OK, 0 rows affected
Bye
Query OK, 0 rows affected
Bye
Query OK, 0 rows affected
Bye
Query OK, 0 rows affected
Bye
0
--------------
show status like '%_status'
--------------
+-------------------------------+---------+
| Counter | Value |
+-------------------------------+---------+
| command_status | 12 |
| cluster_a_status | primary |
| cluster_a_flow_control_status | OFF |
| cluster_a_ist_receive_status | |
+-------------------------------+---------+
4 rows in set (0.00 sec)
Bye
--------------
show status like '%_node_state'
--------------
+----------------------+--------+
| Counter | Value |
+----------------------+--------+
| cluster_a_node_state | synced |
+----------------------+--------+
1 row in set (0.00 sec)
Bye
--------------
show status like '%_local_state_comment'
--------------
+-------------------------------+--------+
| Counter | Value |
+-------------------------------+--------+
| cluster_a_local_state_comment | Synced |
+-------------------------------+--------+
1 row in set (0.00 sec)
Bye
--------------
show status like '%_status'
--------------
+-------------------------------+---------+
| Counter | Value |
+-------------------------------+---------+
| command_status | 10 |
| cluster_a_status | primary |
| cluster_a_flow_control_status | OFF |
| cluster_a_ist_receive_status | |
+-------------------------------+---------+
4 rows in set (0.00 sec)
Bye
--------------
show status like '%_node_state'
--------------
+----------------------+--------+
| Counter | Value |
+----------------------+--------+
| cluster_a_node_state | synced |
+----------------------+--------+
1 row in set (0.00 sec)
Bye
--------------
show status like '%_local_state_comment'
--------------
+-------------------------------+--------+
| Counter | Value |
+-------------------------------+--------+
| cluster_a_local_state_comment | Synced |
+-------------------------------+--------+
1 row in set (0.00 sec)
Bye
--------------
show status like '%_status'
--------------
+-------------------------------+---------+
| Counter | Value |
+-------------------------------+---------+
| command_status | 11 |
| cluster_a_status | primary |
| cluster_a_flow_control_status | OFF |
| cluster_a_ist_receive_status | |
+-------------------------------+---------+
4 rows in set (0.00 sec)
Bye
--------------
show status like '%_node_state'
--------------
+----------------------+--------+
| Counter | Value |
+----------------------+--------+
| cluster_a_node_state | synced |
+----------------------+--------+
1 row in set (0.00 sec)
Bye
--------------
show status like '%_local_state_comment'
--------------
+-------------------------------+--------+
| Counter | Value |
+-------------------------------+--------+
| cluster_a_local_state_comment | Synced |
+-------------------------------+--------+
1 row in set (0.00 sec)
Bye
Please provide a script that reproduces this issue on the master version, or check the master version and close the issue.
root@9ade6f37ac5d:/# cat /var/log/manticore/searchd2.log
[Mon Jan 8 10:53:01.968 2024] [1122] watchdog: main process 1123 forked ok
[Mon Jan 8 10:53:01.969 2024] [1123] starting daemon version '6.2.13 ea9081bb3@240108 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204)' ...
[Mon Jan 8 10:53:01.969 2024] [1123] listening on 127.0.0.1:29312 for sphinx and http(s)
[Mon Jan 8 10:53:01.969 2024] [1123] listening on 127.0.0.1:29306 for mysql
[Mon Jan 8 10:53:01.969 2024] [1123] listening on 127.0.0.1:29308 for sphinx and http(s)
[Mon Jan 8 10:53:02.007 2024] [1127] prereading 0 tables
[Mon Jan 8 10:53:02.007 2024] [1127] preread 0 tables in 0.000 sec
[Mon Jan 8 10:53:02.009 2024] [1123] accepting connections
[Mon Jan 8 10:53:02.037 2024] [1130] [BUDDY] started v2.0.9 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://127.0.0.1:29312 --threads=12' at http://127.0.0.1:35245
[Mon Jan 8 10:53:02.037 2024] [1130] [BUDDY] Loaded plugins:
core: empty-string, backup, emulate-elastic, insert, alias, select, show, cli-table, plugin, test, insert-mva, create-table
local:
extra:
[Mon Jan 8 10:53:02.230 2024] [1135] WARNING: No persistent state found. Bootstraping with default state
[Mon Jan 8 10:53:03.243 2024] [1225] WARNING: Member 1.0 (node_127.0.0.1_a_1135) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[Mon Jan 8 10:53:03.274 2024] [1128] FATAL: 'a' cluster [127.0.0.1:57928], cmd: 7, error: cluster 'a' is not ready, current state is joining
[Mon Jan 8 10:53:03.775 2024] [1132] FATAL: 'a' cluster [127.0.0.1:57946], cmd: 7, error: cluster 'a' is not ready, current state is joining
root@9ade6f37ac5d:/# cat /var/log/manticore/searchd3.log
[Mon Jan 8 10:53:02.083 2024] [1167] watchdog: main process 1168 forked ok
[Mon Jan 8 10:53:02.084 2024] [1168] starting daemon version '6.2.13 ea9081bb3@240108 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204)' ...
[Mon Jan 8 10:53:02.085 2024] [1168] listening on 127.0.0.1:39312 for sphinx and http(s)
[Mon Jan 8 10:53:02.085 2024] [1168] listening on 127.0.0.1:39306 for mysql
[Mon Jan 8 10:53:02.085 2024] [1168] listening on 127.0.0.1:39308 for sphinx and http(s)
[Mon Jan 8 10:53:02.123 2024] [1173] prereading 0 tables
[Mon Jan 8 10:53:02.123 2024] [1173] preread 0 tables in 0.000 sec
[Mon Jan 8 10:53:02.125 2024] [1168] accepting connections
[Mon Jan 8 10:53:02.153 2024] [1176] [BUDDY] started v2.0.9 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://127.0.0.1:39312 --threads=12' at http://127.0.0.1:33159
[Mon Jan 8 10:53:02.153 2024] [1176] [BUDDY] Loaded plugins:
core: empty-string, backup, emulate-elastic, insert, alias, select, show, cli-table, plugin, test, insert-mva, create-table
local:
extra:
[Mon Jan 8 10:53:02.230 2024] [1173] WARNING: No persistent state found. Bootstraping with default state
[Mon Jan 8 10:53:03.243 2024] [1226] WARNING: Member 1.0 (node_127.0.0.1_a_1135) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[Mon Jan 8 10:53:04.277 2024] [1180] WARNING: '127.0.0.1:29312': remote error: [127.0.0.1] cluster 'a' is not ready, current state is joining
I have modified the script and the issue still persists: the test fails, and I can still see FATALs in the log file that we probably should not see at all, given the blocking nature of cluster commands.
Here is the modified script:
# stop any running daemons and clear old logs
pgrep -f searchd | xargs kill -9
rm -f /var/log/manticore/searchd*.log
# generate a config and start a daemon for each of the 3 nodes
for i in 1 2 3; do
lib_dir="/var/lib/manticore$i"
rm -fr $lib_dir
mkdir -p $lib_dir
cat > manticore$i.conf <<EOF
searchd {
listen = 127.0.0.1:${i}9312
listen = 127.0.0.1:${i}9306:mysql
listen = 127.0.0.1:${i}9308:http
log = /var/log/manticore/searchd${i}.log
query_log = /var/log/manticore/query${i}.log
pid_file = /var/run/manticore/searchd${i}.pid
data_dir = $lib_dir
}
EOF
searchd -c manticore$i.conf
done
mysql -h0 -P19306 -vv -e 'create cluster a'
mysql -h0 -P19306 -vv -e "create table t1; create table t2; create table t3;"
mysql -h0 -P19306 -vv -e "show tables"
# join nodes 2 and 3 and add tables to the cluster concurrently (the race we are testing)
mysql -h0 -P29306 -vv -e "join cluster a at '127.0.0.1:19312'" &
mysql -h0 -P19306 -vv -e "alter cluster a add t1" &
mysql -h0 -P39306 -vv -e "join cluster a at '127.0.0.1:19312'" &
mysql -h0 -P19306 -vv -e "alter cluster a add t2" &
mysql -h0 -P19306 -vv -e "alter cluster a add t3" &
wait 1>/dev/null 2>&1; echo $?
mysql -h0 -P19306 -vv -e "show status like '%_status'"
mysql -h0 -P19306 -vv -e "show status like '%_node_state'"
mysql -h0 -P19306 -vv -e "show status like '%_local_state_comment'"
mysql -h0 -P29306 -vv -e "show status like '%_status'"
mysql -h0 -P29306 -vv -e "show status like '%_node_state'"
mysql -h0 -P29306 -vv -e "show status like '%_local_state_comment'"
mysql -h0 -P39306 -vv -e "show status like '%_status'"
mysql -h0 -P39306 -vv -e "show status like '%_node_state'"
mysql -h0 -P39306 -vv -e "show status like '%_local_state_comment'"
cat /var/log/manticore/searchd*log | grep FATAL
I'm going to rename the FATAL in replication into a WARNING that carries the same message with the error description, just not at FATAL verbosity.
OK, I see that multiple simultaneous joins prevent the cluster from updating its node list correctly; investigating the fix.
Should be fixed in the current master: https://github.com/manticoresoftware/manticoresearch/commit/a2af06ca36abcbf893e43c8803853130b397c54b. I fixed the warning and the failure that occurred when multiple nodes join the cluster at the same time.
I attempted to run the tests with the most recent version of the test kit, but it seems it is still failing, not every time, but quite frequently.
I've also noticed it now shows a different error: it looks like it can't connect to execute the MySQL query, which could mean the daemon is crashing.
I tried to run the CLT test on the dev box from manticore/test/clt-tests, but it failed as there is no clt folder or executable:
stas@dev2:~/manticore/test/clt-tests$ clt/clt test -d -t test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
-bash: clt/clt: No such file or directory
127
I need instructions on how to run the test on the dev box and how to run it with the proper daemon binary.
there is no clt folder or executable
Have you tried https://github.com/manticoresoftware/clt?tab=readme-ov-file#installation ?
no I will try it now
After following the CLT installation instructions I can run clt fine. However, it is still unclear where create-table-5-cluster-60-shards.rec is located; the master branch does not have that test.
stas@dev2:~/build/clt$ ./clt test -d -t ~/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
The record file does not exist: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec
1
There are a lot of different clt and test branches, and I am not sure which one I should check out. It is also not clear how to change the daemon CLI during the test run and how to use my own daemon build to investigate the issue.
However, it is still unclear where create-table-5-cluster-60-shards.rec is located
https://github.com/manticoresoftware/manticoresearch/blob/tests/sharding/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec
It's in the tests/sharding branch, as was mentioned in this comment before: https://github.com/manticoresoftware/manticoresearch/issues/1485#issuecomment-1876757281
I downloaded https://github.com/manticoresoftware/manticoresearch/blob/tests/sharding/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec into manticore/test/clt-tests/sharding/replication and still see the error about the missing rec file:
stas@dev2:~/build/clt$ ./clt test -d -t ~/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
Replaying data from the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec
The replay result will be stored to the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rep
rec: No such file or directory (os error 2)
1
However, the file exists:
stas@dev2:~/build/clt$ ls -lh /home/stas/manticore/test/clt-tests/sharding/replication/
total 28K
-rw-rw-r-- 1 stas stas 12K Jan 16 08:57 create-table-5-cluster-60-shards.rec
-rw-rw-r-- 1 stas stas 2 Jan 16 08:59 create-table-5-cluster-60-shards.rep
-rw-rw-r-- 1 stas stas 584 Jan 9 10:58 init.recb
-rw-rw-r-- 1 stas stas 7.8K Jan 9 10:58 replication-test.rec
We need to clone the whole branch, because the test requires extra files (reusable blocks) to run. Try to check out the tests/sharding branch and run the test with CLT the same way.
I switched to the tests/sharding branch in ~/manticore/:
stas@dev2:~/manticore$ git rev-parse --abbrev-ref HEAD
sharding
stas@dev2:~/manticore$ git log -n1
commit a993c920fefe8beb2dd4011b69c3950ee7097909 (HEAD -> sharding)
Author: Don Hardman <[email protected]>
Date: Mon Jan 15 14:05:06 2024 +0700
Fix regex to validate versions with KNN included
Then I went into the clt folder that I checked out, ran the command ./clt test -d -t ~/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $? there, and got the same error:
stas@dev2:~/build/clt$ ./clt test -d -t ~/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
Replaying data from the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec
The replay result will be stored to the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rep
rec: No such file or directory (os error 2)
1
I also tried to check out clt into manticoresearch and use it from there:
stas@dev2:~/manticore$ git clone https://github.com/manticoresoftware/clt.git
Cloning into 'clt'...
I also used full paths, but it returns another error:
stas@dev2:~/manticore$ clt/clt test -d -t /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
Replaying data from the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec
The replay result will be stored to the file: /home/stas/manticore/test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rep
Directory with tests does not exist: tests
1
The test should be run exactly with the command specified in the body of this task, and we should run it from the manticoresearch directory. Detailed instructions to run it from scratch:
git clone https://github.com/manticoresoftware/manticoresearch.git
cd manticoresearch
git checkout tests/sharding
git clone https://github.com/manticoresoftware/clt.git
clt/clt test -d -t test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
The initial error was that Buddy created multiple clusters and one of the clusters had a duplicated name and path, or maybe Buddy retried and created the same cluster while the same cluster operation was still in progress. That was fixed in master: the daemon now handles every create / delete / alter cluster operation one by one.
However, the test create-table-5-cluster-60-shards.rec still produces an error:
create table c:t(id bigint) shards=60 rf=3 # error=Waiting timeout exceeded.
Not quite sure how to investigate this further, as I tried:
- setting different timeouts at the searchd, which does not help (see the sketch below for the kind of settings I mean)
I also do not see any cluster-related errors in the daemon log.
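For context, here is a rough sketch of the kind of searchd timeout settings I mean; whether any of these is the relevant knob for the "Waiting timeout exceeded" error is an assumption on my side, and the values are arbitrary examples:
searchd {
    ...
    # assumed candidates only, arbitrary example values
    network_timeout = 30            # seconds
    client_timeout = 900            # seconds
    agent_connect_timeout = 10000   # milliseconds
    agent_query_timeout = 60000     # milliseconds
}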
I've resolved the major issue with the Buddy race condition that sometimes caused duplicate queries. However, I've noticed there's still a problem with some queries when we run tests now.
/var/log/manticore-1/searchd.log:[Mon Feb 5 17:50:27.376 2024] [471] [BUDDY] [127.0.0.1:19312] Queue query result [error]: {"error":"error at ToExecuteStart, code 3 (transaction aborted, server can continue), seqno -1"}
This issue occurs quite frequently, leading to sharding failures.
When validating the issue, we can grep for the word "Queue" to check the full log while Buddy runs in debug mode (see the example after the script below); the reproduction script is the same:
git clone https://github.com/manticoresoftware/manticoresearch.git
cd manticoresearch
git checkout tests/sharding
git clone https://github.com/manticoresoftware/clt.git
clt/clt test -d -t test/clt-tests/sharding/replication/create-table-5-cluster-60-shards.rec ghcr.io/manticoresoftware/manticoresearch:test-kit-latest; echo $?
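A grep along these lines (log paths assumed from the 5-node setup described earlier) shows every query Buddy queued and its result:
# show all queued Buddy queries and their results across the 5 nodes
grep 'Queue' /var/log/manticore-{1,2,3,4,5}/searchd.log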
Blocked by https://github.com/manticoresoftware/manticoresearch/issues/1802
Should be fixed at https://github.com/manticoresoftware/manticoresearch/commit/b5fee2128154e506089e8d1e2200fe5038088785; we need to make sure the CLT test uses the latest daemon binary.
The issue continues and has become more frequent. Now the same error happens every time I start a new run.
––– input –––
mysql -h0 -P19306 -e "create table c:t(id bigint) shards=60 rf=3"
––– output –––
+ ERROR 1064 (42000) at line 1: Waiting timeout exceeded.
––– input –––
grep '\[error\]' /var/log/manticore-{1,2,3,4,5}/searchd.log
––– output –––
+ /var/log/manticore-1/searchd.log:[Mon Feb 19 03:44:09.335 2024] [467] [BUDDY] [127.0.0.1:19312] Queue query result [error]: {"error":"error at ToExecuteStart, code 3 (transaction aborted, server can continue), seqno -1"}
Is this a common error that we can handle, or is there still something wrong with concurrency?
Version used:
starting daemon version '6.2.13 56aaf1f55@24021713 dev (columnar 2.2.5 8c90c1f@240217) (secondary 2.2.5 8c90c1f@240217) (knn 2.2.5 8c90c1f@240217)' ...
Should be fixed at https://github.com/manticoresoftware/manticoresearch/commit/da1d7dc198b8bff80ff62c020ebd0130d941a605; we need to make sure the CLT test uses the latest daemon binary. After the fix I do not see errors any more. However, the test itself still shows differences.