Flaky test counts between March 20 and April 20
This issue aims to capture the current state of the flaky tests and to coordinate fixes.
I created a summary of the failures on the master branch over the last 30 days. Each test name is followed, in parentheses, by its number of failures in that period.
Regression tests:
- [x] distributed_triggers (17) #5894 #5896
- [x] global_cancel (6) #5948
- [ ] rollback_to_savepoint (5)
Isolation tests:
- [x] isolation_replicate_reference_tables_to_coordinator (21) #5900
- [x] isolation_select_vs_all_on_mx (8) #5910 #5939
- [x] isolation_drop_alter_index_select_for_update_on_mx (4) #5910 #5939
- [x] isolation_reference_copy_vs_all (1) #5910 #5939
- [x] isolation_hash_copy_vs_all (1) #5910 #5939
- [x] isolation_master_update_node_1 (1) #5913
Failure tests:
- [ ] failure_setup (5)
- [ ] failure_insert_select_repartition (5)
- [ ] failure_single_select (3)
- [ ] failure_connection_establishment (2)
- [ ] failure_create_distributed_table_non_empty (1)
Scripts:
- [ ] check-merge-to-enterprise (6)
tableam appears in https://github.com/citusdata/citus/issues/5569 but not here. Has it been resolved?
I guess we did not see any flakiness in this time period for that test. Either it was fixed somehow, or we were quite lucky. I cannot really tell which one it is :)
RESET client_min_messages;
delete from test_ref;
WARNING: fake_scan_getnextslot
DETAIL: from localhost:57637
+WARNING: fake_scan_getnextslot
+DETAIL: from localhost:57638
ERROR: fake_tuple_delete not implemented
CONTEXT: while executing command on localhost:57638
We are using Citus 10.2. After analyzing this failure, I think it may be caused by concurrent execution of the delete on the data nodes. The statement (delete from test_ref) is sent to 57637 first, then to 57638. If 57637 executes it slowly, the coordinator node (cn) receives the 'DETAIL: from localhost:57638' message before the 'ERROR: fake_tuple_delete not implemented' message. If 57637 executes it quickly, **the cn receives the 'ERROR: fake_tuple_delete not implemented' message before the 'DETAIL: from localhost:57638' message**, and the statement then ends because of the error without ever receiving the 'DETAIL: from localhost:57638' message.
It may be helpful to force sequential execution in the tableam test to get deterministic output. With sequential execution, we should get the error message right after running the query on the first shard.
SET citus.multi_shard_modify_mode TO 'sequential';
We use this trick in several places in the test suite.
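As a rough sketch of how this might look around the flaky statement (assuming it is the `delete from test_ref` from the diff above; where exactly the setting is scoped and reset in the test file is my assumption, not something taken from the actual test):

```sql
-- Sketch only: force the coordinator to modify shard placements one at a time,
-- so the first error ends the statement before any other worker reports back.
SET citus.multi_shard_modify_mode TO 'sequential';

-- The statement whose WARNING/DETAIL output varied between runs.
delete from test_ref;

-- Go back to the session default (parallel) for the rest of the test.
RESET citus.multi_shard_modify_mode;
```

Scoping the setting to just this statement keeps the rest of the test running in the default parallel mode, so only the nondeterministic output changes.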
Probably not applicable anymore, as we no longer see any occurrences of these failures.