
Flaky test counts between March 20 and April 20

Open hanefi opened this issue 2 years ago • 4 comments

This issue aims to capture the current state of the flaky tests, and coordinate fixes.

I created a summary of the failures on the master branch over the last 30 days. Each test name below is followed, in parentheses, by its number of failures in that period.

Regression tests:

  • [x] distributed_triggers (17) #5894 #5896
  • [x] global_cancel (6) #5948
  • [ ] rollback_to_savepoint (5)

Isolation tests:

  • [x] isolation_replicate_reference_tables_to_coordinator (21) #5900
  • [x] isolation_select_vs_all_on_mx (8) #5910 #5939
  • [x] isolation_drop_alter_index_select_for_update_on_mx (4) #5910 #5939
  • [x] isolation_reference_copy_vs_all (1) #5910 #5939
  • [x] isolation_hash_copy_vs_all (1) #5910 #5939
  • [x] isolation_master_update_node_1 (1) #5913

Failure tests:

  • [ ] failure_setup (5)
  • [ ] failure_insert_select_repartition (5)
  • [ ] failure_single_select (3)
  • [ ] failure_connection_establishment (2)
  • [ ] failure_create_distributed_table_non_empty (1)

Scripts:

  • [ ] check-merge-to-enterprise (6)

Apr 21 '22 01:04 hanefi

tableam is listed in https://github.com/citusdata/citus/issues/5569 but not here. Is it resolved?

Aug 02 '22 09:08 cstarc

> tableam is listed in #5569 but not here. Is it resolved?

I guess we did not see any flakiness for that test in this time period. Either it was fixed somehow, or we were quite lucky. I cannot really tell which one it is :)

Aug 03 '22 15:08 hanefi

> > tableam is listed in #5569 but not here. Is it resolved?
>
> I guess we did not see any flakiness for that test in this time period. Either it was fixed somehow, or we were quite lucky. I cannot really tell which one it is :)

 RESET client_min_messages;
 delete from test_ref;
 WARNING:  fake_scan_getnextslot
 DETAIL:  from localhost:57637
+WARNING:  fake_scan_getnextslot
+DETAIL:  from localhost:57638
 ERROR:  fake_tuple_delete not implemented
 CONTEXT:  while executing command on localhost:57638

The Citus version we are using is 10.2. After analyzing the diff, I think this failure may be caused by concurrent execution of the delete on the data nodes. The SQL (delete from test_ref) is sent to 57637 first, then to 57638. If 57637 executes it slowly, the coordinator (cn) receives the 'DETAIL: from localhost:57638' message before the 'ERROR: fake_tuple_delete not implemented' message. If 57637 executes it quickly, **the cn receives the 'ERROR: fake_tuple_delete not implemented' message before the 'DETAIL: from localhost:57638' message**, and the statement then ends (because of the error) without ever receiving the 'DETAIL: from localhost:57638' message.

Aug 05 '22 03:08 cstarc

It may be helpful to force sequential execution on tableam to get deterministic outputs. We should get the error message right after running the query on the first shard.

SET citus.multi_shard_modify_mode TO 'sequential';

We use this trick in several places in the test suite.
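
A minimal sketch of that suggestion applied to the failing statement from the diff above. It assumes the statement sits in the tableam regression test and that the setting should be restored afterwards so later statements keep the default mode; neither detail is confirmed in this thread.

```sql
-- Assumption: this wraps the existing "delete from test_ref" in the tableam test.
-- Ask Citus to run multi-shard modifications sequentially instead of in parallel.
SET citus.multi_shard_modify_mode TO 'sequential';

delete from test_ref;

-- Restore the session default so the rest of the test file is unaffected.
RESET citus.multi_shard_modify_mode;
```

The expected output file would still need to be regenerated and checked, since which WARNING lines appear before the ERROR depends on what the sequential run actually prints.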

Aug 11 '22 23:08 hanefi

Probably not applicable anymore as we don't see any more occurrences of them.

Nov 24 '23 09:11 onurctirtir