
chunks_produced_and_distributed_2_vals_per_shard might be flaky

Open matklad opened this issue 3 years ago • 5 comments

https://buildkite.com/nearprotocol/nearcore/builds/16730#018119ad-1238-4dea-92c9-e9b08e5a1de6/6-4512

And I think I've seen it fail a couple of times locally.

matklad avatar May 31 '22 10:05 matklad

Failed the second time around today:

https://buildkite.com/nearprotocol/nearcore/builds/16762#01811afd-d2c1-458c-b8ed-68eb0eb2874a/6-3826

matklad avatar May 31 '22 17:05 matklad

Failed again. Bumping priority to high, since flakes are bad.

matklad avatar Jun 01 '22 12:06 matklad

Some progress: it seems that this is some test interaction problem. Specifically, I can only reproduce this when I run both network and client tests. My current best repro is:

  1. Apply this diff:
diff --git a/integration-tests/src/tests/client/mod.rs b/integration-tests/src/tests/client/mod.rs
index bc9acdfa8..9fb1feb1c 100644
--- a/integration-tests/src/tests/client/mod.rs
+++ b/integration-tests/src/tests/client/mod.rs
@@ -1,9 +1,9 @@
-mod challenges;
+// mod challenges;
 mod chunks_management;
-mod process_blocks;
-mod runtimes;
-#[cfg(feature = "sandbox")]
-mod sandbox;
-mod sharding_upgrade;
-#[cfg(feature = "test_features")]
-mod shards_manager;
+// mod process_blocks;
+// mod runtimes;
+// #[cfg(feature = "sandbox")]
+// mod sandbox;
+// mod sharding_upgrade;
+// #[cfg(feature = "test_features")]
+// mod shards_manager;
diff --git a/integration-tests/src/tests/mod.rs b/integration-tests/src/tests/mod.rs
index a59caa24e..6ad301989 100644
--- a/integration-tests/src/tests/mod.rs
+++ b/integration-tests/src/tests/mod.rs
@@ -1,10 +1,10 @@
 mod client;
-mod nearcore;
+// mod nearcore;
 mod network;
-mod runtime;
-mod standard_cases;
-mod test_catchup;
-mod test_errors;
-mod test_overflows;
-mod test_simple;
-mod test_tps_regression;
+// mod runtime;
+// mod standard_cases;
+// mod test_catchup;
+// mod test_errors;
+// mod test_overflows;
+// mod test_simple;
+// mod test_tps_regression;
  2. Run n 100 cargo t --features nightly -p integration-tests --lib in one terminal.
  3. In a separate terminal, hammer the CPU with an unrelated Rust compilation.
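
For anyone without a repeat helper in their shell, here is a rough Rust sketch of the "run it many times until it breaks" step. It assumes `n 100 ...` simply means "run the command 100 times", and that `cargo t` is the usual alias for `cargo test`. The names `first_failure` and `cargo_test_passes` are made up for illustration and are not part of nearcore.

```rust
use std::process::Command;

/// Calls `run_once` up to `attempts` times and returns the 1-based index of
/// the first run that reported failure, or `None` if every run passed.
fn first_failure(attempts: u32, mut run_once: impl FnMut() -> bool) -> Option<u32> {
    (1..=attempts).find(|_| !run_once())
}

/// Hypothetical helper: one invocation of the test command from step 2.
/// Returns `true` only if cargo was spawned and exited successfully.
fn cargo_test_passes() -> bool {
    Command::new("cargo")
        .args(["test", "--features", "nightly", "-p", "integration-tests", "--lib"])
        .status()
        .map(|status| status.success())
        .unwrap_or(false) // failing to spawn cargo counts as a failed run
}

// Intended use, from the repository root:
//     match first_failure(100, cargo_test_passes) {
//         Some(run) => eprintln!("flaked on run {run}"),
//         None => eprintln!("no failure in 100 runs"),
//     }
```

Stopping at the first failing run keeps the failing output at the bottom of the terminal, which is usually what you want when chasing a flake.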

matklad avatar Jun 01 '22 13:06 matklad

Plot twist -- it seems that maybe it's just the network tests which fail, and then chunks_produced_and_distributed_2_vals_per_shard somehow fails due to the panic? (we do some funky stuff with panics in run_actix)

matklad avatar Jun 01 '22 14:06 matklad

Indeed! If I comment out the network tests, this test no longer fails. So I think this and https://github.com/near/nearcore/issues/6935 are exactly the same issue. Still not sure why...

matklad avatar Jun 01 '22 15:06 matklad

Could be relevant: chunks_produced_and_distributed_one_val_shard_cop has been flaky as well for a while. It could be somehow related to the introduction of flat storage.

Some evidence:

  • https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/Flaky.20chunks_produced_and_distributed_2_vals_per_shard/near/307396948
  • https://buildkite.com/nearprotocol/nearcore/builds/23507#018539c8-23de-452e-88be-f8b3fb0cbf64

Longarithm avatar Dec 22 '22 12:12 Longarithm

The solution was to introduce HEAVY_TESTS_LOCK and mark these tests as expensive_tests, which guarantees that all network messages are delivered in time. A more long-term solution is to write such tests in the anti-flaky TestLoop.
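
For readers unfamiliar with the pattern, here is a minimal sketch of what a process-wide lock for heavy tests can look like. It is not the actual nearcore code: only the name HEAVY_TESTS_LOCK comes from the comment above. The helpers `heavy_test_guard` and `run_heavy_test`, the poison recovery, and the exact `cfg_attr` spelling of the expensive_tests gate are all assumptions made for illustration.

```rust
use std::sync::{Mutex, MutexGuard};

/// Stand-in for the lock mentioned above; the real definition may differ.
static HEAVY_TESTS_LOCK: Mutex<()> = Mutex::new(());

/// Blocks until no other heavy test in this process holds the lock.
fn heavy_test_guard() -> MutexGuard<'static, ()> {
    // A test that panics while holding the lock poisons it. Recover the
    // guard so that one failing test does not fail every later heavy test.
    HEAVY_TESTS_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Runs `body` while holding the lock, so heavy tests execute one at a time
/// and do not compete with each other for CPU.
fn run_heavy_test<T>(body: impl FnOnce() -> T) -> T {
    let _guard = heavy_test_guard();
    body()
}

// Assumed gating: ignored unless built with `--features expensive_tests`.
#[test]
#[cfg_attr(not(feature = "expensive_tests"), ignore)]
fn heavy_test_example() {
    run_heavy_test(|| {
        // ... start nodes, exchange messages, assert on the outcome ...
    });
}
```

The lock only serializes tests that opt in within the same test binary. Anything else running on the machine can still slow message delivery, which is presumably why moving to TestLoop is the longer-term plan.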

Longarithm avatar Jun 14 '24 18:06 Longarithm