[CI] o.o.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery

Open nknize opened this issue 3 years ago • 4 comments

Failed on an unrelated PR, #1742. Not reproducible locally. Opening this issue to track whether the test continues to fail.

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery" -Dtests.seed=1183E5842BAA4635 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=ja-JP -Dtests.timezone=Pacific/Kwajalein -Druntime.java=17
org.opensearch.gateway.RecoveryFromGatewayIT > testReuseInFileBasedPeerRecovery FAILED
    java.lang.AssertionError: shard [test][0] on node [node_t1] has pending operations:
     --> RetentionLeaseBackgroundSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=1468, leases={peer_recovery/_23_6236SQekFQ4X2S5HWQ=RetentionLease{id='peer_recovery/_23_6236SQekFQ4X2S5HWQ', retainingSequenceNumber=1333, timestamp=1639665151286, source='peer recovery'}, peer_recovery/txItgQoGQoyJSR_SvgGAIQ=RetentionLease{id='peer_recovery/txItgQoGQoyJSR_SvgGAIQ', retainingSequenceNumber=1333, timestamp=1639665151286, source='peer recovery'}}}, shardId=[test][0], timeout=1m, index='test', waitForActiveShards=0}
    	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:248)
    	at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3231)
    	at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1117)
    	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:434)
    	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    	at org.opensearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:378)
    	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:91)
    	at org.opensearch.transport.TransportService$8.doRun(TransportService.java:944)
    	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
    	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    	at java.base/java.lang.Thread.run(Thread.java:833)
        at __randomizedtesting.SeedInfo.seed([1183E5842BAA4635:471B0010ABF29B6E]:0)
        at org.opensearch.test.InternalTestCluster.lambda$assertNoPendingIndexOperations$12(InternalTestCluster.java:1434)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1060)
        at org.opensearch.test.InternalTestCluster.assertNoPendingIndexOperations(InternalTestCluster.java:1421)
        at org.opensearch.test.InternalTestCluster.beforeIndexDeletion(InternalTestCluster.java:1349)
        at org.opensearch.test.OpenSearchIntegTestCase.beforeIndexDeletion(OpenSearchIntegTestCase.java:636)

nknize, Dec 16 '21 17:12

Another occurrence: PR #2026

Gradle log

dreamer-89, Feb 09 '22 15:02

https://github.com/opensearch-project/OpenSearch/pull/2069#issuecomment-1032857800

dblock, Feb 14 '22 18:02

It doesn't look like this flaky test failure has been referenced since April. That said, I could not find any explicit fix for this test that would suggest the issue has been resolved. Should we close this issue and assume it was fixed along the way?

kartg, Aug 02 '22 22:08

The real friends are the tests we made along the way ;)

I'm for shooting it and seeing if it reappears, but interested to hear what other folks think.

CEHENKLE, Aug 03 '22 00:08

Ran this test 1000 times in isolation and was not able to reproduce the failure. Closing, as there have been no occurrences since April.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery" -Dtests.seed=1183E5842BAA4635 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=ja-JP -Dtests.timezone=Pacific/Kwajalein -Dtests.iters=1000 

> Configure project :qa:os
Cannot add task 'destructiveDistroTest.docker' as a task with that name already exists.
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 7.6
  OS Info               : Linux 5.4.225-139.416.amzn2int.x86_64 (amd64)
  JDK Version           : 17 (OpenJDK)
  JAVA_HOME             : /opt/jdk-17
  Random Testing Seed   : 1183E5842BAA4635
  In FIPS 140 mode      : false
=======================================

> Task :server:internalClusterTest
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/local/home/jpalis/repos/flaky-tests/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/local/home/jpalis/.gradle/wrapper/dists/gradle-7.6-all/9f832ih6bniajn45pbmqhk2cw/gradle-7.6/lib/plugins/gradle-testing-base-7.6.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 17m 56s
43 actionable tasks: 1 executed, 42 up-to-date

joshpalis, Dec 28 '22 22:12