index-management
index-management copied to clipboard
Flaky tests
Describe the bug
Note Please first search in this page for the same failure before recording yours
This issue is to record flaky tests we are seeing. Most of time, your PR is not the reason of the test failure. So you can just record the flaky test failure here and we will dig into it later.
Report like this:
Refer to this comment https://github.com/opensearch-project/index-management/issues/90#issuecomment-1378086112 You can download the log of failed run from the bottom of the summary of the workflow.
How to debug the flaky test
Flaky test is hard to understand at the first look, considering we may not have enough log. And it's also not easy to reproduce (race condition, environment dependent).
An efficient way to handle these:
- Go into the failed run, based on the test failures, you should be able to relate to the source code part.
- Download the log associated with the failed run, locating the logs in
integTest.log
when this test was run. - If we don't have enough log to understand how it could happen, add more DEBUG log into the source code that would be enough for us to understand when this flaky test shows up next time. Change test log level to debugging like this. So we can see more info in the cluster log.
- You can also add log into the test code, so we can see more info in the test log.
Script to reproduce the flaky
This script works on Linux system
script ../flakyoutput.txt # use shell command script to record the terminal output into a file
for i in {1..10}; do
echo "=== Run time $i ==="
./gradlew clean && ./gradlew integTest --tests "*IndexManagementIndicesIT" --tests "*MetadataRegressionIT" --tests "*ActionRetryIT" -PnumNodes=3
retVal=$?
if [ $retVal -eq 1 ]; then
echo "Flaky test found in run time $i"
break
fi
done
ctrl+d # end shell command script
RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter
error
2021-08-02T17:31:54.8424115Z REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter" -Dtests.seed=10028AB61748958A -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=America/Vancouver -Druntime.java=14
2021-08-02T17:31:54.8429763Z org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT > test change policy API should only apply to indices in the state filter FAILED
2021-08-02T17:31:54.8432504Z java.lang.AssertionError: expected:<false> but was:<true>
2021-08-02T17:31:54.8433481Z at __randomizedtesting.SeedInfo.seed([10028AB61748958A:1A26BFFFDEAC54B6]:0)
2021-08-02T17:31:54.8434355Z at org.junit.Assert.fail(Assert.java:89)
2021-08-02T17:31:54.8435156Z at org.junit.Assert.failNotEquals(Assert.java:835)
2021-08-02T17:31:54.8436056Z at org.junit.Assert.assertEquals(Assert.java:120)
2021-08-02T17:31:54.8437322Z at org.junit.Assert.assertEquals(Assert.java:146)
2021-08-02T17:31:54.8441710Z at org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.assertAffectedIndicesResponseIsEqual(IndexStateManagementRestTestCase.kt:741)
2021-08-02T17:31:54.8448231Z at org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter(RestChangePolicyActionIT.kt:496)
IndexManagementIndicesIT.test update management index mapping with new schema version
I saw this flakey failure on 3/5 multi node runs I did Workflow link
org.opensearch.indexmanagement.IndexManagementIndicesIT > test update management index mapping with new schema version FAILED
org.opensearch.client.ResponseException: method [DELETE], host [http://[::1]:39473], URI [/.ds-test-data-stream-000001], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"}],"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"},"status":400}
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version" -Dtests.seed=BA2DF08B4E74969 -Dtests.security.manager=false -Dtests.locale=es-US -Dtests.timezone=Pacific/Niue -Druntime.java=14
at __randomizedtesting.SeedInfo.seed([BA2DF08B4E74969:E2E0F3C6EB23FCC8]:0)
at org.opensearch.client.RestClient.convertResponse(RestClient.java:339)
at org.opensearch.client.RestClient.performRequest(RestClient.java:309)
at org.opensearch.client.RestClient.performRequest(RestClient.java:283)
at org.opensearch.indexmanagement.ODFERestTestCase.wipeAllODFEIndices(ODFERestTestCase.kt:139)
at org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version(IndexManagementIndicesIT.kt:80)
RollupInterceptorIT.test roll up search
RollupInterceptorIT.test bucket and sub aggregations have correct values
RollupRunnerIT.test metadata is created for rollup job when none exists
RollupRunnerIT.test no-op execution when a full window of time to rollup is not available
org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT > test roll up search FAILED
java.lang.AssertionError: Rollup job doesn't have metadata set
at __randomizedtesting.SeedInfo.seed([687045F74F56AAF3:C7C066BB95AE41F9]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertNotNull(Assert.java:713)
at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:91)
at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:51)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT.test roll up search(RollupInterceptorIT.kt:89)
TransformRunnerIT.test transform with data filter
errors
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=A19D930A096062ED -Dtests.security.manager=false -Dtests.locale=ru -Dtests.timezone=America/Recife -Druntime.java=14
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
java.lang.AssertionError: Transform job doesn't have metadata set
at __randomizedtesting.SeedInfo.seed([A19D930A096062ED:87A193E3B89B59E1]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertNotNull(Assert.java:713)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:108)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:34)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:106)
IndexStateManagementHistoryIT.test history shard settings
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings" -Dtests.seed=394C4E8F4FAA2B7B -Dtests.security.manager=false -Dtests.locale=mt -Dtests.timezone=Europe/Vaduz -Druntime.java=14
org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT > test history shard settings FAILED
java.lang.AssertionError: Wrong number of shards expected:<2> but was:<1>
at __randomizedtesting.SeedInfo.seed([394C4E8F4FAA2B7B:CA84DAB66688400A]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:452)
at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:50)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings(IndexStateManagementHistoryIT.kt:447)
1> [2023-01-10T21:09:43,896][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] before test
1> [2023-01-10T21:09:43,900][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] initializing REST clients against [http://[::1]:40303, http://127.0.0.1:35475]
1> [2023-01-10T21:10:07,142][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] after test
Time captured is 0 problem, if we see such problem, we can update the >0
to be >=0
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=49642B2182B7B4FC -Dtests.security.manager=false -Dtests.locale=hi-IN -Dtests.timezone=America/Indiana/Vincennes -Druntime.java=17
at __randomizedtesting.SeedInfo.seed([49642B2182B7B4FC:BEAE0232E3E6D91]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
java.lang.AssertionError: Doesn't capture indexed time
at __randomizedtesting.SeedInfo.seed([E76B690D887C5E11:A5E5A20F24F5877C]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:68)
Test code ref, we should use different index name for different tests.
We saw a number of different flaky test failures in jenkins runs on security disabled domains for the 2.0.1 release which all seem to be related to capturing search time. Search time is checked after index time in these integration tests, and indexed into the metadata document at the same time. Perhaps the search result is cached from another test which is making the search time 0 on single node domains.
Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=C156C07169688EB4 -Dtests.security.manager=false -Dtests.locale=th -Dtests.timezone=Asia/Dubai -Druntime.java=17
2> java.lang.AssertionError: Didn't capture search time
at __randomizedtesting.SeedInfo.seed([C156C07169688EB4:83D80B73C5E157D9]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
2> java.lang.AssertionError: Didn't capture search time
at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:2FEAF0F5D4643658]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:108)
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
2> java.lang.AssertionError: Didn't capture search time
at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:4B583B1EC916D439]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period" -Dtests.seed=5BC1719B69F3D965 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=ROK -Druntime.java=17
2> java.lang.AssertionError: Did not spend time searching
at __randomizedtesting.SeedInfo.seed([5BC1719B69F3D965:B377556363CB8BF6]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period(RollupRunnerIT.kt:730)
TransformRunnerIt.test transform with data filter
Added commit that comments the assertion: https://github.com/opensearch-project/index-management/pull/581
java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=D58AE6EFD6EC0421 -Dtests.security.manager=false -Dtests.locale=es-BO -Dtests.timezone=Asia/Qostanay -Druntime.java=17
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
java.lang.AssertionError: Didn't capture search time
at __randomizedtesting.SeedInfo.seed([D58AE6EFD6EC0421:F3B6E60667173F2D]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:111)
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size" -Dtests.seed=7BB5E96EAB38AAE6 -Dtests.security.manager=false -Dtests.locale=cs-CZ -Dtests.timezone=America/Lower_Princes -Druntime.java=17
org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT > test rollover min primary shard size FAILED
java.lang.AssertionError: Found multiple shards over 100kb
at __randomizedtesting.SeedInfo.seed([7BB5E96EAB38AAE6:7EE92CD3306B0DB7]:0)
Suite: Test class org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT
at org.junit.Assert.fail(Assert.java:89)
2> Dub 18, 2022 7:37:30 ODP. org.opensearch.client.RestClient logResponse
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size(RolloverActionIT.kt:287)
2> WARNING: request [PUT http://[::1]:42817/bwc_index-1] returned 1 warnings: [299 OpenSearch-2.0.0-SNAPSHOT-d94ccc2b76e1b6ea07481b30903a12038f5703fd "[index.opendistro.index_state_management.rollover_alias] setting was deprecated in OpenSearch and will be removed in a future release! See the breaking changes documentation for the next major version."]
Add more comments in this PR https://github.com/opensearch-project/index-management/pull/611
java.lang.NullPointerException
RestStopRollupActionIT.test stopping a failed rollup
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11
Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11
2> java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([25E38D5DE3AB3302:41F6E686D5846EF4]:0)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:125)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:123)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:123)
TransformRunnerIT.test continuous transform with wildcard indices
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=4CB2963FF335C01A -Dtests.security.manager=false -Dtests.locale=zh-Hans-CN -Dtests.timezone=Etc/GMT+9 -Druntime.java=11
2> java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([4CB2963FF335C01A:A6D458A846E559FD]:0)
at java.base/java.time.Instant.compareTo(Instant.java:1254)
at java.base/java.time.Instant.isAfter(Instant.java:1271)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:764)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:758)
https://github.com/opensearch-project/index-management/blob/1.x/src/test/kotlin/org/opensearch/indexmanagement/transform/TransformRunnerIT.kt#L764
--tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified" -Dtests.seed=5DCD6D7B8AAB3141 -Dtests.security.manager=false -Dtests.locale=es-MX -Dtests.timezone=Europe/Minsk -Druntime.java=11
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([5DCD6D7B8AAB3141:6E4476B0B23793F7]:0)
at java.base/java.time.Instant.compareTo(Instant.java:1254)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:428)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified(TransformRunnerIT.kt:424)
Looking at some of the recent flaky failures in RestStopRollupActionIT, the flaky failures seemed to be caused by the metadata doc missing, but there weren't enough logs to tell what was going wrong. Next time there is a PR which fails due to this, we should add additional debug logs in the rollup runner and metadata writer. The rollup tests seem particularly flaky.
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup" -Dtests.see d=3BC75D2F7F1C189E -Dtests.security.manager=false -Dtests.locale=
zh-Hans-SG -Dtests.timezone=Pacific/Fakaofo -Druntime.java=14
2> java.lang.AssertionError: MetadataID on rollup was null
at __randomizedtesting.SeedInfo.seed([3BC75D2F7F1C189E:78637077DCF0C390]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertNotNull(Assert.java:713)
at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:109)
at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:107)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup(RestStartRollupActionIT.kt:107)
I had this fail on a multinode cluser and when looking into the logs I saw that it was because the ISM index moved nodes: On node 2:
[2022-05-12T21:44:04,036][INFO ][o.o.j.s.JobSweeper ] [integTest-2] Descheduling jobs of shard 3 index .opendistro-ism-config as the shard is removed from this node.
[2022-05-12T21:44:04,333][INFO ][o.o.j.s.JobScheduler ] [integTest-2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config .
[2022-05-12T21:44:04,469][INFO ][o.o.j.s.JobScheduler ] [integTest-2] Descheduling jobId: restart_failed_rollup
On node 1:
[2022-05-12T21:44:04,543][INFO ][o.o.j.s.JobScheduler ] [integTest-0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config
.... this final descheduling was because the ism index was deleted during cleanup
[2022-05-12T21:44:36,179][INFO ][o.o.j.s.JobScheduler ] [integTest-0] Descheduling jobId: restart_failed_rollup
It seems that some of these multinode flaky failures could be related to https://github.com/opensearch-project/job-scheduler/issues/173
DeleteActionIT.test basic
There could be race condition to check history document after an index gets deleted.
https://github.com/opensearch-project/index-management/blob/main/src/test/kotlin/org/opensearch/indexmanagement/indexstatemanagement/action/DeleteActionIT.kt#L55
[2022-06-06T05:15:20,891][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:20,984][INFO ][o.o.j.s.JobSweeper ] [integTest-0] Descheduling job p8ToeW8xSnaMY9Ju0bjRig on index .opendistro-ism-config
[2022-06-06T05:15:20,985][INFO ][o.o.j.s.JobScheduler ] [integTest-0] Descheduling jobId: p8ToeW8xSnaMY9Ju0bjRig
[2022-06-06T05:15:20,994][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Finished executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:21,004][INFO ][o.o.j.s.JobSweeper ] [integTest-0] Indexing failed for job p8ToeW8xSnaMY9Ju0bjRig#metadata on index .opendistro-ism-config
[2022-06-06T05:15:21,010][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] There was VersionConflictEngineException trying to update the metadata for deleteactionit_index_1. Message: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
org.opensearch.index.engine.VersionConflictEngineException: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
[2022-06-06T05:15:21,011][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] Failed to update ManagedIndexMetaData after executing the Step : attempt_delete
https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED
java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished>
at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)
RestStartTransformActionIT.test starting a finished transform
error
2021-08-02T20:16:09.9041497Z REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform" -Dtests.seed=DD2C384C67FC30BD -Dtests.security.manager=false -Dtests.locale=uk-UA -Dtests.timezone=America/Campo_Grande -Druntime.java=14
2021-08-02T20:16:09.9050653Z
2021-08-02T20:16:09.9828220Z org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT > test starting a finished transform FAILED
2021-08-02T20:16:09.9831015Z java.lang.AssertionError: Status should be finished expected:<finished> but was:<started>
2021-08-02T20:16:09.9832385Z at __randomizedtesting.SeedInfo.seed([DD2C384C67FC30BD:87A478982B41BA5]:0)
2021-08-02T20:16:09.9833459Z at org.junit.Assert.fail(Assert.java:89)
2021-08-02T20:16:09.9834423Z at org.junit.Assert.failNotEquals(Assert.java:835)
2021-08-02T20:16:09.9835510Z at org.junit.Assert.assertEquals(Assert.java:120)
2021-08-02T20:16:09.9838406Z at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:196)
2021-08-02T20:16:09.9845749Z at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:32)
2021-08-02T20:16:09.9854345Z at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:126)
2021-08-02T20:16:09.9857268Z at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:120)
2021-08-02T20:16:09.9860591Z at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform(RestStartTransformActionIT.kt:193)
2021-08-02T20:16:17.8835315Z
test move metadata service
java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>
Update: from the log, everything is working fine, but metadata move cannot find the metadata in the cluster state, I think this is more likely a core problem. And we will remove this test anyway soon.
org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT > test move metadata service FAILED
java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>
at __randomizedtesting.SeedInfo.seed([EBD0D8B7D135C4E0:49A3DDF6C040E639]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:92)
at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:91)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test move metadata service(MetadataRegressionIT.kt:91)
1> [2022-09-08T00:32:56,025][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] before test
1> [2022-09-08T00:32:56,028][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: setting up test
1> [2022-09-08T00:32:56,084][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: all set up test
1> [2022-09-08T00:32:56,140][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
1> [2022-09-08T00:32:56,142][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
1> [2022-09-08T00:32:56,197][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [0] to [-1]
1> [2022-09-08T00:32:56,226][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [-1] to [0]
1> [2022-09-08T00:32:56,631][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] true
1> [2022-09-08T00:32:56,637][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] check if metadata is saved in cluster state: {index_uuid=9VHQYzkVT2y8JVczuACWQQ, rolled_over=false, policy_id=metadataregressionit_testPolicyName_1, policy_seq_no=0, index=metadataregressionit_index_1, policy_primary_term=1, state={"name":"ReplicaCountState","start_time":1234}, retry_info={"failed":false,"consumed_retries":0}, policy_completed=false, info={"message":"Happy moving"}}
1> [2022-09-08T00:33:16,765][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
1> [2022-09-08T00:33:16,766][INFO ][o.o.c.s.ClusterSettings ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
1> [2022-09-08T00:33:16,776][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaning up after test
1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaned up after test
1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] after test
org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
java.lang.AssertionError: Policy sequence numbers didn't match expected:<1> but was:<0>
at __randomizedtesting.SeedInfo.seed([59FF6DB8C59564EE:B1933B2EA2BBF760]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:55)
at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:35)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy(RestExplainSnapshotManagementIT.kt:35)
[2022-09-22T23:45:03,302][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] before test
[2022-09-22T23:46:13,587][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] after test
org.opensearch.indexmanagement.transform.TransformRunnerIT > test continuous transforms with null buckets FAILED
java.lang.AssertionError: Target transform index was not created
at __randomizedtesting.SeedInfo.seed([BCDE8596F1D08F23:9C708CCB43625A8F]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transforms with null buckets$1.invoke(TransformRunnerIT.kt:898)
at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transforms with null buckets$1.invoke(TransformRunnerIT.kt:898)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transforms with null buckets(TransformRunnerIT.kt:898)
test policy backwards compatibility FAILED
Regression from https://github.com/opensearch-project/OpenSearch/pull/4986
The ClusterManagerTaskThrottler
validates during cluster setting update if all the nodes are above 2.4 on the cluster manager node. So during bwc rolling upgrade test, if the master node is on 2.4, it will fail this validation.
> Task :indexmanagementBwcCluster#oldVersionClusterTask1
> Task :indexmanagementBwcCluster#fullRestartClusterTask
> Task :indexmanagementBwcCluster#oldVersionClusterTask0
> Task :indexmanagementBwcCluster#mixedClusterTask
> Task :indexmanagementBwcCluster#twoThirdsUpgradedClusterTask
org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT > test policy backwards compatibility FAILED
org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:37845/], URI [_cluster/settings], status line [HTTP/1.1 400 Bad Request]
REPRODUCE WITH: ./gradlew ':indexmanagementBwcCluster#twoThirdsUpgradedClusterTask' --tests "org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT.test policy backwards compatibility" -Dtests.seed=A21555258A35275A -Dtests.security.manager=true -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Australia/North -Druntime.java=11
Suite: Test class org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT
2> REPRODUCE WITH: ./gradlew ':indexmanagementBwcCluster#twoThirdsUpgradedClusterTask' --tests "org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT.test policy backwards compatibility" -Dtests.seed=A21555258A35275A -Dtests.security.manager=true -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Australia/North -Druntime.java=11
2> org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:37845/], URI [_cluster/settings], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]"}],"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]","caused_by":{"type":"illegal_argument_exception","reason":"All the nodes in cluster should be on version later than or equal to 2.4.0"}},"status":400}
at __randomizedtesting.SeedInfo.seed([A21555258A35275A:E4E1527EA57ED803]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:109)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:94)
at app//org.opensearch.indexmanagement.IndexManagementRestTestCase.setAutoCreateIndex(IndexManagementRestTestCase.kt:38)
at [email protected]/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at [email protected]/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at [email protected]/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at [email protected]/java.lang.reflect.Method.invoke(Method.java:566)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]"}],"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]","caused_by":{"type":"illegal_argument_exception","reason":"All the nodes in cluster should be on version later than or equal to 2.4.0"}},"status":400}
at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at [email protected]/java.lang.Thread.run(Thread.java:829)
RollupRunnerIT.test rollup action with alias as target_index successfully
[2022-11-10T22:06:04,534][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] before test
[2022-11-10T22:06:08,891][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] after test
org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT > test rollup action with alias as target_index successfully FAILED
org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:42643/], URI [/_plugins/_rollup/jobs/runner_with_alias_as_target?if_seq_no=9&if_primary_term=1], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"}],"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"},"status":409}
at __randomizedtesting.SeedInfo.seed([7332C3AEA733CC1E:570293340F7A6A8C]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
at app//org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test rollup action with alias as target_index(RollupRunnerIT.kt:871)
Need to understand rollup job document update logic more deeply to see why the race condition can happen.
Continue to dive deep on this issue: https://github.com/opensearch-project/index-management/issues/90#issuecomment-1125457680
Some rollup job fails to run after 2s in multi-node tests, the related logs
587:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
603:[node 2] [.opendistro-ism-config] creating index
620:[node 0] JobSweeper started listening to operations on index .opendistro-ism-config
630:[node 0] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
631:[node 2] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
682:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
723:[node 0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config
734:[ ] send the update rollup start time request
775:[node 0] Descheduling jobId: restart_failed_rollup
782:[node 2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config
From the log, one reasonable hypothesis is the shard of config index on node 2 has not been updated to date (with the execute 2s later change) Added the refresh parameter for update start time call, but it doesn't resolve the problem.
It's not clear why the descheduling on node 0 and scheduling on the other node 2 happens, does it relate to the update start time call? Need more log in JS to find out the reason of this.
https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished> at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0) at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:120) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222) at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119) at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)
Job in test is defined as continuous=false so it could finish before test catches that STARTED status change in metadata. We can just make it continuous=true
java.lang.NullPointerException
RestStopRollupActionIT.test stopping a failed rollup
When node disconnects(which happens a lot for some reason), job might stay blocked on API calls like _search for 30+ seconds(tested locally) and our waitFor { ... } in tests would timeout. I suggest we increase this default timeout for waitFor from 30sec to 90 or so
RestDeleteRollupActionIT.test deleting a rollup
org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup FAILED
org.opensearch.client.ResponseException: method [PUT], host [https://localhost:9200], URI [/_plugins/_rollup/jobs/GjkLiHNdTP?refresh=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"}],"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"},"status":400}
at __randomizedtesting.SeedInfo.seed([19AD219FA033B002:35C712E4A8126070]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollupJson(RollupRestTestCase.kt:99)
at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup(RollupRestTestCase.kt:77)
at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup$default(RollupRestTestCase.kt:71)
at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup(RollupRestTestCase.kt:111)
at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup$default(RollupRestTestCase.kt:109)
at app//org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup(RestDeleteRollupActionIT.kt:21)
org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup that doesn't exist in existing config index FAILED
java.lang.AssertionError: expected:<NOT_FOUND> but was:<BAD_REQUEST>
at __randomizedtesting.SeedInfo.seed([3FE7F518CA239133:249BA19E7AC77924]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:[120](https://github.com/opensearch-project/index-management/actions/runs/6115943950/job/16600369681#step:5:121))
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup that doesn't exist in existing config index(RestDeleteRollupActionIT.kt:37)
The flaky reason for such is because the rollup job, even after the config index deleted, the running cannot be interrupted but continue to still operate (saving some doc into config index) So the idea to fix this is by waiting for the rollup job to "truely" finish then do the next test operation.
-
for continuous mode
test continuous rollup search
How to fix: Can based on the processed documents number to decide whether this run is finished or the next start window, refer to https://github.com/opensearch-project/index-management/blob/4a6c92b6899d4115ad25df7f901ee58684c23187/src/test/kotlin/org/opensearch/indexmanagement/rollup/interceptor/RollupInterceptorIT.kt#L606-L611 -
for one-time mode, we wait for metadata status becomes FINISHED, but then the rollup job will disable itself, which will recreate the index and block the next job run https://github.com/opensearch-project/index-management/blob/cd99b0f39f2fd279d6f85161cef3b65e7adc6236/src/main/kotlin/org/opensearch/indexmanagement/rollup/RollupRunner.kt#L328-L332 How to fix: can based on whether the job is already disabled which is the last update call from one rollup jon For all the non continuous rollup job, should check like this https://github.com/opensearch-project/index-management/blob/dd50b3eb6d9869f9b4c1d286e8ab9c298dcb73f1/src/test/kotlin/org/opensearch/indexmanagement/rollup/runner/RollupRunnerIT.kt#L874-L875
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
org.opensearch.client.ResponseException: method [POST], host [http://[::1]:41461], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"}],"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"},"status":409}
at __randomizedtesting.SeedInfo.seed([4C03DDBDB5DE427F:396CA6272AA3E4F]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:384)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:354)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:329)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:110)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)
1> [2023-07-12T18:49:25,309][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test
1> [2023-07-12T18:49:28,596][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1689180510596
1> [2023-07-12T18:49:32,665][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] There are still tasks running after this test that might break subsequent tests [indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index].
1> [2023-07-12T18:49:32,666][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test
org.opensearch.indexmanagement.transform.TransformRunnerIT > test continuous transform with wildcard indices FAILED
java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
at __randomizedtesting.SeedInfo.seed([BBAD4A6DC3D51A99:51CB84FA7605837E]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)
For now only seen on 1.3.8
Suite: Test class org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT
2> REPRODUCE WITH: gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform" -Dtests.seed=3AE2878216465C0D -Dtests.security.manager=false -Dtests.locale=ar-SA -Dtests.timezone=PLT -Druntime.java=17
2> java.lang.AssertionError: expected:<{acknowledged=true}> but was:<{acknowledged=false}>
at __randomizedtesting.SeedInfo.seed([3AE2878216465C0D:BB536F5FF53592B4]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform(RestStopTransformActionIT.kt:45)
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED java.lang.NullPointerException at __randomizedtesting.SeedInfo.seed([93A0B009A21A1943:F7B5DBD2943544B5]:0) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129) at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119) at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129) 1> [2023-05-11T22:33:50,096][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a stopped rollup] before test 4_Run integration tests with multi node config.txt
Assertion failure -
Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=566DD28B92044ABE -Dtests.security.manager=false -Dtests.locale=de-CH -Dtests.timezone=Asia/Beirut -Druntime.java=8
2> java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
at __randomizedtesting.SeedInfo.seed([566DD28B92044ABE:BC0B1C1C27D4D359]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)
https://github.com/opensearch-project/index-management/pull/818#issuecomment-1603405480
org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
java.lang.AssertionError: Policy creation trigger time didn't match expected:<1693475220000> but was:<1693475160000>
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy" -Dtests.seed=2D65846124E40D95 -Dtests.security.manager=false -Dtests.locale=ar-IQ -Dtests.timezone=Europe/Rome -Druntime.java=17
at __randomizedtesting.SeedInfo.seed([2D65846124E40D95:C509D2F743CA9E1B]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at
1693475220000 and 1693475160000 has 1m difference
https://github.com/opensearch-project/index-management/blob/1c9f2326179f5c27ac4220e3990724076dfdab29/src/test/kotlin/org/opensearch/indexmanagement/snapshotmanagement/resthandler/RestExplainSnapshotManagementIT.kt#L48
This line should be move forward right after updateSMPolicyStartTime(smPolicy)
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:40201/], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"}],"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"},"status":409}
at __randomizedtesting.SeedInfo.seed([4C04B53ACE4A7D12:391A2E5093E0122]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:[110](https://github.com/opensearch-project/index-management/actions/runs/6087082313/job/16529399791?pr=917#step:5:111))
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)
1> [2023-09-06T04:56:09,984][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test
1> [2023-09-06T04:56:09,987][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] initializing REST clients against [http://[::1]:34255, http://127.0.0.1:40201]/
Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
1> [2023-09-06T04:56:12,771][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1693968914771
2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
1> [2023-09-06T04:56:15,073][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=[60](https://github.com/opensearch-project/index-management/actions/runs/6087082323/job/16529957685?pr=917#step:4:61)268BB86926DC53 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([60268BB86926DC53:433E0635F0981A5]:0)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:125)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129)
1> [2023-09-05T21:42:00,471][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] before test
1> [2023-09-05T21:42:00,685][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] Update rollup start time to 1693971662685
1> [2023-09-05T21:42:21,926][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] There are still tasks running after this test that might break subsequent tests [indices:admin/seq_no/retention_lease_sync[p], indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, internal:index/shard/recovery/start_recovery, retention_lease_sync].
1> [2023-09-05T21:42:21,927][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] after test
org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index not failed FAILED
org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:36939/], URI [/_plugins/_ism/policies/nKcdTokvYV?refresh=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
at __randomizedtesting.SeedInfo.seed([277DE7DD2C0FC9DA:DCD1D3F9DB703133]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:385)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:355)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:330)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6092475260/job/16530644871?pr=919#step:5:108))
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index not failed(RestRetryFailedManagedIndexActionIT.kt:181)
1> [2023-09-05T21:10:30,893][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
1> [2023-09-05T21:10:46,956][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
1> [2023-09-05T21:10:46,961][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] before test
1> [2023-09-05T21:10:47,549][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] after test
1> [2023-09-05T21:10:47,573][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] before test
1> [2023-09-05T21:10:54,254][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] after test
org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index has no metadata FAILED
org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:43057], URI [/_plugins/_ism/policies/VQXByzgvXd?refresh=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
at __randomizedtesting.SeedInfo.seed([9D243BEA443D0FB7:81B369160383AE27]:0)
at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6116145222/job/16600899736#step:5:108))
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index has no metadata(RestRetryFailedManagedIndexActionIT.kt:156)
1> [2023-09-07T20:58:23,281][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
1> [2023-09-07T20:58:38,825][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
1> [2023-09-07T20:58:38,832][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] before test
1> [2023-09-07T20:58:38,980][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] after test
There are situations multi node test running very long, 5-10x of normal run time. One such situation I checked is one node is stuck with problematic node lock.
This happens when running the test test allocate exclude
, and when the index .opendistro-ism-config
gets deleted at the end of the test
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)
It never be able to join back the cluster also...
[2023-10-04T17:08:53,839][WARN ][o.o.e.NodeEnvironment ] [integTest-1] lock assertion failed
java.nio.channels.ClosedByInterruptException: null
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199) ~[?:?]
at sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:171) ~[?:?]
at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:395) ~[?:?]
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:169) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.env.NodeEnvironment.availableShardPaths(NodeEnvironment.java:949) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.env.NodeEnvironment.deleteShardDirectoryUnderLock(NodeEnvironment.java:580) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.IndicesService.deleteShardStore(IndicesService.java:1210) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.IndexService.onShardClose(IndexService.java:670) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:793) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:780) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.store.Store.closeInternal(Store.java:566) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.store.Store$1.closeInternal(Store.java:193) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:78) [opensearch-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.store.Store.decRef(Store.java:539) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1775) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1746) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.shard.IndexShard.refresh(IndexShard.java:1303) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.shard.IndexShard.lambda$buildRefreshListeners$44(IndexShard.java:4325) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.shard.RefreshListeners.forceRefreshes(RefreshListeners.java:128) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.index.shard.IndexShard.relocated(IndexShard.java:830) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$27(RecoverySourceHandler.java:850) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:124) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.CancellableThreads.execute(CancellableThreads.java:104) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$28(RecoverySourceHandler.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1493) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-10-04T17:08:53,844][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-1] [[.opendistro-ism-config/svhpmccOR-ighAH7RXtFkA]] failed to complete pending deletion for index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:165) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.env.NodeEnvironment.indexPaths(NodeEnvironment.java:932) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.env.NodeEnvironment.deleteIndexDirectoryUnderLock(NodeEnvironment.java:656) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.IndicesService.processPendingDeletes(IndicesService.java:1488) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.cluster.IndicesClusterStateService$1.doRun(IndicesClusterStateService.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]