index-management icon indicating copy to clipboard operation
index-management copied to clipboard

Flaky tests

Open bowenlan-amzn opened this issue 3 years ago • 30 comments

Describe the bug

Note Please first search in this page for the same failure before recording yours

This issue is to record flaky tests we are seeing. Most of time, your PR is not the reason of the test failure. So you can just record the flaky test failure here and we will dig into it later.

Report like this:

Refer to this comment https://github.com/opensearch-project/index-management/issues/90#issuecomment-1378086112 You can download the log of failed run from the bottom of the summary of the workflow.

How to debug the flaky test

Flaky test is hard to understand at the first look, considering we may not have enough log. And it's also not easy to reproduce (race condition, environment dependent).

An efficient way to handle these:

  1. Go into the failed run, based on the test failures, you should be able to relate to the source code part.
  2. Download the log associated with the failed run, locating the logs in integTest.log when this test was run.
  3. If we don't have enough log to understand how it could happen, add more DEBUG log into the source code that would be enough for us to understand when this flaky test shows up next time. Change test log level to debugging like this. So we can see more info in the cluster log.
  4. You can also add log into the test code, so we can see more info in the test log.

Script to reproduce the flaky

This script works on Linux system

script ../flakyoutput.txt # use shell command script to record the terminal output into a file

for i in {1..10}; do
	echo "=== Run time $i ==="
	./gradlew clean && ./gradlew integTest --tests "*IndexManagementIndicesIT" --tests "*MetadataRegressionIT" --tests "*ActionRetryIT" -PnumNodes=3
	retVal=$?
	if [ $retVal -eq 1 ]; then
	    echo "Flaky test found in run time $i"
	    break
	fi
done

ctrl+d # end shell command script

bowenlan-amzn avatar Jun 30 '21 18:06 bowenlan-amzn

RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter

Workflow link

error
2021-08-02T17:31:54.8424115Z REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter" -Dtests.seed=10028AB61748958A -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=America/Vancouver -Druntime.java=14
2021-08-02T17:31:54.8429763Z org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT > test change policy API should only apply to indices in the state filter FAILED
2021-08-02T17:31:54.8432504Z     java.lang.AssertionError: expected:<false> but was:<true>
2021-08-02T17:31:54.8433481Z         at __randomizedtesting.SeedInfo.seed([10028AB61748958A:1A26BFFFDEAC54B6]:0)
2021-08-02T17:31:54.8434355Z         at org.junit.Assert.fail(Assert.java:89)
2021-08-02T17:31:54.8435156Z         at org.junit.Assert.failNotEquals(Assert.java:835)
2021-08-02T17:31:54.8436056Z         at org.junit.Assert.assertEquals(Assert.java:120)
2021-08-02T17:31:54.8437322Z         at org.junit.Assert.assertEquals(Assert.java:146)
2021-08-02T17:31:54.8441710Z         at org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.assertAffectedIndicesResponseIsEqual(IndexStateManagementRestTestCase.kt:741)
2021-08-02T17:31:54.8448231Z         at org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestChangePolicyActionIT.test change policy API should only apply to indices in the state filter(RestChangePolicyActionIT.kt:496)

bowenlan-amzn avatar Aug 02 '21 18:08 bowenlan-amzn

IndexManagementIndicesIT.test update management index mapping with new schema version

I saw this flakey failure on 3/5 multi node runs I did Workflow link

org.opensearch.indexmanagement.IndexManagementIndicesIT > test update management index mapping with new schema version FAILED
    org.opensearch.client.ResponseException: method [DELETE], host [http://[::1]:39473], URI [/.ds-test-data-stream-000001], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"}],"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"},"status":400}

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version" -Dtests.seed=BA2DF08B4E74969 -Dtests.security.manager=false -Dtests.locale=es-US -Dtests.timezone=Pacific/Niue -Druntime.java=14
        at __randomizedtesting.SeedInfo.seed([BA2DF08B4E74969:E2E0F3C6EB23FCC8]:0)
        at org.opensearch.client.RestClient.convertResponse(RestClient.java:339)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:309)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:283)
        at org.opensearch.indexmanagement.ODFERestTestCase.wipeAllODFEIndices(ODFERestTestCase.kt:139)
        at org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version(IndexManagementIndicesIT.kt:80)

downsrob avatar Oct 05 '21 18:10 downsrob

RollupInterceptorIT.test roll up search RollupInterceptorIT.test bucket and sub aggregations have correct values RollupRunnerIT.test metadata is created for rollup job when none exists RollupRunnerIT.test no-op execution when a full window of time to rollup is not available

Workflow link

org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT > test roll up search FAILED
    java.lang.AssertionError: Rollup job doesn't have metadata set
        at __randomizedtesting.SeedInfo.seed([687045F74F56AAF3:C7C066BB95AE41F9]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:91)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:51)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT.test roll up search(RollupInterceptorIT.kt:89)

TransformRunnerIT.test transform with data filter

Workflow link

errors
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=A19D930A096062ED -Dtests.security.manager=false -Dtests.locale=ru -Dtests.timezone=America/Recife -Druntime.java=14
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
    java.lang.AssertionError: Transform job doesn't have metadata set
        at __randomizedtesting.SeedInfo.seed([A19D930A096062ED:87A193E3B89B59E1]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:108)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:34)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:106)

downsrob avatar Nov 05 '21 16:11 downsrob

IndexStateManagementHistoryIT.test history shard settings

Workflow link

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings" -Dtests.seed=394C4E8F4FAA2B7B -Dtests.security.manager=false -Dtests.locale=mt -Dtests.timezone=Europe/Vaduz -Druntime.java=14
org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT > test history shard settings FAILED
    java.lang.AssertionError: Wrong number of shards expected:<2> but was:<1>
        at __randomizedtesting.SeedInfo.seed([394C4E8F4FAA2B7B:CA84DAB66688400A]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:452)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:50)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings(IndexStateManagementHistoryIT.kt:447)

  1> [2023-01-10T21:09:43,896][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] before test
  1> [2023-01-10T21:09:43,900][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] initializing REST clients against [http://[::1]:40303, http://127.0.0.1:35475]
  1> [2023-01-10T21:10:07,142][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] after test

logs.zip

downsrob avatar Nov 15 '21 17:11 downsrob

Time captured is 0 problem, if we see such problem, we can update the >0 to be >=0

 org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
    java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=49642B2182B7B4FC -Dtests.security.manager=false -Dtests.locale=hi-IN -Dtests.timezone=America/Indiana/Vincennes -Druntime.java=17
        at __randomizedtesting.SeedInfo.seed([49642B2182B7B4FC:BEAE0232E3E6D91]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
    java.lang.AssertionError: Doesn't capture indexed time
        at __randomizedtesting.SeedInfo.seed([E76B690D887C5E11:A5E5A20F24F5877C]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:68)

Test code ref, we should use different index name for different tests.

We saw a number of different flaky test failures in jenkins runs on security disabled domains for the 2.0.1 release which all seem to be related to capturing search time. Search time is checked after index time in these integration tests, and indexed into the metadata document at the same time. Perhaps the search result is cached from another test which is making the search time 0 on single node domains.

Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=C156C07169688EB4 -Dtests.security.manager=false -Dtests.locale=th -Dtests.timezone=Asia/Dubai -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([C156C07169688EB4:83D80B73C5E157D9]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:2FEAF0F5D4643658]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:108)
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:4B583B1EC916D439]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period" -Dtests.seed=5BC1719B69F3D965 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=ROK -Druntime.java=17
  2> java.lang.AssertionError: Did not spend time searching
        at __randomizedtesting.SeedInfo.seed([5BC1719B69F3D965:B377556363CB8BF6]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period(RollupRunnerIT.kt:730)

TransformRunnerIt.test transform with data filter

Added commit that comments the assertion: https://github.com/opensearch-project/index-management/pull/581

 java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=D58AE6EFD6EC0421 -Dtests.security.manager=false -Dtests.locale=es-BO -Dtests.timezone=Asia/Qostanay -Druntime.java=17

org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
    java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([D58AE6EFD6EC0421:F3B6E60667173F2D]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:111)

downsrob avatar Apr 07 '22 22:04 downsrob

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size" -Dtests.seed=7BB5E96EAB38AAE6 -Dtests.security.manager=false -Dtests.locale=cs-CZ -Dtests.timezone=America/Lower_Princes -Druntime.java=17
org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT > test rollover min primary shard size FAILED

    java.lang.AssertionError: Found multiple shards over 100kb

        at __randomizedtesting.SeedInfo.seed([7BB5E96EAB38AAE6:7EE92CD3306B0DB7]:0)
Suite: Test class org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT
        at org.junit.Assert.fail(Assert.java:89)
  2> Dub 18, 2022 7:37:30 ODP. org.opensearch.client.RestClient logResponse
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size(RolloverActionIT.kt:287)
  2> WARNING: request [PUT http://[::1]:42817/bwc_index-1] returned 1 warnings: [299 OpenSearch-2.0.0-SNAPSHOT-d94ccc2b76e1b6ea07481b30903a12038f5703fd "[index.opendistro.index_state_management.rollover_alias] setting was deprecated in OpenSearch and will be removed in a future release! See the breaking changes documentation for the next major version."]

Add more comments in this PR https://github.com/opensearch-project/index-management/pull/611

bowenlan-amzn avatar Apr 19 '22 01:04 bowenlan-amzn

java.lang.NullPointerException

RestStopRollupActionIT.test stopping a failed rollup

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11

Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11
  2> java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([25E38D5DE3AB3302:41F6E686D5846EF4]:0)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:125)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:123)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:123)

TransformRunnerIT.test continuous transform with wildcard indices

  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=4CB2963FF335C01A -Dtests.security.manager=false -Dtests.locale=zh-Hans-CN -Dtests.timezone=Etc/GMT+9 -Druntime.java=11
  2> java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([4CB2963FF335C01A:A6D458A846E559FD]:0)
        at java.base/java.time.Instant.compareTo(Instant.java:1254)
        at java.base/java.time.Instant.isAfter(Instant.java:1271)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:764)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:758)

https://github.com/opensearch-project/index-management/blob/1.x/src/test/kotlin/org/opensearch/indexmanagement/transform/TransformRunnerIT.kt#L764

--tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified" -Dtests.seed=5DCD6D7B8AAB3141 -Dtests.security.manager=false -Dtests.locale=es-MX -Dtests.timezone=Europe/Minsk -Druntime.java=11
 
java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([5DCD6D7B8AAB3141:6E4476B0B23793F7]:0)
        at java.base/java.time.Instant.compareTo(Instant.java:1254)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:428)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified(TransformRunnerIT.kt:424)

bowenlan-amzn avatar Apr 22 '22 22:04 bowenlan-amzn

Looking at some of the recent flaky failures in RestStopRollupActionIT, the flaky failures seemed to be caused by the metadata doc missing, but there weren't enough logs to tell what was going wrong. Next time there is a PR which fails due to this, we should add additional debug logs in the rollup runner and metadata writer. The rollup tests seem particularly flaky.

downsrob avatar May 11 '22 23:05 downsrob

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup" -Dtests.see d=3BC75D2F7F1C189E -Dtests.security.manager=false -Dtests.locale=
zh-Hans-SG -Dtests.timezone=Pacific/Fakaofo -Druntime.java=14
  2> java.lang.AssertionError: MetadataID on rollup was null
        at __randomizedtesting.SeedInfo.seed([3BC75D2F7F1C189E:78637077DCF0C390]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:109)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:107)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup(RestStartRollupActionIT.kt:107)

I had this fail on a multinode cluser and when looking into the logs I saw that it was because the ISM index moved nodes: On node 2:

[2022-05-12T21:44:04,036][INFO ][o.o.j.s.JobSweeper       ] [integTest-2] Descheduling jobs of shard 3 index .opendistro-ism-config as the shard is removed from this node.
[2022-05-12T21:44:04,333][INFO ][o.o.j.s.JobScheduler     ] [integTest-2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config .
[2022-05-12T21:44:04,469][INFO ][o.o.j.s.JobScheduler     ] [integTest-2] Descheduling jobId: restart_failed_rollup

On node 1:

[2022-05-12T21:44:04,543][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config 
.... this final descheduling was because the ism index was deleted during cleanup
[2022-05-12T21:44:36,179][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Descheduling jobId: restart_failed_rollup

It seems that some of these multinode flaky failures could be related to https://github.com/opensearch-project/job-scheduler/issues/173

downsrob avatar May 12 '22 21:05 downsrob

DeleteActionIT.test basic

There could be race condition to check history document after an index gets deleted.

https://github.com/opensearch-project/index-management/blob/main/src/test/kotlin/org/opensearch/indexmanagement/indexstatemanagement/action/DeleteActionIT.kt#L55

[2022-06-06T05:15:20,891][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:20,984][INFO ][o.o.j.s.JobSweeper       ] [integTest-0] Descheduling job p8ToeW8xSnaMY9Ju0bjRig on index .opendistro-ism-config
[2022-06-06T05:15:20,985][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Descheduling jobId: p8ToeW8xSnaMY9Ju0bjRig
[2022-06-06T05:15:20,994][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Finished executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:21,004][INFO ][o.o.j.s.JobSweeper       ] [integTest-0] Indexing failed for job p8ToeW8xSnaMY9Ju0bjRig#metadata on index .opendistro-ism-config
[2022-06-06T05:15:21,010][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] There was VersionConflictEngineException trying to update the metadata for deleteactionit_index_1. Message: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
org.opensearch.index.engine.VersionConflictEngineException: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
[2022-06-06T05:15:21,011][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] Failed to update ManagedIndexMetaData after executing the Step : attempt_delete

bowenlan-amzn avatar Jun 06 '22 21:06 bowenlan-amzn

https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED
    java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished>
        at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)

logs.zip


RestStartTransformActionIT.test starting a finished transform

Workflow link

error
2021-08-02T20:16:09.9041497Z REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform" -Dtests.seed=DD2C384C67FC30BD -Dtests.security.manager=false -Dtests.locale=uk-UA -Dtests.timezone=America/Campo_Grande -Druntime.java=14
2021-08-02T20:16:09.9050653Z 
2021-08-02T20:16:09.9828220Z org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT > test starting a finished transform FAILED
2021-08-02T20:16:09.9831015Z     java.lang.AssertionError: Status should be finished expected:<finished> but was:<started>
2021-08-02T20:16:09.9832385Z         at __randomizedtesting.SeedInfo.seed([DD2C384C67FC30BD:87A478982B41BA5]:0)
2021-08-02T20:16:09.9833459Z         at org.junit.Assert.fail(Assert.java:89)
2021-08-02T20:16:09.9834423Z         at org.junit.Assert.failNotEquals(Assert.java:835)
2021-08-02T20:16:09.9835510Z         at org.junit.Assert.assertEquals(Assert.java:120)
2021-08-02T20:16:09.9838406Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:196)
2021-08-02T20:16:09.9845749Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:32)
2021-08-02T20:16:09.9854345Z         at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:126)
2021-08-02T20:16:09.9857268Z         at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:120)
2021-08-02T20:16:09.9860591Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform(RestStartTransformActionIT.kt:193)
2021-08-02T20:16:17.8835315Z 

bowenlan-amzn avatar Sep 01 '22 23:09 bowenlan-amzn

test move metadata service java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>

Update: from the log, everything is working fine, but metadata move cannot find the metadata in the cluster state, I think this is more likely a core problem. And we will remove this test anyway soon.

org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT > test move metadata service FAILED
    java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>
        at __randomizedtesting.SeedInfo.seed([EBD0D8B7D135C4E0:49A3DDF6C040E639]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:92)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:91)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test move metadata service(MetadataRegressionIT.kt:91)

 1> [2022-09-08T00:32:56,025][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] before test
  1> [2022-09-08T00:32:56,028][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: setting up test
  1> [2022-09-08T00:32:56,084][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: all set up test
  1> [2022-09-08T00:32:56,140][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
  1> [2022-09-08T00:32:56,142][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
  1> [2022-09-08T00:32:56,197][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [0] to [-1]
  1> [2022-09-08T00:32:56,226][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [-1] to [0]
  1> [2022-09-08T00:32:56,631][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] true
  1> [2022-09-08T00:32:56,637][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] check if metadata is saved in cluster state: {index_uuid=9VHQYzkVT2y8JVczuACWQQ, rolled_over=false, policy_id=metadataregressionit_testPolicyName_1, policy_seq_no=0, index=metadataregressionit_index_1, policy_primary_term=1, state={"name":"ReplicaCountState","start_time":1234}, retry_info={"failed":false,"consumed_retries":0}, policy_completed=false, info={"message":"Happy moving"}}
  1> [2022-09-08T00:33:16,765][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
  1> [2022-09-08T00:33:16,766][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
  1> [2022-09-08T00:33:16,776][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaning up after test
  1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaned up after test
  1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] after test

logs.zip

bowenlan-amzn avatar Sep 09 '22 17:09 bowenlan-amzn

org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
    java.lang.AssertionError: Policy sequence numbers didn't match expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([59FF6DB8C59564EE:B1933B2EA2BBF760]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:55)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:35)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy(RestExplainSnapshotManagementIT.kt:35)

[2022-09-22T23:45:03,302][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] before test
[2022-09-22T23:46:13,587][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] after test

logs.zip

bowenlan-amzn avatar Sep 29 '22 06:09 bowenlan-amzn

org.opensearch.indexmanagement.transform.TransformRunnerIT > test continuous transforms with null buckets FAILED
    java.lang.AssertionError: Target transform index was not created
        at __randomizedtesting.SeedInfo.seed([BCDE8596F1D08F23:9C708CCB43625A8F]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transforms with null buckets$1.invoke(TransformRunnerIT.kt:898)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transforms with null buckets$1.invoke(TransformRunnerIT.kt:898)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transforms with null buckets(TransformRunnerIT.kt:898)

bowenlan-amzn avatar Nov 07 '22 23:11 bowenlan-amzn

test policy backwards compatibility FAILED

Regression from https://github.com/opensearch-project/OpenSearch/pull/4986

The ClusterManagerTaskThrottler validates during cluster setting update if all the nodes are above 2.4 on the cluster manager node. So during bwc rolling upgrade test, if the master node is on 2.4, it will fail this validation.

> Task :indexmanagementBwcCluster#oldVersionClusterTask1
> Task :indexmanagementBwcCluster#fullRestartClusterTask
> Task :indexmanagementBwcCluster#oldVersionClusterTask0
> Task :indexmanagementBwcCluster#mixedClusterTask


> Task :indexmanagementBwcCluster#twoThirdsUpgradedClusterTask

org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT > test policy backwards compatibility FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:37845/], URI [_cluster/settings], status line [HTTP/1.1 400 Bad Request]
REPRODUCE WITH: ./gradlew ':indexmanagementBwcCluster#twoThirdsUpgradedClusterTask' --tests "org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT.test policy backwards compatibility" -Dtests.seed=A21555258A35275A -Dtests.security.manager=true -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Australia/North -Druntime.java=11

Suite: Test class org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT
  2> REPRODUCE WITH: ./gradlew ':indexmanagementBwcCluster#twoThirdsUpgradedClusterTask' --tests "org.opensearch.indexmanagement.bwc.IndexManagementBackwardsCompatibilityIT.test policy backwards compatibility" -Dtests.seed=A21555258A35275A -Dtests.security.manager=true -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Australia/North -Druntime.java=11
  2> org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:37845/], URI [_cluster/settings], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]"}],"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]","caused_by":{"type":"illegal_argument_exception","reason":"All the nodes in cluster should be on version later than or equal to 2.4.0"}},"status":400}
        at __randomizedtesting.SeedInfo.seed([A21555258A35275A:E4E1527EA57ED803]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:109)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:94)
        at app//org.opensearch.indexmanagement.IndexManagementRestTestCase.setAutoCreateIndex(IndexManagementRestTestCase.kt:38)
        at [email protected]/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at [email protected]/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at [email protected]/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at [email protected]/java.lang.reflect.Method.invoke(Method.java:566)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]"}],"type":"illegal_argument_exception","reason":"illegal value can't update [cluster_manager.throttling.thresholds.] from [{}] to [{}]","caused_by":{"type":"illegal_argument_exception","reason":"All the nodes in cluster should be on version later than or equal to 2.4.0"}},"status":400}
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at [email protected]/java.lang.Thread.run(Thread.java:829)

bowenlan-amzn avatar Nov 08 '22 17:11 bowenlan-amzn

RollupRunnerIT.test rollup action with alias as target_index successfully

[2022-11-10T22:06:04,534][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] before test
[2022-11-10T22:06:08,891][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] after test

org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT > test rollup action with alias as target_index successfully FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:42643/], URI [/_plugins/_rollup/jobs/runner_with_alias_as_target?if_seq_no=9&if_primary_term=1], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"}],"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"},"status":409}
        at __randomizedtesting.SeedInfo.seed([7332C3AEA733CC1E:570293340F7A6A8C]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
        at app//org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test rollup action with alias as target_index(RollupRunnerIT.kt:871)

Need to understand rollup job document update logic more deeply to see why the race condition can happen.

bowenlan-amzn avatar Nov 10 '22 20:11 bowenlan-amzn

Continue to dive deep on this issue: https://github.com/opensearch-project/index-management/issues/90#issuecomment-1125457680

Some rollup job fails to run after 2s in multi-node tests, the related logs

587:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
603:[node 2] [.opendistro-ism-config] creating index
620:[node 0] JobSweeper started listening to operations on index .opendistro-ism-config
630:[node 0] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
631:[node 2] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
682:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
723:[node 0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config
734:[      ] send the update rollup start time request
775:[node 0] Descheduling jobId: restart_failed_rollup
782:[node 2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config

From the log, one reasonable hypothesis is the shard of config index on node 2 has not been updated to date (with the execute 2s later change) Added the refresh parameter for update start time call, but it doesn't resolve the problem.

It's not clear why the descheduling on node 0 and scheduling on the other node 2 happens, does it relate to the update start time call? Need more log in JS to find out the reason of this.

bowenlan-amzn avatar Nov 21 '22 23:11 bowenlan-amzn

https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED
    java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished>
        at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)

logs.zip

Job in test is defined as continuous=false so it could finish before test catches that STARTED status change in metadata. We can just make it continuous=true

petardz avatar Dec 12 '22 19:12 petardz

java.lang.NullPointerException

RestStopRollupActionIT.test stopping a failed rollup

When node disconnects(which happens a lot for some reason), job might stay blocked on API calls like _search for 30+ seconds(tested locally) and our waitFor { ... } in tests would timeout. I suggest we increase this default timeout for waitFor from 30sec to 90 or so

petardz avatar Dec 12 '22 19:12 petardz

RestDeleteRollupActionIT.test deleting a rollup

org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup FAILED
    org.opensearch.client.ResponseException: method [PUT], host [https://localhost:9200], URI [/_plugins/_rollup/jobs/GjkLiHNdTP?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"}],"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([19AD219FA033B002:35C712E4A8126070]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollupJson(RollupRestTestCase.kt:99)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup(RollupRestTestCase.kt:77)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup$default(RollupRestTestCase.kt:71)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup(RollupRestTestCase.kt:111)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup$default(RollupRestTestCase.kt:109)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup(RestDeleteRollupActionIT.kt:21)
org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup that doesn't exist in existing config index FAILED
    java.lang.AssertionError: expected:<NOT_FOUND> but was:<BAD_REQUEST>
        at __randomizedtesting.SeedInfo.seed([3FE7F518CA239133:249BA19E7AC77924]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:[120](https://github.com/opensearch-project/index-management/actions/runs/6115943950/job/16600369681#step:5:121))
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup that doesn't exist in existing config index(RestDeleteRollupActionIT.kt:37)

logs.zip


The flaky reason for such is because the rollup job, even after the config index deleted, the running cannot be interrupted but continue to still operate (saving some doc into config index) So the idea to fix this is by waiting for the rollup job to "truely" finish then do the next test operation.

  • for continuous mode test continuous rollup search How to fix: Can based on the processed documents number to decide whether this run is finished or the next start window, refer to https://github.com/opensearch-project/index-management/blob/4a6c92b6899d4115ad25df7f901ee58684c23187/src/test/kotlin/org/opensearch/indexmanagement/rollup/interceptor/RollupInterceptorIT.kt#L606-L611

  • for one-time mode, we wait for metadata status becomes FINISHED, but then the rollup job will disable itself, which will recreate the index and block the next job run https://github.com/opensearch-project/index-management/blob/cd99b0f39f2fd279d6f85161cef3b65e7adc6236/src/main/kotlin/org/opensearch/indexmanagement/rollup/RollupRunner.kt#L328-L332 How to fix: can based on whether the job is already disabled which is the last update call from one rollup jon For all the non continuous rollup job, should check like this https://github.com/opensearch-project/index-management/blob/dd50b3eb6d9869f9b4c1d286e8ab9c298dcb73f1/src/test/kotlin/org/opensearch/indexmanagement/rollup/runner/RollupRunnerIT.kt#L874-L875

bowenlan-amzn avatar Jan 03 '23 18:01 bowenlan-amzn

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://[::1]:41461], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"}],"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"},"status":409}
        at __randomizedtesting.SeedInfo.seed([4C03DDBDB5DE427F:396CA6272AA3E4F]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:354)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:329)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:110)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)

  1> [2023-07-12T18:49:25,309][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test
  1> [2023-07-12T18:49:28,596][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1689180510596
  1> [2023-07-12T18:49:32,665][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] There are still tasks running after this test that might break subsequent tests [indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index].
  1> [2023-07-12T18:49:32,666][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test

logs.zip

bowenlan-amzn avatar Jan 11 '23 00:01 bowenlan-amzn

org.opensearch.indexmanagement.transform.TransformRunnerIT > test continuous transform with wildcard indices FAILED
    java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
        at __randomizedtesting.SeedInfo.seed([BBAD4A6DC3D51A99:51CB84FA7605837E]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)

For now only seen on 1.3.8

logs.zip

bowenlan-amzn avatar Jan 31 '23 02:01 bowenlan-amzn

Suite: Test class org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT
2> REPRODUCE WITH: gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform" -Dtests.seed=3AE2878216465C0D -Dtests.security.manager=false -Dtests.locale=ar-SA -Dtests.timezone=PLT -Druntime.java=17
2> java.lang.AssertionError: expected:<{acknowledged=true}> but was:<{acknowledged=false}>
at __randomizedtesting.SeedInfo.seed([3AE2878216465C0D:BB536F5FF53592B4]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform(RestStopTransformActionIT.kt:45)

bowenlan-amzn avatar Feb 27 '23 18:02 bowenlan-amzn

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED java.lang.NullPointerException at __randomizedtesting.SeedInfo.seed([93A0B009A21A1943:F7B5DBD2943544B5]:0) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129) at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119) at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112) at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129) 1> [2023-05-11T22:33:50,096][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a stopped rollup] before test 4_Run integration tests with multi node config.txt

yuye-aws avatar May 12 '23 03:05 yuye-aws

Assertion failure -

Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=566DD28B92044ABE -Dtests.security.manager=false -Dtests.locale=de-CH -Dtests.timezone=Asia/Beirut -Druntime.java=8
  2> java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
        at __randomizedtesting.SeedInfo.seed([566DD28B92044ABE:BC0B1C1C27D4D359]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)

https://github.com/opensearch-project/index-management/pull/818#issuecomment-1603405480

getsaurabh02 avatar Jun 22 '23 23:06 getsaurabh02

org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
    java.lang.AssertionError: Policy creation trigger time didn't match expected:<1693475220000> but was:<1693475160000>
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy" -Dtests.seed=2D65846124E40D95 -Dtests.security.manager=false -Dtests.locale=ar-IQ -Dtests.timezone=Europe/Rome -Druntime.java=17
        at __randomizedtesting.SeedInfo.seed([2D65846124E40D95:C509D2F743CA9E1B]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at 

1693475220000 and 1693475160000 has 1m difference

https://github.com/opensearch-project/index-management/blob/1c9f2326179f5c27ac4220e3990724076dfdab29/src/test/kotlin/org/opensearch/indexmanagement/snapshotmanagement/resthandler/RestExplainSnapshotManagementIT.kt#L48

This line should be move forward right after updateSMPolicyStartTime(smPolicy)

bowenlan-amzn avatar Aug 31 '23 16:08 bowenlan-amzn

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:40201/], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"}],"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"},"status":409}
        at __randomizedtesting.SeedInfo.seed([4C04B53ACE4A7D12:391A2E5093E0122]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:[110](https://github.com/opensearch-project/index-management/actions/runs/6087082313/job/16529399791?pr=917#step:5:111))
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)

  1> [2023-09-06T04:56:09,984][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test

  1> [2023-09-06T04:56:09,987][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] initializing REST clients against [http://[::1]:34255, http://127.0.0.1:40201]/
Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
  1> [2023-09-06T04:56:12,771][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1693968914771
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
  1> [2023-09-06T04:56:15,073][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test

logs.zip

bowenlan-amzn avatar Sep 06 '23 04:09 bowenlan-amzn

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=[60](https://github.com/opensearch-project/index-management/actions/runs/6087082323/job/16529957685?pr=917#step:4:61)268BB86926DC53 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=Etc/GMT+6 -Druntime.java=17

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED
    java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([60268BB86926DC53:433E0635F0981A5]:0)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:125)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129)

  1> [2023-09-05T21:42:00,471][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] before test
  1> [2023-09-05T21:42:00,685][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] Update rollup start time to 1693971662685
  1> [2023-09-05T21:42:21,926][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] There are still tasks running after this test that might break subsequent tests [indices:admin/seq_no/retention_lease_sync[p], indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, internal:index/shard/recovery/start_recovery, retention_lease_sync].
  1> [2023-09-05T21:42:21,927][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] after test

logs.zip

bowenlan-amzn avatar Sep 06 '23 04:09 bowenlan-amzn

org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index not failed FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:36939/], URI [/_plugins/_ism/policies/nKcdTokvYV?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([277DE7DD2C0FC9DA:DCD1D3F9DB703133]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:385)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:355)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:330)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6092475260/job/16530644871?pr=919#step:5:108))
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
        at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index not failed(RestRetryFailedManagedIndexActionIT.kt:181)

  1> [2023-09-05T21:10:30,893][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
  1> [2023-09-05T21:10:46,956][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
  1> [2023-09-05T21:10:46,961][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] before test
  1> [2023-09-05T21:10:47,549][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] after test
  1> [2023-09-05T21:10:47,573][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] before test
  1> [2023-09-05T21:10:54,254][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] after test

logs (1).zip

org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index has no metadata FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:43057], URI [/_plugins/_ism/policies/VQXByzgvXd?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([9D243BEA443D0FB7:81B369160383AE27]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6116145222/job/16600899736#step:5:108))
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
        at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index has no metadata(RestRetryFailedManagedIndexActionIT.kt:156)

 1> [2023-09-07T20:58:23,281][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
  1> [2023-09-07T20:58:38,825][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
  1> [2023-09-07T20:58:38,832][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] before test
  1> [2023-09-07T20:58:38,980][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] after test

logs.zip

bowenlan-amzn avatar Sep 06 '23 04:09 bowenlan-amzn

There are situations multi node test running very long, 5-10x of normal run time. One such situation I checked is one node is stuck with problematic node lock.

This happens when running the test test allocate exclude, and when the index .opendistro-ism-config gets deleted at the end of the test

org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)

It never be able to join back the cluster also...

[2023-10-04T17:08:53,839][WARN ][o.o.e.NodeEnvironment    ] [integTest-1] lock assertion failed
java.nio.channels.ClosedByInterruptException: null
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199) ~[?:?]
	at sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:171) ~[?:?]
	at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:395) ~[?:?]
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:169) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.availableShardPaths(NodeEnvironment.java:949) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.deleteShardDirectoryUnderLock(NodeEnvironment.java:580) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.deleteShardStore(IndicesService.java:1210) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService.onShardClose(IndexService.java:670) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:793) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:780) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store.closeInternal(Store.java:566) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store$1.closeInternal(Store.java:193) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:78) [opensearch-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store.decRef(Store.java:539) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1775) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1746) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.refresh(IndexShard.java:1303) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.lambda$buildRefreshListeners$44(IndexShard.java:4325) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.RefreshListeners.forceRefreshes(RefreshListeners.java:128) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.relocated(IndexShard.java:830) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$27(RecoverySourceHandler.java:850) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:124) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.CancellableThreads.execute(CancellableThreads.java:104) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$28(RecoverySourceHandler.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1493) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-10-04T17:08:53,844][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-1] [[.opendistro-ism-config/svhpmccOR-ighAH7RXtFkA]] failed to complete pending deletion for index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:165) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.indexPaths(NodeEnvironment.java:932) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.deleteIndexDirectoryUnderLock(NodeEnvironment.java:656) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.processPendingDeletes(IndicesService.java:1488) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService$1.doRun(IndicesClusterStateService.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

logs.zip

bowenlan-amzn avatar Oct 04 '23 20:10 bowenlan-amzn