
[BUG] Tag Mismatch error on VisualizationsToolIT.testVisualizationFound Windows Test

Open brianf-aws opened this issue 10 months ago • 12 comments

What is the bug? https://github.com/opensearch-project/ml-commons/blob/85d0c9e2b8807162de9afe7c915801b75e486064/plugin/src/test/java/org/opensearch/ml/tools/VisualizationsToolIT.java#L59-L66 There is a retry enabled on the VisualizationsToolIT.testVisualizationFound test, but the retry seems to have a flaw: it does not help if the underlying problem is different from the one it was added for. What I am seeing here is an encryption issue (AEADBadTagException: Tag mismatch). This might be the source of all of our flaky tests.

VisualizationsToolIT > testVisualizationFound FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:54529/], URI [/_plugins/_ml/agents/bLjaRJQB515KRnslfdUv/_execute], status line [HTTP/1.1 500 Internal Server Error]
    {"status":500,"error":{"type":"AEADBadTagException","reason":"System Error","details":"Tag mismatch"}}
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:501)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:359)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:182)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:155)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:144)
        at app//org.opensearch.ml.tools.VisualizationsToolIT.testVisualizationFound(VisualizationsToolIT.java:74)

    java.lang.AssertionError: The response failed to meet condition after 5 attempts. Attempted to perform GET : /_plugins/_ml/models/arjaRJQB515KRnsleNWv
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.waitResponseMeetingCondition(ToolIntegrationWithLLMTest.java:103)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.checkForModelUndeployedStatus(ToolIntegrationWithLLMTest.java:89)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.deleteModel(ToolIntegrationWithLLMTest.java:74)
        at

... 

2> REPRODUCE WITH: gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.tools.VisualizationsToolIT.testVisualizationFound" -Dtests.seed=AD7A0603B7C68274 -Dtests.security.manager=false -Dtests.locale=fr-GN -Dtests.timezone=America/Argentina/Buenos_Aires -Druntime.java=21
  2> org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:54529/], URI [/_plugins/_ml/agents/bLjaRJQB515KRnslfdUv/_execute], status line [HTTP/1.1 500 Internal Server Error]
    {"status":500,"error":{"type":"AEADBadTagException","reason":"System Error","details":"Tag mismatch"}}
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:501)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:359)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:182)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:155)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:144)
        at app//org.opensearch.ml.tools.VisualizationsToolIT.testVisualizationFound(VisualizationsToolIT.java:74)

    java.lang.AssertionError: The response failed to meet condition after 5 attempts. Attempted to perform GET : /_plugins/_ml/models/arjaRJQB515KRnsleNWv
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.waitResponseMeetingCondition(ToolIntegrationWithLLMTest.java:103)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.checkForModelUndeployedStatus(ToolIntegrationWithLLMTest.java:89)
        at org.opensearch.ml.tools.ToolIntegrationWithLLMTest.deleteModel(ToolIntegrationWithLLMTest.java:74)
        at

How can one reproduce the bug? This was discovered in a build failure.

What is the expected behavior? This test should pass or time out, but it should not hit this encryption issue.

brianf-aws avatar Jan 08 '25 20:01 brianf-aws
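Editorial sketch of the retry concern raised in the issue body: a retry loop only helps when the failure is transient, so one option is to classify the exception and fail fast on anything else. The class `SelectiveRetry`, its `call` method, and the choice of `IOException` as the "transient" category are hypothetical and do not come from ml-commons; the exceptions in `main` are stand-ins for the real `ResponseException`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of ml-commons: retries only failures that the
// caller classifies as transient and rethrows everything else immediately.
final class SelectiveRetry {

    static <T> T call(Callable<T> action, int maxAttempts, Predicate<Exception> isTransient)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isTransient.test(e)) {
                    throw e; // non-transient: retrying would only hide the real error
                }
                last = e; // transient: remember it and try again
            }
        }
        throw last; // every attempt failed with a transient error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        try {
            // Stand-in for a request that always fails with a decryption error.
            call(() -> {
                calls[0]++;
                throw new IllegalStateException("Tag mismatch");
            }, 5, e -> e instanceof IOException);
        } catch (IllegalStateException e) {
            System.out.println("gave up after " + calls[0] + " call(s): " + e.getMessage());
        }
    }
}
```

With a classification like this, a decryption failure would surface on the first attempt instead of being repeated until the attempt budget runs out. Which errors count as transient for these integration tests is an open question that the thread does not settle.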

Seeing VisualizationIT failing again, but with a different error. https://github.com/opensearch-project/ml-commons/pull/3353

mingshl avatar Jan 09 '25 16:01 mingshl

> seeing VisualizationIT failing again but with different error. #3353

Hmm, this is confusing. I would have thought the retry would help, but as you said, it didn't. It is clearly failing even though the number of retries is set according to how many nodes there are. If only there were some way to dump all possible info and configuration when this happens.

brianf-aws avatar Jan 09 '25 18:01 brianf-aws

Hey @Hailong-am do you mind taking a look? Thanks

brianf-aws avatar Jan 14 '25 19:01 brianf-aws

Catch All Triage - 1, 2, 3

krisfreedain avatar Jan 27 '25 17:01 krisfreedain

> Hey @Hailong-am do you mind taking a look? Thanks

Do you have the link or the logs for this failure?

Hailong-am avatar Jan 28 '25 07:01 Hailong-am

Hey Hailong, we are trying to paste the stack traces along with the reproduction line too. Thankfully this build failure log didn't expire. Can you take a look?

brianf-aws avatar Jan 28 '25 19:01 brianf-aws

Adding the log here in txt format so it doesn't expire

6_Build and Test MLCommons Plugin on linux (21).txt

Here is another example of a build failure. Linking the txt file here as well to make sure it does not expire.

7_Build and Test MLCommons Plugin on linux (21).txt

brianf-aws avatar Jan 28 '25 19:01 brianf-aws

@Hailong-am did you get a chance to look at this? any update?

pyek-bot avatar Feb 11 '25 18:02 pyek-bot

> @Hailong-am did you get a chance to look at this? any update?

Looking at the attached logs: [testVisualizationNotFound] The 6-th attempt on GET:/_plugins/_ml/models/UNT-S5QBXi7OW4I7mZRp . response: Response{requestLine=GET /_plugins/_ml/models/UNT-S5QBXi7OW4I7mZRp HTTP/1.1, host=http://[::1]:38269, response=HTTP/1.1 200 OK}

The Tag mismatch error happened in the model deploy phase, not in the get model API call, so I assume the Tag mismatch error is not the cause of the flaky test this time.

We may need to add some logs to see what the actual response body of the get model API is.

```
Suppressed: javax.crypto.AEADBadTagException: Tag mismatch
2025-01-09T16:56:16.8490050Z »  		at java.base/com.sun.crypto.provider.GaloisCounterMode$GCMDecrypt.doFinal(GaloisCounterMode.java:1545) ~[?:?]
2025-01-09T16:56:16.8491643Z »  		at java.base/com.sun.crypto.provider.GaloisCounterMode.engineDoFinal(GaloisCounterMode.java:417) ~[?:?]
2025-01-09T16:56:16.8492770Z »  		at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2244) ~[?:?]
2025-01-09T16:56:16.8494108Z »  		at com.amazonaws.encryptionsdk.internal.JceKeyCipher.decryptKey(JceKeyCipher.java:129) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8495801Z »  		at com.amazonaws.encryptionsdk.jce.JceMasterKey.decryptDataKey(JceMasterKey.java:165) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8497882Z »  		at com.amazonaws.encryptionsdk.DefaultCryptoMaterialsManager.decryptMaterials(DefaultCryptoMaterialsManager.java:118) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8500076Z »  		at com.amazonaws.encryptionsdk.internal.DecryptionHandler.readHeaderFields(DecryptionHandler.java:621) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8502066Z »  		at com.amazonaws.encryptionsdk.internal.DecryptionHandler.<init>(DecryptionHandler.java:111) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8503830Z »  		at com.amazonaws.encryptionsdk.internal.DecryptionHandler.create(DecryptionHandler.java:302) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8505549Z »  		at com.amazonaws.encryptionsdk.AwsCrypto.decryptData(AwsCrypto.java:511) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8507014Z »  		at com.amazonaws.encryptionsdk.AwsCrypto.decryptData(AwsCrypto.java:502) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8508505Z »  		at com.amazonaws.encryptionsdk.AwsCrypto.decryptData(AwsCrypto.java:476) ~[aws-encryption-sdk-java-2.4.1.jar:?]
2025-01-09T16:56:16.8510156Z »  		at org.opensearch.ml.engine.encryptor.EncryptorImpl.decrypt(EncryptorImpl.java:97) ~[opensearch-ml-algorithms-2.19.0.0-SNAPSHOT.jar:?]
2025-01-09T16:56:16.8512230Z »  		at org.opensearch.ml.engine.algorithms.remote.RemoteModel.lambda$initModel$0(RemoteModel.java:104) ~[opensearch-ml-algorithms-2.19.0.0-SNAPSHOT.jar:?]
2025-01-09T16:56:16.8514467Z »  		at org.opensearch.ml.common.connector.HttpConnector.decrypt(HttpConnector.java:366) ~[opensearch-ml-common-2.19.0.0-SNAPSHOT.jar:?]
2025-01-09T16:56:16.8516333Z »  		at org.opensearch.ml.engine.algorithms.remote.RemoteModel.initModel(RemoteModel.java:104) [opensearch-ml-algorithms-2.19.0.0-SNAPSHOT.jar:?]
2025-01-09T16:56:16.8518000Z »  		at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:139) [opensearch-ml-algorithms-2.19.0.0-SNAPSHOT.jar:?]
```

Hailong-am avatar Feb 12 '25 10:02 Hailong-am
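Editorial sketch for readers unfamiliar with the exception in the trace above: `AEADBadTagException` is what the JDK's AES-GCM decryption throws when the authentication tag does not verify. The example below triggers it in isolation by decrypting with a different key than the one used to encrypt. This is only one way to get a tag mismatch, and the thread does not establish that a key mismatch is what happened in the failing build. `GcmTagMismatchDemo` and its methods are illustrative names; only the `javax.crypto` calls are real JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative only: reproduces an AES-GCM authentication failure in isolation.
final class GcmTagMismatchDemo {
    private static final int TAG_BITS = 128;

    static SecretKey newKey() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return cipher.doFinal(plaintext); // ciphertext with the tag appended
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return cipher.doFinal(ciphertext); // throws AEADBadTagException if the tag does not verify
    }

    // Encrypts with one key, decrypts with another, and reports whether
    // the decryption was rejected by the tag check.
    static boolean failsWithBadTag(SecretKey encryptKey, SecretKey decryptKey) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        byte[] ciphertext = encrypt(encryptKey, iv, "credential".getBytes(StandardCharsets.UTF_8));
        try {
            decrypt(decryptKey, iv, ciphertext);
            return false;
        } catch (AEADBadTagException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        SecretKey keyA = newKey();
        SecretKey keyB = newKey();
        System.out.println("same key rejected: " + failsWithBadTag(keyA, keyA));
        System.out.println("different key rejected: " + failsWithBadTag(keyA, keyB));
    }
}
```

In the trace, the failing `Cipher.doFinal` call sits under `JceKeyCipher.decryptKey` and `JceMasterKey.decryptDataKey`, so the data that failed verification there is the wrapped data key rather than the credential itself. Modified ciphertext or mismatched associated data would produce the same exception.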

Thanks for the update! Would you be willing to take that up? (adding logs)

pyek-bot avatar Feb 12 '25 18:02 pyek-bot

> Thanks for the update! Would you be willing to take that up? (adding logs)

Sure, I will do two things. First, add some logs to record the response body. Second, keep trying locally to see whether I can reproduce the error.

Hailong-am avatar Feb 13 '25 06:02 Hailong-am
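Editorial sketch of the logging idea above: a polling helper can keep the body returned by each attempt and put all of them into the failure message, so a failed build log shows what the API actually returned. `PollWithDiagnostics` and `waitUntil` are hypothetical names, not the ml-commons `waitResponseMeetingCondition` implementation, and the JSON strings in `main` are stand-ins rather than captured responses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical polling helper: records every response body it sees so the
// final failure message explains why the condition was never met.
final class PollWithDiagnostics {

    static String waitUntil(Supplier<String> fetchBody, Predicate<String> condition,
                            int maxAttempts, long sleepMillis) throws InterruptedException {
        List<String> seen = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String body = fetchBody.get();
            if (condition.test(body)) {
                return body;
            }
            seen.add("attempt " + attempt + ": " + body);
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis); // wait before the next poll
            }
        }
        throw new AssertionError("Condition not met after " + maxAttempts
            + " attempts. Response bodies seen:\n" + String.join("\n", seen));
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            // Stand-in for a GET on the model API that never reports the expected state.
            waitUntil(() -> "{\"model_state\":\"DEPLOYED\"}",
                body -> body.contains("UNDEPLOYED"), 3, 10L);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```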

Thank you!

pyek-bot avatar Feb 16 '25 21:02 pyek-bot

@Hailong-am Do you have any update on this? Can I assign this issue to you?

dhrubo-os avatar Jul 01 '25 17:07 dhrubo-os

> @Hailong-am Do you have any update on this? Can I assign this issue to you?

The logs have been added. Do we still face this issue? If not, we can close it and open a new one with the latest GitHub Actions run logs.

Hailong-am avatar Jul 02 '25 01:07 Hailong-am

Closing it; please reopen if we still face the issue.

Hailong-am avatar Jul 11 '25 11:07 Hailong-am