ml-commons
[BUG] Race Condition when running multiple integ tests invoking EncryptorImpl.encrypt early
What is the bug?
What we have is a race condition:
- When the first test starts it creates a (local) ML Config index (which isn't needed)
- When the first test finishes, it issues a REST call in the @After method of the superclass to delete the (local) index. This is answered by the primary shard (giving a successful request), but the cluster state isn't instantly updated.
- The second test starts, checks the cluster state, and thinks the index is there, but then it has to check the index mapping, and by that time the index has been deleted, which results in the error.
The error probably exists in all the other ML Commons integ tests, but the unique part about our tests is that several start out by immediately creating a connector and calling encrypt(). Let me try throwing a refresh call at the start of every test method to ensure the cluster state is caught up and see if it fixes...
Originally posted by @dbwiddis in https://github.com/opensearch-project/ml-commons/pull/2818#discussion_r1744345431
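The sequence described above can be sketched with a toy model. This is not OpenSearch code: the class ClusterStateLagDemo and all of its methods are hypothetical, and the two sets only stand in for "what actually exists" versus "what the lagging cluster state reports".

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (hypothetical, not OpenSearch code): "actual" is what really exists,
// "publishedState" is the lagging view that a test consults first.
class ClusterStateLagDemo {
    private final Set<String> actual = new HashSet<>();
    private final Set<String> publishedState = new HashSet<>();

    void createIndex(String name) {
        actual.add(name);
        publishedState.add(name);
    }

    // The delete is acknowledged immediately, but the published state is not updated yet.
    boolean deleteIndex(String name) {
        return actual.remove(name);
    }

    // Models the cluster state catching up with the deletion.
    void publish() {
        publishedState.retainAll(actual);
    }

    boolean stateSaysIndexExists(String name) {
        return publishedState.contains(name);
    }

    // Stands in for the follow-up mapping check made by the second test.
    String getMapping(String name) {
        if (!actual.contains(name)) {
            throw new IllegalStateException("index not found: " + name);
        }
        return "{}";
    }

    // Returns what the second test would observe when it starts.
    static String secondTestOutcome(boolean waitForPublish) {
        ClusterStateLagDemo cluster = new ClusterStateLagDemo();
        String index = ".plugins-ml-config";
        cluster.createIndex(index);  // first test creates the index
        cluster.deleteIndex(index);  // @After cleanup, acknowledged right away
        if (waitForPublish) {
            cluster.publish();       // cluster state catches up before the next test
        }
        if (!cluster.stateSaysIndexExists(index)) {
            return "index absent, create it fresh";
        }
        try {
            cluster.getMapping(index);
            return "index present";
        } catch (IllegalStateException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(secondTestOutcome(false));
        System.out.println(secondTestOutcome(true));
    }
}
```

In this model the second test only fails when it starts before publish() has run, which mirrors the window between the acknowledged delete and the cluster state update.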
How can one reproduce the bug? Steps to reproduce the behavior:
- Write integration tests which extend MLCommonsRestTestCase.
- Have multiple tests, as their first step, create a connector (or anything else that invokes EncryptorImpl.encrypt()).
- Execute the tests filtered to ensure that tests creating the .plugins-ml-config index execute sequentially. Ideally, have the cluster under high CPU load so cluster updates are slower.
- Observe a ResourceNotFoundException similar to this:
2024-09-03 12:57:47 [2024-09-03T19:57:47,678][ERROR][o.o.m.a.c.TransportCreateConnectorAction] [790661ea3297] Failed to create connector Cohere Chat Model
2024-09-03 12:57:47 org.opensearch.ResourceNotFoundException: The ML encryption master key has not been initialized yet. Please retry after waiting for 10 seconds.
2024-09-03 12:57:47 at org.opensearch.ml.engine.encryptor.EncryptorImpl.checkMasterKeyInitialization(EncryptorImpl.java:387) ~[opensearch-ml-algorithms-multi-tenancy-2.16.0-SNAPSHOT.jar:?]
2024-09-03 12:57:47 at org.opensearch.ml.engine.encryptor.EncryptorImpl.initMasterKey(EncryptorImpl.java:134) ~[opensearch-ml-algorithms-multi-tenancy-2.16.0-SNAPSHOT.jar:?]
What is the expected behavior?
Tests start with no existing indices (except for security).
What is your host/environment?
- OS: Reproduced locally on macOS but also occurs on CI tests
Do you have any additional context?
The root of the problem is this method:
@SuppressWarnings("unchecked")
@After
protected void wipeAllODFEIndices() throws IOException {
    Response response = adminClient().performRequest(new Request("GET", "/_cat/indices?format=json&expand_wildcards=all"));
    // ... other code ...
        for (Map<String, Object> index : parserList) {
            String indexName = (String) index.get("index");
            if (indexName != null && !".opendistro_security".equals(indexName)) {
                adminClient().performRequest(new Request("DELETE", "/" + indexName));
            }
        }
    }
}
The performRequest() call doesn't wait for the cluster state to be updated before returning, so the test case finishes and the next one is allowed to start while the cluster state is still processing the deletions.
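One possible mitigation is to block in the cleanup until the deletions are actually visible. The following is a minimal sketch of a generic polling helper using only the JDK. The names WaitForIndexDeletion and waitUntil are hypothetical, and the condition passed in is an assumption: in a real test it would be some check that the index is gone (for example, a request against the index that reports it as missing), which is not shown here.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of ML Commons or the OpenSearch test framework.
class WaitForIndexDeletion {

    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the index is gone": becomes true on the third poll.
        int[] polls = {0};
        boolean gone = waitUntil(() -> ++polls[0] >= 3, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("index gone: " + gone + " after " + polls[0] + " polls");
    }
}
```

Calling such a helper after each DELETE in the @After method would keep the next test from starting while the cluster state is still catching up, at the cost of a slightly slower teardown.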