ml-commons
[BUG] Multiple calls to the model `deploy` API cause an exception from the Memory Circuit Breaker
What is the bug?
When uploading a model with the `_upload` API, the system returns the following response:
Error response for model upload: Memory Circuit Breaker is open, please check your resources!
How can one reproduce the bug? Steps to reproduce the behavior:
- Run `./gradlew integTest` in neural-search on Java 21
- Wait for the tests to complete.
To increase the chance of the error, change the JVM max heap to 1 GB here. This setting is the same one the infra build team uses for the distribution pipeline run.
The exact tests that fail are random, but the error happens on every execution of the test command; it's always 2 to 6 failing tests.
What is the expected behavior? No circuit breaker error.
What is your host/environment?
- JDK: 21; with lower versions everything works
- Version 2.14 (2.x) and main
Do you have any additional context? We upload the model from the ml-commons repo using the following request payload: https://github.com/opensearch-project/neural-search/blob/main/src/test/resources/processor/UploadModelRequestBody.json
We use the following sequence for model upload:
- create a model group
- upload the model, wait for the task to complete, and get the model ID. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L146
- deploy the model by model ID. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L175
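Sketched as REST calls, that upload sequence looks roughly like this (the group name, body contents, and IDs are illustrative placeholders, not the exact values used in the tests):

```
POST /_plugins/_ml/model_groups/_register
{ "name": "test_model_group" }

POST /_plugins/_ml/models/_upload
{ ...request body from UploadModelRequestBody.json... }

GET /_plugins/_ml/tasks/<task_id>              <- poll until the task reaches COMPLETED, read model_id

POST /_plugins/_ml/models/<model_id>/_deploy
```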
We use the following sequence of calls to delete resources:
- undeploy, poll for terminal state
- delete the model. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L916
There is no call to delete the model group, as it should be deleted when the last associated model is deleted.
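The cleanup sequence, sketched the same way (IDs are placeholders):

```
POST /_plugins/_ml/models/<model_id>/_undeploy

GET /_plugins/_ml/models/<model_id>            <- poll until the model state is a terminal (undeployed) state

DELETE /_plugins/_ml/models/<model_id>
```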
This is somehow related to https://github.com/opensearch-project/ml-commons/issues/1896; back then we lowered the chance of test failures by increasing the max heap size to 4 GB. For 2.14 that's not an option, per this global issue https://github.com/opensearch-project/neural-search/issues/667
I'm not 100% sure this is an ml-commons bug. It seems memory usage in the cluster is still very high. Maybe you can try setting this setting to 100?
We do have this set to 100 for neural-search: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L116. Let me try different values for the other JVM heap setting, `plugins.ml_commons.jvm_heap_memory_threshold`.
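For reference, this threshold can be adjusted at runtime via the cluster settings API; the value shown is illustrative, not a recommendation:

```
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.jvm_heap_memory_threshold": 95
  }
}
```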
I found that 95% is the most effective value for `plugins.ml_commons.jvm_heap_memory_threshold`, but it doesn't prevent tests from failing: instead of 4-6 failing tests, with some other optimizations it's now 1-2. Overall, I think with this setting we're not solving the problem but delaying its manifestation.
I believe the problem is related to the fact that after several load/unload cycles, unreleased memory is held by the PyTorch runtime library, which is used as a black box in DJL.
The most common use case for PyTorch is hosting a model server, where performance is the No. 1 priority, so it's designed to consume a lot of memory even after a model is unloaded. Our use case is special, which is why we don't recommend using pre-trained or local models in a production environment.
For this integration test problem, can you reduce the number of load/unload cycles in your tests? In other words, is it possible to finish all the necessary tests within a single model lifecycle? Also, can you try using a smaller model in the IT?
I think we're already using the small model from the ml-commons repo.
I'll push a PR for test refactoring: I'll check our tests, remove unnecessary model uploads, and merge a few small test methods into larger ones to reuse a single model.
@Zhangxunmt My team has a hypothesis that the memory CB does not calculate used memory properly; in particular, mmapped files are also counted. That causes leak-like behavior where, over time, after multiple undeployments, the amount of memory being counted grows beyond the memory actually in use.
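To make the hypothesis concrete: a memory circuit breaker of this kind is usually a simple percentage-threshold check over some sampled memory metric. A minimal, purely illustrative sketch (not the actual ml-commons implementation; class and method names are invented) shows why the choice of metric matters: if the breaker samples pure JVM heap, mmapped model files are excluded, but if it samples an OS-level figure such as RSS, mmapped pages get counted and can accumulate across undeployments as described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative sketch of a heap-threshold circuit breaker.
// NOT the real ml-commons code; names and structure are hypothetical.
public class HeapBreakerSketch {
    private final int thresholdPercent; // e.g. 95 means "trip at 95% usage"

    public HeapBreakerSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    // Pure threshold arithmetic, separated out so it is deterministic to test.
    public boolean wouldTrip(long usedBytes, long maxBytes) {
        return usedBytes * 100 >= (long) thresholdPercent * maxBytes;
    }

    // Samples current JVM heap usage. A breaker sampling RSS instead would
    // also count memory-mapped model files, matching the leak-like behavior
    // hypothesized above.
    public boolean isOpen() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return wouldTrip(heap.getUsed(), heap.getMax());
    }
}
```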
I've verified this by following experiment:
- in the neural-search plugin, set the memory CB threshold to 100% and run the tests; they failed
- disable the memory CB. For that, build the OpenSearch min distribution using the ml-commons 2.13 branch with this commit
- run the same workload.
Step 1 confirms the issue. Step 3 shows that even with a 100% threshold the CB doesn't count memory usage correctly.
To repro the issue, I set up https://github.com/opensearch-project/opensearch-build/ locally and pointed it to my custom branch of ml-commons.
I suggest that ml-commons either add an option or setting to disable the memory CB completely, or skip the CB check when the threshold is set to >= 100%.
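The ">= 100% means disabled" suggestion amounts to a short-circuit in front of the threshold check. A hedged sketch of that guard (illustrative only; the names are invented, not ml-commons code):

```java
// Hypothetical guard: a threshold configured at or above 100% disables the
// breaker entirely, so it can never trip regardless of measured memory.
public class ThresholdGuard {
    public static boolean shouldTrip(long usedBytes, long maxBytes, int thresholdPercent) {
        if (thresholdPercent >= 100) {
            return false; // breaker disabled by configuration
        }
        return usedBytes * 100 >= (long) thresholdPercent * maxBytes;
    }
}
```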
The memory CB is now disabled when the heap threshold == 100. Resolving this issue.