ml-commons
[BUG] Multiple calls to the model `deploy` API cause an exception from the Memory Circuit Breaker
What is the bug?
When uploading a model with the `_upload` API, the system returns the following response:
Error response for model upload: Memory Circuit Breaker is open, please check your resources!
How can one reproduce the bug? Steps to reproduce the behavior:
- Run `./gradlew integTest` in neural-search on Java 21
- Wait for the tests to complete.
To increase the chance of the error, change the JVM max heap to 1 GB here. This setting is the same one the infra build team uses for the distribution pipeline run.
The exact tests that fail are random, but the error happens on every execution of the test command; it's always 2 to 6 failing tests.
What is the expected behavior? No circuit breaker error.
What is your host/environment?
- JDK: 21; with lower versions everything works
- Version 2.14 (2.x) and main
Do you have any additional context? We upload the model from the ml-commons repo using the following request payload: https://github.com/opensearch-project/neural-search/blob/main/src/test/resources/processor/UploadModelRequestBody.json
We use the following sequence for model upload:
- create a model group
- upload the model, wait for the task to complete, and get the model ID. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L146
- deploy the model by model ID. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L175
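Sketched as REST calls, that upload sequence looks roughly like this (the group name, body contents, and IDs are illustrative placeholders, not the exact values used in the tests):

```
POST /_plugins/_ml/model_groups/_register
{ "name": "test_model_group" }

POST /_plugins/_ml/models/_upload
{ ...request body from UploadModelRequestBody.json... }

GET /_plugins/_ml/tasks/<task_id>              <- poll until the task reaches COMPLETED, read model_id

POST /_plugins/_ml/models/<model_id>/_deploy
```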
We use the following sequence of calls to delete resources:
- undeploy, poll for terminal state
- delete the model. Code ref: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L916
There is no call to delete the model group, as it should be deleted when the last associated model is deleted.
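The cleanup sequence, sketched the same way (IDs are placeholders):

```
POST /_plugins/_ml/models/<model_id>/_undeploy

GET /_plugins/_ml/models/<model_id>            <- poll until the model state is a terminal (undeployed) state

DELETE /_plugins/_ml/models/<model_id>
```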
This is somehow related to https://github.com/opensearch-project/ml-commons/issues/1896; back then we lowered the chance of test failures by increasing the max heap size to 4 GB. For 2.14 that's not an option, per this global issue https://github.com/opensearch-project/neural-search/issues/667
I'm not 100% sure this is an ml-commons bug. It seems memory usage in the cluster is still very high. Maybe you can try setting this setting to 100?
We do have this set to 100 for neural-search: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L116. Let me try different values for the other JVM heap setting, `plugins.ml_commons.jvm_heap_memory_threshold`.
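For reference, this threshold can be adjusted at runtime via the cluster settings API; the value shown is illustrative, not a recommendation:

```
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.jvm_heap_memory_threshold": 95
  }
}
```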
I found that 95% is the most effective value for `plugins.ml_commons.jvm_heap_memory_threshold`, but it doesn't prevent tests from failing: instead of 4-6 failing tests, with some other optimizations it's now 1-2. Overall, I think with this setting we're not solving the problem but delaying its manifestation.
I believe the problem is related to the fact that after several load/unload cycles, unreleased memory is held by the PyTorch runtime library, which is used as a black box in DJL.
The most common use case for PyTorch is hosting a model server, where performance is the No. 1 priority, so it's designed to consume a lot of memory even after a model is unloaded. Our use case is special, which is why we don't recommend using pre-trained or local models in a production environment.
For this integration test problem, can you reduce the number of load/unload cycles in your tests? In other words, is it possible to finish all the necessary tests within a single model lifecycle? Also, can you try using a smaller model in the IT?
I think we're already using the small model from the ml-commons repo.
I'll push a PR for test refactoring: I'll check our tests, remove unnecessary model uploads, and merge a few small test methods into larger ones to reuse a single model.
@Zhangxunmt My team has a hypothesis that the memory CB does not calculate used memory properly; in particular, mmapped files are also counted. That causes leak-like behavior where, over time, after multiple undeployments, the amount of memory being counted grows beyond the memory actually in use.
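To make the hypothesis concrete: a memory circuit breaker of this kind is usually a simple percentage-threshold check over some sampled memory metric. A minimal, purely illustrative sketch (not the actual ml-commons implementation; class and method names are invented) shows why the choice of metric matters: if the breaker samples pure JVM heap, mmapped model files are excluded, but if it samples an OS-level figure such as RSS, mmapped pages get counted and can accumulate across undeployments as described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative sketch of a heap-threshold circuit breaker.
// NOT the real ml-commons code; names and structure are hypothetical.
public class HeapBreakerSketch {
    private final int thresholdPercent; // e.g. 95 means "trip at 95% usage"

    public HeapBreakerSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    // Pure threshold arithmetic, separated out so it is deterministic to test.
    public boolean wouldTrip(long usedBytes, long maxBytes) {
        return usedBytes * 100 >= (long) thresholdPercent * maxBytes;
    }

    // Samples current JVM heap usage. A breaker sampling RSS instead would
    // also count memory-mapped model files, matching the leak-like behavior
    // hypothesized above.
    public boolean isOpen() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return wouldTrip(heap.getUsed(), heap.getMax());
    }
}
```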
I've verified this by following experiment:
- in the neural-search plugin, set the memory CB threshold to 100% and run the tests; they failed
- disable the memory CB. For that, build the OpenSearch min distribution using the ml-commons 2.13 branch with this commit
- run the same workload.
Step 1 confirms the issue. Step 3 shows that even with a 100% threshold the CB doesn't count memory usage correctly.
To repro the issue, I set up https://github.com/opensearch-project/opensearch-build/ locally and pointed it to my custom branch of ml-commons.
I suggest that ml-commons either add an option or setting to disable the memory CB completely, or skip the CB check when the threshold is set to >= 100%.
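The ">= 100% means disabled" suggestion amounts to a short-circuit in front of the threshold check. A hedged sketch of that guard (illustrative only; the names are invented, not ml-commons code):

```java
// Hypothetical guard: a threshold configured at or above 100% disables the
// breaker entirely, so it can never trip regardless of measured memory.
public class ThresholdGuard {
    public static boolean shouldTrip(long usedBytes, long maxBytes, int thresholdPercent) {
        if (thresholdPercent >= 100) {
            return false; // breaker disabled by configuration
        }
        return usedBytes * 100 >= (long) thresholdPercent * maxBytes;
    }
}
```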
The memory CB is now disabled when the heap threshold == 100. Resolving this issue.