[BUG] Failing integ test due to model is not deployed due to open memory circuit breaker

Open martin-gaievski opened this issue 1 year ago • 11 comments

What is the bug?

Tests are failing in the distribution pipeline for 2.12. About 6-8 tests fail per run, and the exact set of failing tests differs each time. Example trace from the test runner: https://build.ci.opensearch.org/blue/organizations/jenkins/integ-test/detail/integ-test/7696/pipeline/102

Test run results look something like this:

Suite: Test class org.opensearch.neuralsearch.processor.ScoreNormalizationIT

  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.neuralsearch.processor.ScoreNormalizationIT.testL2Norm_whenOneShardAndQueryMatches_thenSuccessful" -Dtests.seed=3DD97DF7CBE58104 -Dtests.security.manager=false -Dtests.locale=no-NO -Dtests.timezone=Europe/London -Druntime.java=21

  2> org.opensearch.client.ResponseException: method [DELETE], host [http://localhost:9200/], URI [/_search/pipeline/phase-results-normalization-pipeline], status line [HTTP/1.1 404 Not Found]

    {"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"pipeline [phase-results-normalization-pipeline] is missing"}],"type":"resource_not_found_exception","reason":"pipeline [phase-results-normalization-pipeline] is missing"},"status":404}
        at __randomizedtesting.SeedInfo.seed([3DD97DF7CBE58104:430A5A148B39B236]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:735)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:708)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.deleteSearchPipeline(BaseNeuralSearchIT.java:881)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.wipeOfTestResources(BaseNeuralSearchIT.java:1057)
        at app//org.opensearch.neuralsearch.processor.ScoreNormalizationIT.testL2Norm_whenOneShardAndQueryMatches_thenSuccessful(ScoreNormalizationIT.java:156)
  2> NOTE: leaving temporary files on disk at: /tmp/tmpckmb2w5s/neural-search/build/testrun/integTest/temp/org.opensearch.neuralsearch.processor.ScoreNormalizationIT_3DD97DF7CBE58104-001
  2> NOTE: test params are: codec=Asserting(Lucene99): {}, docValues:{}, maxPointsInLeafNode=2005, maxMBSortInHeap=6.984387798453897, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=no-NO, timezone=Europe/London
  2> NOTE: Linux 6.1.49-70.116.amzn2023.x86_64 amd64/Eclipse Adoptium 21.0.1 (64-bit)/cpus=16,threads=3,free=283003880,total=536870912
  2> NOTE: All tests run in this JVM: [NormalizationProcessorIT, HybridQueryIT, NeuralQueryIT, NeuralSparseQueryIT, NeuralSearchIT, ValidateDependentPluginInstallationIT, NeuralQueryEnricherProcessorIT, ScoreCombinationIT, ScoreNormalizationIT]

42 tests completed, 10 failed

How can one reproduce the bug?

It happens only in the distribution pipeline; tests pass in plugin CI and locally. In local runs and plugin CI the memory settings are higher because they are overridden at the plugin level:

-Xms1g -Xmx4g

https://github.com/opensearch-project/neural-search/blob/main/build.gradle#L388C14-L388C31

What is your host/environment?

The issue is in 2.12; it should be the same in 2.x and main.

Do you have any additional context?

Example of a server log from test cluster from my local copy of infra build tool:

stdout.txt

The log shows an error that corresponds to a failed test: the memory circuit breaker (CB) from ml-commons opens, then JVM GC kicks in and frees some memory. The next few tests then succeed, and the cycle repeats.

The cluster is started with -Xms1g -Xmx1g, ignoring the plugin settings. At the time of writing there is no way to change those settings in a test cluster for the distribution.

It is probably possible to check the CB state before deploying a model from the test, or to try deploying it and, if an exception occurs because the CB is open, wait and retry.
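To make the "wait and retry" idea concrete, here is a minimal sketch of a retry helper. It is only an illustration under assumptions: the class `DeployRetry`, both method names, and the rule that an open breaker shows up as an exception whose message contains "circuit breaker" are all hypothetical and not taken from neural-search or ml-commons. The real check would need to match whatever the deploy call actually returns.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an action while it fails because a circuit breaker is open. */
public final class DeployRetry {

    private DeployRetry() {}

    /**
     * Assumption: an open breaker surfaces as an exception (or a cause)
     * whose message mentions "circuit breaker". Adjust to the real error.
     */
    public static boolean isCircuitBreakerOpen(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.toLowerCase(Locale.ROOT).contains("circuit breaker")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls the action up to maxAttempts times, sleeping waitMillis between
     * attempts. Only open-breaker failures are retried; any other exception
     * is rethrown immediately. If every attempt fails, the last open-breaker
     * exception is rethrown.
     */
    public static <T> T retryWhileBreakerOpen(Callable<T> action, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isCircuitBreakerOpen(e)) {
                    throw e; // unrelated failure, do not hide it behind retries
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis); // give GC time to free heap before retrying
                }
            }
        }
        throw last;
    }
}
```

A test could wrap its deploy step as `DeployRetry.retryWhileBreakerOpen(() -> deployModel(modelId), 5, 5_000L)`, where `deployModel` stands in for whatever the test base class uses to deploy a model. A fixed wait keeps the sketch short; a growing backoff might suit a 1g heap better, since the time until GC frees memory is not predictable.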

[2024-02-06T19:39:52,891][INFO ][o.o.n.Node               ] [node_name_9200] node name [node_name_9200], node ID [OAYF21-xRbu9dc9StXHecg], cluster name [opensearchcluster1], roles [ingest, remote_cluster_client, data, cluster_manager]
[2024-02-06T19:39:54,748][INFO ][o.o.n.p.NeuralSearch     ] [node_name_9200] Registering hybrid query phase searcher with feature flag [plugins.neural_search.hybrid_search_disabled]
[2024-02-06T19:39:55,203][INFO ][o.o.m.b.MLCircuitBreakerService] [node_name_9200] Registered ML memory breaker.
[2024-02-06T19:39:55,203][INFO ][o.o.m.b.MLCircuitBreakerService] [node_name_9200] Registered ML disk breaker.
[2024-02-06T19:39:55,204][INFO ][o.o.m.b.MLCircuitBreakerService] [node_name_9200] Registered ML native memory breaker.
[2024-02-06T19:39:55,290][INFO ][o.r.Reflections          ] [node_name_9200] Reflections took 42 ms to scan 1 urls, producing 21 keys and 58 values 
[2024-02-06T19:39:55,511][INFO ][o.o.t.NettyAllocator     ] [node_name_9200] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=1gb}]
[2024-02-06T19:39:55,593][INFO ][o.o.d.DiscoveryModule    ] [node_name_9200] using discovery type [zen] and seed hosts providers [settings]
[2024-02-06T19:39:55,886][WARN ][o.o.g.DanglingIndicesState] [node_name_9200] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[2024-02-06T19:39:56,192][INFO ][o.o.n.Node               ] [node_name_9200] initialized
[2024-02-06T19:39:56,192][INFO ][o.o.n.Node               ] [node_name_9200] starting ...
[2024-02-06T19:39:56,398][INFO ][o.o.t.TransportService   ] [node_name_9200] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2024-02-06T19:39:56,400][INFO ][o.o.t.TransportService   ] [node_name_9200] Remote clusters initialized successfully.
[2024-02-06T19:39:56,532][WARN ][o.o.b.BootstrapChecks    ] [node_name_9200] max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

martin-gaievski, Feb 07 '24 20:02