[BUG] Cortex is unresponsive if too much jobs

Open azgaviperr opened this issue 3 years ago • 22 comments

Request Type

Bug

Work Environment

Question	Answer
OS version (server)	Docker SWARM
OS version (client)	Viperr,
Virtualized Env.	True
Dedicated RAM	16 GB
vCPU	8
Cortex version / git hash	3.1.1
Package Type	RPM, DEB, Docker, Binary, From source
Index type	Elasticsearch
Attachments storage	Local (GlusterFS)
Browser type & version	Firefox and Chrome

Problem Description

When sending a large number of artifacts to Cortex to be analyzed by a few analyzers, Cortex becomes unresponsive. The front page is blank while still answering with code 200, and it is impossible to access or communicate with Cortex through the API. Once all the jobs, which keep running, have finished, the service is available again.

The issue is that reports are not sent back to TheHive; you need to rerun the analyzer, and the result is then returned directly (cached result).

Steps to Reproduce

Add some artifacts
Run them against a large number of analyzers
Observe the unresponsiveness

Possible Solutions

I added this to application.conf; it helped in some cases but not all.


akka {
  log-config-on-start = on

  actor {
    default-dispatcher {
      fork-join-executor {
        parallelism-max = 16
      }
      thread-pool-executor {
        fixed-pool-size = 16
      }
      throughput = 1
    }
    default-blocking-io-dispatcher {
      fork-join-executor {
        parallelism-max = 32
      }
      thread-pool-executor {
        fixed-pool-size = 32
      }
      throughput = 1
    }
  }
}

Complementary information

(add anything that can help identifying the problem such as log excerpts, screenshots, configuration dumps etc.)

Jun 28 '21 12:06 azgaviperr

I'm facing the same issue. I tried the possible solution, but with more than 10 analyzers or a lot of artifacts, Cortex becomes unresponsive. The same thing described by @azgaviperr also happens to me in TheHive.

Jun 28 '21 12:06 D4rkw0lv3s

Duplicate of https://github.com/TheHive-Project/Cortex/issues/364 as far as I can tell, see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321 for a possible root cause.

Jun 28 '21 12:06 mback2k

We observe the same type of event. It happens with, e.g., 60-90 total jobs spread over a handful of analyzers.

In the application.log we see the logs below for the blank index page

2021-06-28 14:49:52,920 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-377 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database
2021-06-28 14:49:52,920 [INFO] from org.thp.cortex.services.ErrorHandler in application-akka.actor.default-dispatcher-377 - GET /api/job/cgGkUnoBTzYZDjIcEFjz/waitreport?atMost=1%20second returned 401
org.elastic4play.AuthenticationError: Authentication failure
        at org.elastic4play.controllers.Authenticated.$anonfun$getContext$4(Authenticated.scala:272)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
2021-06-28 14:49:53,931 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-360 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database

Edit:

And for the Elasticsearch instance that is running on the same machine we get the following error that might be of interest in this case:

[2021-06-28T14:47:07,236][WARN ][r.suppressed             ] [cortex01] path: /cortex_6/_search, params: {scroll=60000ms, index=cortex_6}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:601) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:332) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:636) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:415) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:59) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:264) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:62) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:404) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:743) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1397) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1371) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.11.1.jar:7.11.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
        at org.elasticsearch.search.SearchService.createAndPutReaderContext(SearchService.java:643) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:627) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:420) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.access$500(SearchService.java:135) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.11.1.jar:7.11.1]
        ... 6 more

Jun 28 '21 12:06 danniranderis

Hello guys,

Can any of you who have the issue give us some insight into the number of observables, jobs, and analyzers, and which analyzers are involved?

The error is clear:

org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.

So there is a limit that is reached here, and we need to know which one it is
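
To see how close a cluster is to that limit, the number of open scroll contexts can be compared against the configured maximum. The following is a minimal sketch, assuming Elasticsearch 7.x reachable at a placeholder URL without authentication; the helper names are hypothetical, while `search.max_open_scroll_context` is the setting named in the error and `scroll_current` is the per-node counter in the node stats response. Raising the limit only buys headroom and does not explain why scroll contexts pile up.

```python
import json
import urllib.request
from typing import Optional

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def total_open_scrolls(node_stats: dict) -> int:
    """Sum `scroll_current` over all nodes of a
    GET /_nodes/stats/indices/search response."""
    return sum(
        node["indices"]["search"]["scroll_current"]
        for node in node_stats.get("nodes", {}).values()
    )


def scroll_limit_settings(limit: int) -> dict:
    """Body for PUT /_cluster/settings that changes the scroll context limit."""
    return {"persistent": {"search.max_open_scroll_context": limit}}


def es_request(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request to Elasticsearch and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a live cluster, `total_open_scrolls(es_request("GET", "/_nodes/stats/indices/search"))` gives the current count, and `es_request("PUT", "/_cluster/settings", scroll_limit_settings(1000))` would raise the limit; the value 1000 is only an example.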

Jul 08 '21 10:07 nadouani

Sure,

It happens when running one observable through 16+ different analyzers or two observables with 10 different analyzers each.

Analyzers:

Crt_sh_Transparency_Logs_1_0
CyberCrime-Tracker_1_0
Cyberprotect_ThreatScore_1_0
DomainMailSPFDMARC_Analyzer_1_1
Fortiguard_URLCategory_2_1
GoogleDNS_resolve_1_0_0
GoogleSafebrowsing_2_0
MISP_2_1
Maltiverse_Report_1_0
Mnemonic_pDNS_Public_3_0
OTXQuery_2_0
PassiveTotal_Components_2_0
PassiveTotal_Malware_2_0
PassiveTotal_Osint_2_0
PassiveTotal_Trackers_2_0
PassiveTotal_Whois_Details_2_0
Pulsedive_GetIndicator_1_0
SinkDB_1_1
SpamhausDBL_1_0
Threatcrowd_1_0
URLhaus_2_0
Urlscan_io_Scan_0_1_0
Urlscan_io_Search_0_1_1
PassiveTotal_Enrichment_2_0

In real production I will not use that many, but if I have half of them configured and run more than 2 observables at the same time, Cortex can't handle it.

Jul 08 '21 13:07 D4rkw0lv3s

@nadouani I don't think the error is that clear, please also see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321.

Jul 08 '21 15:07 mback2k

It's not so simple: sometimes it happens when running about 10 observables on one analyzer, and sometimes that runs okay. Most often it happens when running multiple (3) observables on multiple (10+) analyzers. And when analyzers fail with an error (like MISP not being reachable), this seems to hit Cortex harder.

This can make Cortex unresponsive with only one IP address selected.

Aug 04 '21 07:08 azgaviperr

I think it is mostly related to saving artifacts returned by the analyzers into ES. That seems to fill up the connections to ES and make Cortex stuck. This can happen already with just a single analyzer running/finishing.

Aug 04 '21 16:08 mback2k

My ES is also used for the TheHive index, and while Cortex is unavailable, TheHive continues to work correctly. This seems to be an issue on the Cortex side, maybe bad queuing of HTTP requests.

I also had the issue today with 1 observable run against the MISP analyzer.

Aug 05 '21 12:08 azgaviperr

Yes, by filling up the connections to ES I mean exactly the Cortex HTTP connection pool and not the ES side. Our ES cluster is pretty big and does not show any signs of an issue while Cortex is stuck. Also see the issue I linked above.

Aug 05 '21 17:08 mback2k

I was finally able to work around this issue by modifying cortexutils to not return any artifacts, so that trying to store them in ES no longer fills up all the connections and threads. This is the change I made in /lib/python3.6/site-packages/cortexutils/analyzer.py, and now our Cortex is stable again:

    def report(self, full_report, ensure_ascii=False):
        """Returns a json dict via stdout.

        :param full_report: Analyzer results as dict.
        :param ensure_ascii: Force ascii output. Default: False"""

        summary = {}
        try:
            summary = self.summary(full_report)
        except Exception:
            pass

        super(Analyzer, self).report({
            'success': True,
            'summary': summary,
            'artifacts': [], #self.artifacts(full_report), # WORKAROUND HERE!
            'full': full_report
        }, ensure_ascii)

Sep 03 '21 09:09 mback2k

@mback2k Is there any possible impact from this change, other than making it work? Maybe when you need to import observables generated by analyzers?

Sep 03 '21 16:09 azgaviperr

Of course the artifacts won't be saved anymore, but this is a trade off I am willing to make currently.

Sep 03 '21 17:09 mback2k

Thank you guys for your comments. I understand this is a blocker.

From @mback2k's comments, the issue could be saving artifacts discovered by the jobs. Basically, @mback2k, you don't need to change the cortexutils code, as extracting the artifacts is an option that you can simply disable per analyzer. If disabled, Cortex won't return any artifacts from the job. Could you confirm you have the option enabled?

Sep 04 '21 06:09 nadouani

@nadouani I will check this on Monday, but I think the configuration only allows adjusting the automatic extraction of artifacts. If an analyzer provides artifacts on its own, e.g. from a sandbox report, then the option won't have any effect.

Also, the main root cause is still that requests to ES are handled in a FIFO fashion by the asynchronous akka system. If an analyzer job finishes with hundreds of artifacts, saving these to ES blocks all other kinds of requests to ES, including user authentication. With at most 30 concurrent connections to an ES cluster (10 per host with a max. of 30 connections in the pool), this can take some time and quickly gets out of hand if a lot of jobs are being run.
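
The effect can be estimated with a bit of arithmetic. This is a rough sketch, not a measurement of Cortex: it assumes strictly FIFO handling, the pool of 30 connections mentioned above, a backlog that is queued all at once, and a uniform duration per ES request. The function name, the 0.5 second duration, and the backlog of 600 writes are hypothetical.

```python
def wait_behind_backlog(requests_ahead: int, pool_size: int,
                        seconds_per_request: float) -> float:
    """Estimate how long a newly queued request waits for a free connection.

    Assumes strict FIFO order and that every request ahead of it occupies
    one connection for the same amount of time.
    """
    # Requests are served in waves of `pool_size`; the new request can only
    # start once every full wave ahead of it has finished.
    full_waves_ahead = requests_ahead // pool_size
    return full_waves_ahead * seconds_per_request


# Hypothetical backlog: 3 finished jobs with 200 artifacts each.
backlog = 3 * 200
delay = wait_behind_backlog(backlog, pool_size=30, seconds_per_request=0.5)
print(f"An authentication request queued behind {backlog} writes waits ~{delay} s")
```

Under these assumptions, a login request queued behind 600 artifact writes waits about 10 seconds, and the delay grows linearly with the number of artifacts waiting to be saved.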

Sep 04 '21 10:09 mback2k

Yes, I now understand what your conclusion is. We will figure out how and when to fix that ;)

Sep 04 '21 10:09 nadouani

Thanks a lot! I would propose introducing some kind of prioritization for the requests to ES. ES requests that are part of a browser/API request should have a higher priority than background ES requests (like saving results of finished jobs). The latter should probably be done in a non-blocking background fashion anyway, i.e. background requests shouldn't be in the way of foreground requests. Just my two cents. ;-)
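
The prioritization idea can be illustrated with a small queue. This is only a language-neutral sketch in Python, not how Cortex or akka implement request handling; the class name, the priority constants, and the request strings are all hypothetical.

```python
import heapq
import itertools

FOREGROUND = 0  # browser/API requests, e.g. user authentication
BACKGROUND = 1  # e.g. saving results of finished jobs


class PrioritizedRequestQueue:
    """Queue that always serves foreground requests before background ones."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among requests of equal priority.
        self._counter = itertools.count()

    def put(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self):
        _priority, _sequence, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)


queue = PrioritizedRequestQueue()
for i in range(3):
    queue.put(BACKGROUND, f"save artifact {i}")
queue.put(FOREGROUND, "authenticate user")  # arrives last

order = [queue.get() for _ in range(len(queue))]
print(order)
```

Even though the authentication request is queued last, it is served first, while the artifact writes keep their original order among themselves.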

Sep 04 '21 10:09 mback2k

@nadouani I just verified, we already had the global and per-analyzer setting like this: "auto_extract_artifacts":false, but this did not help with all analyzers as described above.

Sep 06 '21 12:09 mback2k

@nadouani @To-om any update on fixing this issue? :eyes:

Sep 20 '21 13:09 mback2k

Hello, still looking forward to a fix for this issue.

Oct 07 '21 09:10 azgaviperr

Yes, same here. @nadouani does StrangeBee provide paid support/development for issues like this? I would be interested.

Oct 15 '21 11:10 mback2k

Hello, any update on the matter?

Jun 21 '22 06:06 azgaviperr
