[BUG] Cortex is unresponsive if too much jobs

Open azgaviperr opened this issue 3 years ago • 22 comments

Request Type

Bug

Work Environment

| Question | Answer |
|---|---|
| OS version (server) | Docker SWARM |
| OS version (client) | Viperr |
| Virtualized Env. | True |
| Dedicated RAM | 16 GB |
| vCPU | 8 |
| Cortex version / git hash | 3.1.1 |
| Package Type | RPM, DEB, Docker, Binary, From source |
| Index type | Elasticsearch |
| Attachments storage | Local (GlusterFS) |
| Browser type & version | Firefox and Chrome |

Problem Description

When sending a large number of artifacts to Cortex to be analyzed by a few analyzers, Cortex becomes unresponsive. The front page is blank while still returning HTTP 200, and it is impossible to reach or communicate with the API. Once all the jobs that keep running have finished, the service is available again.

A further issue is that reports are not sent back to TheHive; you need to rerun the analyzer, and the result is then returned immediately (cached result).

Steps to Reproduce

  1. Add some artifacts
  2. Run them against a large number of analyzers
  3. Observe the unresponsiveness

Possible Solutions

I added this to application.conf; it helped in some cases but not all.


akka {
  log-config-on-start = on

  actor {
    default-dispatcher {
      fork-join-executor {
        parallelism-max = 16
      }
      thread-pool-executor {
        fixed-pool-size = 16
      }
      throughput = 1
    }
    default-blocking-io-dispatcher {
      fork-join-executor {
        parallelism-max = 32
      }
      thread-pool-executor {
        fixed-pool-size = 32
      }
      throughput = 1
    }
  }
}

Complementary information

(add anything that can help identifying the problem such as log excerpts, screenshots, configuration dumps etc.)

azgaviperr avatar Jun 28 '21 12:06 azgaviperr

I'm facing the same issue. I tried the possible solution, but with more than 10 analyzers or a lot of artifacts, Cortex becomes unresponsive. The same thing described by @azgaviperr also happens to me in TheHive.

D4rkw0lv3s avatar Jun 28 '21 12:06 D4rkw0lv3s

Duplicate of https://github.com/TheHive-Project/Cortex/issues/364 as far as I can tell, see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321 for a possible root cause.

mback2k avatar Jun 28 '21 12:06 mback2k

We observe the same behavior. It happens with, e.g., 60-90 total jobs spread over a handful of analyzers.

In application.log we see the following entries for the blank index page:

2021-06-28 14:49:52,920 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-377 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database
2021-06-28 14:49:52,920 [INFO] from org.thp.cortex.services.ErrorHandler in application-akka.actor.default-dispatcher-377 - GET /api/job/cgGkUnoBTzYZDjIcEFjz/waitreport?atMost=1%20second returned 401
org.elastic4play.AuthenticationError: Authentication failure
        at org.elastic4play.controllers.Authenticated.$anonfun$getContext$4(Authenticated.scala:272)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:93)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:93)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
2021-06-28 14:49:53,931 [ERROR] from org.elastic4play.controllers.Authenticated in application-akka.actor.default-dispatcher-360 - Authentication failure:
        session: AuthenticationError User session not found
        pki: AuthenticationError Certificate authentication is not configured
        key: AuthenticationError Authentication failure
        init: AuthenticationError Use of initial user is forbidden because users exist in database

Edit:

And for the Elasticsearch instance that is running on the same machine we get the following error that might be of interest in this case:

[2021-06-28T14:47:07,236][WARN ][r.suppressed             ] [cortex01] path: /cortex_6/_search, params: {scroll=60000ms, index=cortex_6}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:601) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:332) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:636) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:415) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:59) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:264) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:62) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:404) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:743) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1397) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1371) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.11.1.jar:7.11.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
        at org.elasticsearch.search.SearchService.createAndPutReaderContext(SearchService.java:643) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.createOrGetReaderContext(SearchService.java:627) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:420) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService.access$500(SearchService.java:135) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:395) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.11.1.jar:7.11.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.11.1.jar:7.11.1]
        ... 6 more

danniranderis avatar Jun 28 '21 12:06 danniranderis

Hello guys,

Could any of you who have the issue give us some insight into the number of observables, jobs, and analyzers involved, and which analyzers?

The error is clear:

org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.

So there is a limit that is reached here, and we need to know which one it is
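
To see how close a cluster is to that limit, and to raise it if needed, something along these lines could be used. This is only a sketch: `ES_URL` and the helper names are assumptions for illustration, the open-context count comes from the `indices.search.open_contexts` field of the node stats API (which counts open search contexts, scrolls included), and raising `search.max_open_scroll_context` only treats the symptom, not whatever is leaving the scrolls open.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def open_scroll_contexts(node_stats: dict) -> int:
    """Sum `open_contexts` over all nodes of a /_nodes/stats/indices/search response."""
    return sum(
        node["indices"]["search"]["open_contexts"]
        for node in node_stats.get("nodes", {}).values()
    )


def raise_scroll_limit_request(limit: int, base_url: str = ES_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT /_cluster/settings request raising the limit."""
    body = json.dumps({"persistent": {"search.max_open_scroll_context": limit}})
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def fetch_node_stats(base_url: str = ES_URL) -> dict:
    """Fetch search statistics for every node (performs a real HTTP call)."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/indices/search") as resp:
        return json.load(resp)
```

Calling `open_scroll_contexts(fetch_node_stats())` while Cortex is stuck would show whether the count really sits near 500; the request built by `raise_scroll_limit_request` can be sent with `urllib.request.urlopen` once you have decided on a value.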

nadouani avatar Jul 08 '21 10:07 nadouani

Sure,

It happens when running one observable through 16+ different analyzers, or two observables with 10 different analyzers each.

Analyzers:

  • Crt_sh_Transparency_Logs_1_0
  • CyberCrime-Tracker_1_0
  • Cyberprotect_ThreatScore_1_0
  • DomainMailSPFDMARC_Analyzer_1_1
  • Fortiguard_URLCategory_2_1
  • GoogleDNS_resolve_1_0_0
  • GoogleSafebrowsing_2_0
  • MISP_2_1
  • Maltiverse_Report_1_0
  • Mnemonic_pDNS_Public_3_0
  • OTXQuery_2_0
  • PassiveTotal_Components_2_0
  • PassiveTotal_Malware_2_0
  • PassiveTotal_Osint_2_0
  • PassiveTotal_Trackers_2_0
  • PassiveTotal_Whois_Details_2_0
  • Pulsedive_GetIndicator_1_0
  • SinkDB_1_1
  • SpamhausDBL_1_0
  • Threatcrowd_1_0
  • URLhaus_2_0
  • Urlscan_io_Scan_0_1_0
  • Urlscan_io_Search_0_1_1
  • PassiveTotal_Enrichment_2_0

In real production I will not use that many, but if I have half of them configured and run more than 2 observables at the same time, it can't handle it.

D4rkw0lv3s avatar Jul 08 '21 13:07 D4rkw0lv3s

@nadouani I don't think the error is that clear, please also see https://github.com/TheHive-Project/Cortex/issues/364#issuecomment-861452321.

mback2k avatar Jul 08 '21 15:07 mback2k

It's not so simple: sometimes it happens when running about 10 observables on one analyzer, and sometimes that runs fine. Most often it happens when running multiple (3) observables on multiple (10+) analyzers. And when analyzers fail with an error (like MISP not being reachable), this seems to hit Cortex harder.

[image] This makes Cortex unresponsive with only one IP address selected.

azgaviperr avatar Aug 04 '21 07:08 azgaviperr

I think it is mostly related to saving artifacts returned by the analyzers into ES. That seems to fill up the connections to ES and make Cortex stuck. This can happen already with just a single analyzer running/finishing.

mback2k avatar Aug 04 '21 16:08 mback2k

My ES is also used for the TheHive index, and while Cortex is unavailable TheHive continues to work correctly. This seems to be an issue on the Cortex side, maybe bad queuing of HTTP requests.

I also had the issue today with 1 observable run against the MISP analyzer.

azgaviperr avatar Aug 05 '21 12:08 azgaviperr

Yes, by filling up the connections to ES I mean specifically the Cortex HTTP connection pool, not the ES side. Our ES cluster is pretty big and does not show any signs of an issue while Cortex is stuck. Also see the issue I linked above.

mback2k avatar Aug 05 '21 17:08 mback2k

I was finally able to work around this issue by modifying cortexutils to not return any artifacts, so that storing them in ES no longer fills up all the connections and threads. This is the change I made in /lib/python3.6/site-packages/cortexutils/analyzer.py, and now our Cortex is stable again:

    def report(self, full_report, ensure_ascii=False):
        """Returns a json dict via stdout.

        :param full_report: Analyzer results as dict.
        :param ensure_ascii: Force ascii output. Default: False"""

        summary = {}
        try:
            summary = self.summary(full_report)
        except Exception:
            pass

        super(Analyzer, self).report({
            'success': True,
            'summary': summary,
            'artifacts': [], #self.artifacts(full_report), # WORKAROUND HERE!
            'full': full_report
        }, ensure_ascii)
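
For analyzers you maintain yourself, a less invasive variant might be to cap the artifacts in a subclass rather than patch the installed library. The sketch below is an assumption-laden illustration, not cortexutils code: `StubAnalyzer` is a hypothetical stand-in that mimics only the `artifacts()` call seen in the snippet above, and `max_artifacts = 50` is an arbitrary value.

```python
class StubAnalyzer:
    """Hypothetical stand-in for cortexutils.analyzer.Analyzer (illustration only)."""

    def artifacts(self, raw):
        # Pretend extraction: one artifact per entry in raw["iocs"].
        return [{"dataType": "domain", "data": d} for d in raw.get("iocs", [])]

    def build_report(self, full_report):
        # Mirrors the dict assembled in the report() method quoted above.
        return {
            "success": True,
            "summary": {},
            "artifacts": self.artifacts(full_report),
            "full": full_report,
        }


class CappedArtifactsMixin:
    """Limit how many artifacts a job returns instead of dropping them all."""

    max_artifacts = 50  # assumption: tune to what your Cortex instance can absorb

    def artifacts(self, raw):
        # Delegate to the real extraction, then keep only the first N results.
        return super().artifacts(raw)[: self.max_artifacts]


class MyAnalyzer(CappedArtifactsMixin, StubAnalyzer):
    """Analyzer whose reports never carry more than `max_artifacts` artifacts."""
```

With the real library, the mixin would be combined with `cortexutils.analyzer.Analyzer` in place of the stub. Setting `max_artifacts = 0` reproduces the workaround above without touching site-packages.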

mback2k avatar Sep 03 '21 09:09 mback2k

@mback2k Any possible impact of this change besides making it work? Maybe when you need to import observables generated by analyzers?

azgaviperr avatar Sep 03 '21 16:09 azgaviperr

Of course the artifacts won't be saved anymore, but this is a trade-off I am currently willing to make.

mback2k avatar Sep 03 '21 17:09 mback2k

Thank you guys for your comments. I understand this is a blocker.

From @mback2k's comments, the issue could be saving the artifacts discovered by the jobs. Basically, @mback2k, you don't need to change the cortexutils code, as extracting artifacts is an option that you can simply disable per analyzer. If disabled, Cortex won't return any artifacts from the job. Could you confirm you have the option enabled?

nadouani avatar Sep 04 '21 06:09 nadouani

@nadouani I will check this on Monday, but I think the configuration only allows adjusting the automatic extraction of artifacts. If an analyzer provides artifacts on its own, e.g. from a sandbox report, then the option won't have any effect.

Also, the main root cause is still that requests to ES are handled in a FIFO fashion by the asynchronous Akka system. If an analyzer job finishes with hundreds of artifacts, saving these to ES blocks all other kinds of requests to ES, including user authentication. With at most 30 concurrent connections to an ES cluster (10 per host with a max. of 30 connections in the pool), this can take some time and quickly gets out of hand if a lot of jobs are being run.
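
The queuing effect can be illustrated with a small model. This is not Cortex code: the 600 artifact saves, the 50 ms per save and the 10 ms authentication lookup are made-up numbers, and only the pool size of 30 comes from the thread.

```python
import heapq


def fifo_completion_times(durations, pool_size):
    """Completion time of each request when served FIFO by `pool_size` connections.

    All requests are assumed to be queued at time 0, in list order.
    """
    free_at = [0.0] * pool_size  # min-heap: when each connection becomes free
    heapq.heapify(free_at)
    done = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available connection
        finish = start + duration
        heapq.heappush(free_at, finish)
        done.append(finish)
    return done


# Assumed workload: 600 artifact saves (50 ms each) queued ahead of one
# authentication lookup (10 ms), all sharing a pool of 30 connections.
saves = [0.05] * 600
auth_behind_saves = fifo_completion_times(saves + [0.01], pool_size=30)[-1]
auth_alone = fifo_completion_times([0.01], pool_size=30)[-1]
```

Under these assumptions the authentication lookup completes after roughly one second instead of 10 ms, because every connection first has to work through 20 saves. Several jobs finishing at once multiply that delay accordingly.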

mback2k avatar Sep 04 '21 10:09 mback2k

Yes, I now understand what your conclusion is. We will figure out how and when to fix that ;)

nadouani avatar Sep 04 '21 10:09 nadouani

Thanks a lot! I would propose introducing some kind of prioritization for the requests to ES. ES requests made as part of a browser/API request should have a higher priority than background ES requests (like saving the results of finished jobs). The latter should probably be done in a non-blocking background fashion anyway, i.e. background requests shouldn't get in the way of foreground requests. Just my two cents. ;-)
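
A minimal sketch of such a prioritization, assuming a single in-process queue sits in front of the ES connection pool. The class and constant names are hypothetical and nothing here reflects how Cortex is actually structured.

```python
import heapq
import itertools

FOREGROUND, BACKGROUND = 0, 1  # lower value is served first


class PrioritizedRequestQueue:
    """Serve foreground requests before background ones, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving submission order

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def next_request(self):
        """Pop the most urgent request; raises IndexError when the queue is empty."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PrioritizedRequestQueue()
for i in range(3):
    queue.submit(BACKGROUND, f"save-artifact-{i}")
queue.submit(FOREGROUND, "authenticate-user")
order = [queue.next_request() for _ in range(len(queue))]
```

Here `order` starts with `"authenticate-user"` even though it was submitted last. Strict priority like this can starve background work under constant foreground load, so a real implementation would likely reserve some connections for background saves or age their priority over time.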

mback2k avatar Sep 04 '21 10:09 mback2k

@nadouani I just verified, we already had the global and per-analyzer setting like this: "auto_extract_artifacts":false, but this did not help with all analyzers as described above.

mback2k avatar Sep 06 '21 12:09 mback2k

@nadouani @To-om any update on fixing this issue? :eyes:

mback2k avatar Sep 20 '21 13:09 mback2k

Hello, still looking forward to a fix for this issue.

azgaviperr avatar Oct 07 '21 09:10 azgaviperr

Yes, same here. @nadouani does StrangeBee provide paid support/development for issues like this? I would be interested.

mback2k avatar Oct 15 '21 11:10 mback2k

Hello, any update on the matter?

azgaviperr avatar Jun 21 '22 06:06 azgaviperr