
Cortex website and API not responding to requests

Open · mback2k opened this issue 3 years ago · 5 comments

Request Type

Bug

Work Environment

| Question | Answer |
|---|---|
| OS version (server) | Ubuntu 18.04 LTS |
| OS version (client) | not relevant |
| Cortex version / git hash | 3.1.1-1 |
| Package Type | Debian package |
| Browser type & version | not relevant |

Problem Description

Cortex website and API not responding to requests.

Steps to Reproduce

A few minutes after (re)starting Cortex, once a couple of jobs have run, it gets stuck and becomes unresponsive or extremely slow.

Complementary information

thread-dump.txt

Workaround that eases the situation a bit, but not permanently

The following Akka configuration helped a little, but Cortex still ends up unresponsive after a day or so:

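```hocon
# Debugging and workaround for performance issues
akka {
  # Log the complete effective configuration at startup
  log-config-on-start = on

  actor {
    default-dispatcher {
      # Only the section matching the dispatcher's `executor` setting takes
      # effect; both are set so the limit applies whichever one is used.
      fork-join-executor {
        parallelism-max = 16
      }
      thread-pool-executor {
        fixed-pool-size = 16
      }
      # Process one message per actor before the thread moves on to another actor
      throughput = 1
    }
    default-blocking-io-dispatcher {
      # Twice the thread budget of the default dispatcher
      fork-join-executor {
        parallelism-max = 32
      }
      thread-pool-executor {
        fixed-pool-size = 32
      }
      throughput = 1
    }
  }
}
```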

Giving the blocking I/O dispatcher more threads than the default dispatcher (whichever executor is in use) definitely seems to help, but it does not solve the issue.
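The `executor` key of an Akka dispatcher selects which executor section is actually read, so setting it explicitly removes the "whichever executor is used" ambiguity. A minimal sketch, assuming Akka 2.6 (where `default-blocking-io-dispatcher` lives under `akka.actor`) and that you want both dispatchers on fixed-size thread pools; the pool sizes are just the values from the workaround above, not tuned recommendations:

```hocon
akka.actor {
  default-dispatcher {
    # Pin the executor type so only the thread-pool-executor section applies
    executor = "thread-pool-executor"
    thread-pool-executor {
      fixed-pool-size = 16
    }
    throughput = 1
  }
  default-blocking-io-dispatcher {
    # Same here: the fork-join-executor section would be ignored
    executor = "thread-pool-executor"
    thread-pool-executor {
      fixed-pool-size = 32
    }
    throughput = 1
  }
}
```

With the executor pinned, `log-config-on-start = on` makes it easy to confirm in the startup log which values are actually in effect.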

mback2k, May 28 '21

Here is another thread-dump.txt from the same situation; maybe it helps.

mback2k, Jun 11 '21

I think I have finally found the root cause. After switching on all debug logs, I could see that Cortex is very busy writing artifacts into Elasticsearch, and the outbound HTTP request queue served by the I/O dispatchers appears to be filled with artifact-creation requests. As a result, every other kind of outbound HTTP request, including authentication requests, has to wait.

@To-om could you please help me and take a look at this?

mback2k, Jun 15 '21

It really looks like the workaround implemented in https://github.com/TheHive-Project/elastic4play/issues/97 is not sufficient, because operations running in the same execution context can still block each other. As far as I understand, at least long-running operations such as processing results (e.g. saving artifacts to ES) need to be moved into a separate execution context.
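Moving long-running work onto its own execution context could look roughly like the sketch below. It assumes a code change in Cortex or elastic4play: the dispatcher name `artifact-dispatcher` and the pool size are hypothetical, and Cortex does not read such a setting out of the box. In Akka, a config section with `type = Dispatcher` can be looked up by its path and used as an `ExecutionContext`:

```hocon
# Hypothetical dispatcher reserved for long-running result processing
# (e.g. saving artifacts to Elasticsearch), so that work cannot starve
# the dispatchers serving authentication and other API requests.
artifact-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 8
  }
  throughput = 1
}

# Hypothetical Scala side (requires changing Cortex/elastic4play code):
#   implicit val artifactEc: ExecutionContext =
#     actorSystem.dispatchers.lookup("artifact-dispatcher")
```

Bounding this pool separately caps how many artifact writes run at once, while requests on the other dispatchers keep their own threads.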

mback2k, Jun 16 '21

Just for the record: increasing the number of search/scroll contexts and their timeout limits did not help!

mback2k, Jul 08 '21

Workaround posted here: https://github.com/TheHive-Project/Cortex/issues/374#issuecomment-912398773

mback2k, Sep 03 '21