Cortex
Cortex website and API not responding to requests
Request Type
Bug
Work Environment
Question | Answer |
---|---|
OS version (server) | Ubuntu 18.04 LTS |
OS version (client) | not relevant |
Cortex version / git hash | 3.1.1-1 |
Package Type | Debian package |
Browser type & version | not relevant |
Problem Description
Cortex website and API not responding to requests.
Steps to Reproduce
A few minutes after (re)starting Cortex and running a couple of jobs, it gets stuck and becomes unresponsive or extremely slow.
Complementary information
Workaround that eases the situation a bit, but not permanently
The following Akka configuration helped a little bit, but Cortex will still end up unresponsive after a day or so:
# Debugging and workaround for performance issues
akka {
  log-config-on-start = on
  actor {
    default-dispatcher {
      fork-join-executor {
        parallelism-max = 16
      }
      thread-pool-executor {
        fixed-pool-size = 16
      }
      throughput = 1
    }
    default-blocking-io-dispatcher {
      fork-join-executor {
        parallelism-max = 32
      }
      thread-pool-executor {
        fixed-pool-size = 32
      }
      throughput = 1
    }
  }
}
Giving the blocking I/O dispatcher more threads than the default dispatcher (whichever executor is used) definitely seems to help, but it does not solve the issue.
Here is another thread dump of the same situation (thead-dump.txt); maybe it helps.
I think I finally found the root cause. With all debug logs switched on, I could see that Cortex is very busy writing artifacts to Elasticsearch, and the outbound HTTP request queue served by the I/O dispatchers appears to be filled with artifact-creation requests. Any other outbound HTTP request has to wait behind them, including authentication requests.
@To-om could you please help me and take a look into this?
It really looks like the workaround implemented with https://github.com/TheHive-Project/elastic4play/issues/97 is not sufficient, because operations running in the same execution context can still block each other. Or at least long-running operations like processing results (e.g. saving artifacts to ES) need to be moved into a separate execution context as far as I understand.
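To illustrate the point about shared execution contexts, here is a minimal sketch. It is not Cortex or elastic4play code, and it uses plain `java.util.concurrent` thread pools as a stand-in for Akka dispatchers. The class name, method name, and pool size are hypothetical. Long-running "artifact" tasks occupy every thread of a pool. A short "auth" task then either queues behind them on the same pool or runs at once on its own pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DispatcherIsolationDemo {

    // Hypothetical pool size, standing in for a dispatcher's thread count.
    static final int POOL_SIZE = 2;

    /**
     * Submits POOL_SIZE long-running "artifact" tasks, then one short "auth" task.
     * Returns true if the auth task finishes while the artifact tasks are still running.
     */
    static boolean authCompletesWhileArtifactsRun(boolean separatePool) {
        ExecutorService artifactPool = Executors.newFixedThreadPool(POOL_SIZE);
        // Either reuse the busy pool or give auth requests a pool of their own.
        ExecutorService authPool = separatePool ? Executors.newFixedThreadPool(1) : artifactPool;
        CountDownLatch artifactsMayFinish = new CountDownLatch(1);
        try {
            // Occupy every thread of the artifact pool until the latch is released.
            for (int i = 0; i < POOL_SIZE; i++) {
                artifactPool.submit(() -> {
                    try {
                        artifactsMayFinish.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<String> auth = authPool.submit(() -> "authenticated");
            try {
                auth.get(500, TimeUnit.MILLISECONDS);
                return true;   // auth was served while artifacts were still running
            } catch (TimeoutException e) {
                return false;  // auth is stuck in the queue behind the artifact tasks
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            // Let the artifact tasks finish and release all threads.
            artifactsMayFinish.countDown();
            artifactPool.shutdown();
            authPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:   auth completes = " + authCompletesWhileArtifactsRun(false));
        System.out.println("separate pool: auth completes = " + authCompletesWhileArtifactsRun(true));
    }
}
```

With the shared pool the auth task should time out, and with the separate pool it should complete immediately. Under these assumptions, the same would apply to Cortex: raising thread counts only delays the stall, whereas moving artifact saving to its own execution context would keep authentication requests responsive.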
Just for the record: increasing search/scroll context number and timeout limits did not help!
Workaround posted here: https://github.com/TheHive-Project/Cortex/issues/374#issuecomment-912398773