Very low consumer ACK rate
Description
Having a very hard time loading data into the platform; uploading any data takes a long time to process. 50K items in the queue can take > 3 hours, and I'm seeing a sustained ack rate of ~0.7/s. There is no manageable way to import this data before it begins to backlog.
Environment
- OS (where OpenCTI server runs): CentOS stream 9
- OpenCTI version: 5.11.14
- OpenCTI client: Frontend
- Other environment details:
  - Platform (no web): 15x @ 1c / 6Gb (reserved)
  - Worker: 25x @ 1c / 4Gb (reserved)
  - k8s clustered deployment
  - Elastic cluster: 3x data node @ 4c / 16Gb per node
  - RabbitMQ cluster: 3x node @ 1c / 4Gb per node
  - Redis: 1x 16Gb
  - Networking: 25Gbps x 3 hosts
None of these appear resource-saturated; I am scaled out as far as I can go and the platform is still barely able to import data.
Reproducible Steps
Steps to create the smallest reproducible scenario: N/A
Expected Output
An ACK rate high enough to import data without building up a backlog.
Actual Output
Additional information
Resource usage
NAME CPU(cores) MEMORY(bytes)
elastic-es-data-0 1923m 14645Mi
elastic-es-data-1 884m 11464Mi
elastic-es-data-2 1443m 14217Mi
elastic-es-leaders-0 131m 1931Mi
elastic-es-leaders-1 74m 1833Mi
elastic-es-leaders-2 46m 1927Mi
minio-66787954dd-9g7sn 1m 180Mi
opencti-api-59cc685c95-84qnp 1398m 1401Mi
opencti-api-59cc685c95-9j4vr 502m 550Mi
opencti-api-59cc685c95-bs6sk 2131m 5267Mi
opencti-api-59cc685c95-dhczm 2272m 3755Mi
opencti-api-59cc685c95-g6hnt 953m 3376Mi
opencti-api-59cc685c95-qzwjq 389m 535Mi
opencti-api-59cc685c95-z9t5g 398m 543Mi
opencti-api-59cc685c95-zkvfs 1829m 1929Mi
opencti-web-7c45ccb866-66wnj 810m 760Mi
opencti-web-7c45ccb866-c65fk 632m 677Mi
opencti-web-7c45ccb866-ldcg7 1147m 814Mi
opencti-web-7c45ccb866-rhp5v 698m 675Mi
opencti-web-7c45ccb866-xmzvb 449m 749Mi
opencti-worker-78d7d54ff7-4j6nb 5m 42Mi
opencti-worker-78d7d54ff7-58wjh 4m 42Mi
opencti-worker-78d7d54ff7-6d6q9 4m 42Mi
opencti-worker-78d7d54ff7-84795 3m 42Mi
opencti-worker-78d7d54ff7-8txgh 4m 42Mi
opencti-worker-78d7d54ff7-9nchs 5m 42Mi
opencti-worker-78d7d54ff7-dhv56 4m 42Mi
opencti-worker-78d7d54ff7-ffh2b 4m 42Mi
opencti-worker-78d7d54ff7-m5pkk 1m 42Mi
opencti-worker-78d7d54ff7-nxpwf 4m 42Mi
opencti-worker-78d7d54ff7-p7v7j 3m 42Mi
opencti-worker-78d7d54ff7-qq5cl 4m 42Mi
opencti-worker-78d7d54ff7-tj5vx 3m 42Mi
opencti-worker-78d7d54ff7-w9z66 3m 42Mi
opencti-worker-78d7d54ff7-wclsm 3m 42Mi
opencti-worker-78d7d54ff7-wd7rl 4m 42Mi
opencti-worker-78d7d54ff7-wphtq 4m 42Mi
opencti-worker-78d7d54ff7-wsl49 1m 42Mi
opencti-worker-78d7d54ff7-wv8l5 3m 42Mi
opencti-worker-78d7d54ff7-xbrh4 5m 42Mi
rabbitmq-server-0 17m 767Mi
rabbitmq-server-1 16m 822Mi
rabbitmq-server-2 26m 683Mi
redis-0 503m 14888Mi
Screenshots (optional)
@MaxwellDPS I'm not sure exactly what you are looking for. What is the bug exactly? Is this a feature request to improve our ingestion performance? In the meantime, I encourage you to ask your question on our Slack, in the OpenCTI channel, so that our community might be able to help you out.
Hey @nino-filigran!
First thing I did was ask slack 🦗 🦗 🦗 https://filigran-community.slack.com/archives/CHZC2D38C/p1700517109042759
What is the bug exactly?
^^ Uploading any data takes a long time to process. 50K items in the queue can take > 3 hours, and I'm seeing a sustained ack rate of ~0.7/s. There is no manageable way to import this data before it begins to backlog... Not sure how to be more clear here :/
The platform is flying like a Cessna in an F22 airframe; something is wrong. It has power and is configured per the docs for a clustered deployment. I don't have a deep enough understanding of the platform's DB/Redis architecture to keep looking for bottlenecks beyond the obvious, and I see no resource contention.
I will take anything here: docs (beyond the troubleshooting guide), guidance on what to try to modify to debug, etc. But I'm not sure what else y'all need in order to help.
@nino-filigran @SamuelHassine @richard-julien same as all of these; it's been a running issue. The issue appeared in 5.7.6 for me and has persisted through 5.8.X, 5.10.X, and 5.11.X.
Can we please get some further investigation here? I'm more than willing to help debug on my end.
https://github.com/OpenCTI-Platform/opencti/issues/3513 https://github.com/OpenCTI-Platform/opencti/issues/4722 https://github.com/OpenCTI-Platform/opencti/issues/4748
It will take 24 days and 2 hours to process 1.05M bundles
For the sake of testing, I just doubled CPU and memory on everything (including ES, RMQ, Redis, and all non-connector OCTI pods). This caused no change in the ACK rate.
The only thing that marginally improves the ack rate is restarting the platform, and even that is short-lived.
I have also tried adjusting Redis trimming, lock timing, and retry jitter, and I'm not seeing any tangible change in performance from that either.
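For readers wanting to try the same knobs, a minimal sketch of how they are usually set, assuming OpenCTI's standard SECTION__KEY environment-variable mapping; the concurrency key names and values here are an assumption and should be checked against your version's default.json before use:

```bash
# Sketch only: OpenCTI reads nconf keys from SECTION__KEY environment variables.
export REDIS__TRIMMING=2000000            # max events kept in the Redis stream
export APP__CONCURRENCY__RETRY_COUNT=200  # assumed key: lock acquisition retries
export APP__CONCURRENCY__RETRY_DELAY=100  # assumed key: delay between retries (ms)
export APP__CONCURRENCY__RETRY_JITTER=50  # assumed key: random jitter added to the delay (ms)
export APP__CONCURRENCY__MAX_TTL=60000    # assumed key: maximum lock TTL (ms)
```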
Here is what this looks like in RMQ over time (spikes are restarts of the OCTI cluster)...
This will take 585 hours (or 24 days and 2 hours) to process at an average rate of 0.5/s; this is unusable.
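As a quick back-of-the-envelope check of that figure (my illustration, assuming a static queue of 1.05M bundles and no new inbound messages):

```bash
# Drain time = queue depth / ack rate; at ~0.5 ack/s this is roughly 24 days.
echo "scale=1; 1050000 / 0.5 / 3600" | bc   # ≈ 583.3 hours
```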
5.7.6 Upgrade
Y'all, please take a serious look at this; it has been an issue since 5.7.2 and I've been reporting it since summer.
This is when 5.7.6 rolled out, prompting the creation of https://github.com/OpenCTI-Platform/opencti/issues/3513. There is a very clear correlation between that release and these issues.
I'm almost positive it has to do with Elastic; ES Rally shows all good, so it seems like a data-structure issue. We are running well within the bounds of the recommended specs, and we are deployed as a cluster per the recommendations and docs.
After all the work we did with y'all on the Elastic stack and the docs guidance, @richard-julien, please help 🙏
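A minimal sketch of how the Elasticsearch side of this hypothesis can be checked, assuming the default opencti index prefix and direct access to the ES HTTP API (host and any credentials are placeholders):

```bash
# Shard/doc distribution across the OpenCTI indices -- very large or very skewed
# indices are a common source of slow indexing.
curl -s "http://elastic:9200/_cat/indices/opencti*?v&s=store.size:desc"

# Write thread-pool queueing and rejections on the data nodes -- rejections here
# would point at ES itself rather than the OpenCTI workers.
curl -s "http://elastic:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"
```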
There is an issue with the upgrade path.
Will be updating with some new metrics next week following a complete re-sync to a new platform.
The old instance (160M docs) takes on average 30-45 seconds to load the dashboard and 60+ seconds to load a large report. The new instance, with half the data (80M docs so far), loads the dashboard in ~5-15 seconds on average and the same report in ~10 seconds.
Also want to note: the new instance is seeing an ack rate of ~15/s over 7 days vs the old one's <1/s (tracked via Prometheus).
Both platforms are deployed to the same cluster in the exact same config
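For anyone reproducing this measurement, a sketch of how the ack rate can be spot-checked outside of Prometheus via the RabbitMQ management API, assuming the management plugin on port 15672 (host and credentials are placeholders):

```bash
# Global ack rate as computed by the RabbitMQ management plugin (messages/s).
curl -s -u "$RMQ_USER:$RMQ_PASS" "http://rabbitmq:15672/api/overview" \
  | jq '.message_stats.ack_details.rate'

# Per-queue backlog and ack rate, to see which connector queues are falling behind.
curl -s -u "$RMQ_USER:$RMQ_PASS" "http://rabbitmq:15672/api/queues" \
  | jq -r '.[] | [.name, .messages, (.message_stats.ack_details.rate // 0)] | @tsv'
```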
https://github.com/OpenCTI-Platform/opencti/issues/5500
This seems likely related. @SamuelHassine @richard-julien, this old database is going to be nuked in <3 weeks; if y'all want anything from it to debug further, I will need to know soon.
Hi @MaxwellDPS, sorry to still not have any kind of magic answer for this performance problem. The performance you are seeing is not what we observe in our hosted instances or from other users that have more than 1 billion docs. Don't get me wrong, 60+s to load a large report is not a normal situation, but performance is not linear and depends on the type of data, which is directly related to the custom connectors you could have developed and/or deployed. Elasticsearch index design and rollover strategy can also play a large role.
This is unfortunate, but in order to remediate this situation we need access to the data so we can investigate the data shape and replicate this behavior internally to come up with possible remediations (connector code improvements, a cleanup script, ...). Of course, we are still very open to signing an NDA / doing anything required to unlock this situation, get access to your data, and solve your problem.
Hey @richard-julien, that's great to hear; I can offer access to the data. But I do need a way we can share it meaningfully.
Given the scope here, I really don't know where to start and am deferring to y'all.
We can open a sharing directory in the Filigran OneDrive. This way you should be able to share a complete snapshot of your Elasticsearch indices. Please contact me with a private Slack message to set up the transfer. Thanks.
@richard-julien @SamuelHassine I think I may finally have a root cause. Not 100% sure, but it makes sense through and through, and it would also explain why y'all can't recreate it. Would all of these symptoms make sense if there were hiccups like TCP RSTs in the connection to the Redis server?
I am seeing a TON of these "Extending resources for long processing task" errors for Redis in the platform logs. I know we have an elevated rate of RSTs that seem to come from Istio. If the connection to Redis was lost, is it possible OCTI is waiting for locks to expire, and that could be artificially throttling the platform?!
Previously the assumption for the low ack rate was Elastic and a bad database. I have resolved the bad DB with a sync to a new platform, but the ack rate is still low, though it comes in bursts, which would be explainable if a lock expires.
Is there anything we can do, or can y'all check the Redis implementation for anything that may cause something like this?
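A sketch of how the connection-reset theory can be checked from the Redis side, assuming plain redis-cli access to the Redis pod (hostname, port, and auth are placeholders):

```bash
# Connection churn: a total_connections_received that climbs much faster than
# connected_clients suggests clients are being dropped and reconnecting.
redis-cli -h redis-0 INFO stats | grep -E 'total_connections_received|rejected_connections'
redis-cli -h redis-0 INFO clients | grep -E 'connected_clients|blocked_clients'

# Pub/sub clients and their output buffers -- large omem values here line up with
# the "output buffer limits" kills shown in the Redis logs below.
redis-cli -h redis-0 CLIENT LIST TYPE pubsub
```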
Seeing these Redis errors; it seems like pub/sub clients are being killed. @richard-julien, any recommendations on this or on setting client-output-buffer-limit?
Client id=3675 addr=192.168.1.185:55178 laddr=192.168.7.131:6379 fd=13 name= age=2391 idle=2 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=8 omem=37617856 tot-mem=37655672 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:09:53.741 # Client id=4818 addr=192.168.1.185:59572 laddr=192.168.7.131:6379 fd=13 name= age=18 idle=3 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=15 omem=39031144 tot-mem=39033128 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
1:M 13 Feb 2024 23:10:38.476 # Client id=4838 addr=192.168.1.185:38746 laddr=192.168.7.131:6379 fd=13 name= age=22 idle=5 flags=P db=0 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=16384 rbp=0 obl=0 oll=16 omem=40456576 tot-mem=40473920 events=rw cmd=psubscribe user=default redir=-1 resp=2 lib-name= lib-ver= scheduled to be closed ASAP for overcoming of output buffer limits.
Setting redis-cli config set client-output-buffer-limit "pubsub 512m 128m 0" cleared up these logs, and the ack rate, while lower than expected, has been more stable.
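For completeness, a sketch of verifying and persisting that change; CONFIG SET only lasts until the Redis process restarts, so the same directive needs to land in redis.conf (or the deployment's Redis configuration) to survive a restart:

```bash
# Verify the live value after the CONFIG SET above.
redis-cli CONFIG GET client-output-buffer-limit

# Equivalent persistent setting in redis.conf (class, hard limit, soft limit, soft seconds):
# client-output-buffer-limit pubsub 512mb 128mb 0
```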
Julien, Samuel, 99% sure this is due to network interrupts leading to a disconnect of the pub/sub that is not reconnecting properly, leading to low ack rates and odd data structure.
How can we proceed to fix this? Still seeing it in 5.12.33; happy to open a new issue for that.
Julien, there seems to be an issue that causes the ack rate to drop off. I was seeing 12+ hours of >15/s, then it just died, dropped to <0.5/s, and didn't recover until I flushed Redis.
This error started right as the ack rate dropped and directly corresponds to an ingest-rate drop in Elastic.
This error seems directly related to the ACK rate drop-off; it looks like locking or pub/sub is also playing a role.
Julien -- Please advise 🙏
errors{}.message = Error in store update event
errors{}.message = Update event must contains a valid previous patch
{
"category": "APP",
"errors": [
{
"attributes": {
"genre": "TECHNICAL",
"http_status": 500
},
"message": "Error in store update event",
"name": "DATABASE_ERROR",
"stack": "DATABASE_ERROR: Error in store update event\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:504:11)\n at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
},
{
"attributes": {
"genre": "BUSINESS",
"http_status": 500
},
"message": "Update event must contains a valid previous patch",
"name": "UNSUPPORTED_ERROR",
"stack": "UNSUPPORTED_ERROR: Update event must contains a valid previous patch\n at error (/opt/opencti/build/src/config/errors.js:8:10)\n at UnsupportedError (/opt/opencti/build/src/config/errors.js:83:51)\n at buildStixUpdateEvent (/opt/opencti/build/src/database/redis.ts:467:11)\n at buildUpdateEvent (/opt/opencti/build/src/database/redis.ts:493:10)\n at storeUpdateEvent (/opt/opencti/build/src/database/redis.ts:498:21)\n at updateAttributeMetaResolved (/opt/opencti/build/src/database/middleware.js:1966:27)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)\n at addStixRefRelationship (/opt/opencti/build/src/domain/stixRefRelationship.js:70:3)"
}
],
"inner_relation_creation": 0,
"level": "warn",
"message": "Error in store update event",
"operation": "StixRefRelationshipAdd",
"operation_query": "mutation StixRefRelationshipAdd($input:StixRefRelationshipAddInput!){stixRefRelationshipAdd(input:$input){id standard_id entity_type parent_types}}",
"size": 302,
"time": 224,
"timestamp": "2024-02-28T19:18:20.055Z",
"type": "WRITE_ERROR",
"user": {
"applicant_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
"call_retry_number": "1",
"group_ids": [],
"ip": "192.168.6.192",
"organization_ids": [],
"socket": "query",
"user_id": "6a4b11e1-90ca-4e42-ba42-db7bc7f7d505",
"user_metadata": {}
},
"version": "5.12.33"
}
Samuel/Julien - To give a solid example: after running fine since ~12 PM PT on 2/27, at 6:32 AM PT on 2/28 there was a drop in the ack rate to almost 0.
When this happened, the following symptoms appeared:
- A large drop in index rate and latency on elastic.
- A large drop in Redis Keys
- A spike in OCTI platform errors and lower bundle counts
All of this is in line with https://github.com/OpenCTI-Platform/opencti/issues/3513 - 5.8, 5.9, 5.10, 5.11, and 5.12 have not fixed this.
Platform metrics showing a spike in bundle lock errors and a drop in bundle counts
Seeing a spike in errors logged by the platform (notably a lot of lock and update errors)
RabbitMQ showing ACK rate drops
Redis keys dropping
Elastic showing increased latency
Where can I continue to look, and how can I get the ack rate back up? @richard-julien (sorry for the @, but it's been a month, sir)
Hi @MaxwellDPS, I still think that this kind of behavior (too many locks, performance drops, Redis disconnects due to large events) can be due to some of your data. We know that we are quite slow to ingest cases where a single source of messages tries to create a large number of connected elements.
I am personally working on a solution to improve that case, but for now the best way to mitigate/work around that behavior is to shard the data, either by changing the approach of the custom connectors you may have developed or by deploying the same connector on different time ranges. I know this situation is frustrating, but solving it will really require digging into your data and your connectors.
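As an illustration of the time-range sharding suggested here (my sketch, not Filigran guidance): the same connector image can be run as several instances, each scoped to a slice of the data. CONNECTOR_ID/CONNECTOR_NAME, OPENCTI_URL, and OPENCTI_TOKEN are standard connector settings, but EXAMPLE_START_DATE/EXAMPLE_END_DATE are hypothetical placeholders for whatever time-range options the specific connector actually exposes:

```bash
# Two instances of a hypothetical connector image, each covering a different time
# window so no single message stream has to resolve the whole dataset at once.
docker run -d \
  -e OPENCTI_URL=https://opencti.example.org \
  -e OPENCTI_TOKEN="$OPENCTI_TOKEN" \
  -e CONNECTOR_ID=11111111-1111-1111-1111-111111111111 \
  -e CONNECTOR_NAME="my-feed-2023" \
  -e EXAMPLE_START_DATE=2023-01-01 -e EXAMPLE_END_DATE=2023-12-31 \
  my-org/my-feed-connector:latest

docker run -d \
  -e OPENCTI_URL=https://opencti.example.org \
  -e OPENCTI_TOKEN="$OPENCTI_TOKEN" \
  -e CONNECTOR_ID=22222222-2222-2222-2222-222222222222 \
  -e CONNECTOR_NAME="my-feed-2024" \
  -e EXAMPLE_START_DATE=2024-01-01 \
  my-org/my-feed-connector:latest
```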
Some of the errors you have look unrelated to this lock/perf problem (like "Update event must contains a valid previous patch"). Are you able to share your error logs?
https://github.com/OpenCTI-Platform/opencti/issues/6285 Possibly related
Wanted to update and close this out: these ongoing rate issues were part of #7436.
@richard-julien I really appreciate the help with this one! 🥇