Control Redis memory usage: limit stix_ids array length in redis stream entries to 1000 ids
Proposed changes
When `stream.opencti` entries in redis carry an `extension-definition--ea279b3e-5c71-4632-ac08-831c66a786ba` modification that changes the `stix_ids` field, that field can grow extremely large during ingestion of common entity types which don't have a deterministic formula for producing a STIX id. Re-ingestion of the same entity (such as a `malware` STIX type) under new randomly-assigned STIX ids can produce a long sequence of redis entries: from what I observe, an entry is added to the stream once for each new STIX id encountered for the same item, and the list of `stix_ids` in the added/updated entity can grow without bound. This is one of the reasons redis memory consumption can grow so high under some circumstances, even with a low TRIMMING size (such as 100000).
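For context, here is a hedged sketch of what appending an event with stream trimming typically looks like with ioredis (illustrative only, not the exact `redis.ts` code; the field name is made up). The point is that TRIMMING bounds the number of stream entries, not the size of each entry, so a single entry carrying tens of thousands of `stix_ids` can still be very large.

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const TRIMMING = 100000; // the TRIMMING setting mentioned above: approximate max stream length

// Append one event to the live stream, letting redis trim the stream to roughly
// TRIMMING entries. Each entry still carries the full serialized event, including
// the whole stix_ids list of the entity.
const appendEvent = async (eventJson: string) => {
  await redis.xadd('stream.opencti', 'MAXLEN', '~', TRIMMING, '*', 'message', eventJson);
};
```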
This change detects when the list exceeds an upper limit (set to 1000 here) and cuts it down to the trailing 1000 items from the provided list, in redis stream inserts only. This appears to fix the out-of-memory conditions I was experiencing while ingesting from the ESET connector (issue linked below), and everywhere else I have tested continues to work as far as I can tell. Additionally, with TRIMMING set to 100000, this change keeps redis memory usage around 200-500MB in my tests, rather than consuming multiple GB.
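For clarity, here is a minimal sketch of the guard being proposed (illustrative names only, not the exact code in `redis.ts`):

```typescript
// Upper bound on the number of stix_ids carried in a single stream entry (1000 in this PR).
const MAX_STIX_IDS_IN_STREAM = 1000;

// Applied to redis stream inserts only: the stored entity keeps its full stix_ids list,
// while the copy written to stream.opencti is capped to the trailing (most recent) ids.
const capStixIds = (stixIds: string[]): string[] => (
  stixIds.length > MAX_STIX_IDS_IN_STREAM
    ? stixIds.slice(-MAX_STIX_IDS_IN_STREAM)
    : stixIds
);
```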
I definitely recognize that this might not be an ideal solution, given the potential for data loss in `stix_ids`, so I'm proposing this change for broader testing and feedback, particularly from people who rely on the stream features, as well as from those who better understand how the platform consumes its own stream.
The problem we are running into, which this tries to address, is that redis memory consumption can easily grow to 10-20GB even with a very low TRIMMING value, because of the way `stix_ids` is reported in the data stream, per the situation above. This creates an unpredictable availability bug and puts extremely high (and potentially expensive) RAM pressure on the system, all for highly duplicative information going into the stream.
If the community/maintainers don't agree with this change, we would appreciate alternative recommendations to address the problem. Allocating more RAM for redis is not a viable mitigation in this case.
Related issues
- https://github.com/OpenCTI-Platform/connectors/issues/1834
Checklist
- [x] I consider the submitted work as finished
- [x] I tested the code for its functionality
- [ ] I wrote test cases for the relevant use cases (coverage and e2e)
- [ ] I added/updated the relevant documentation (either on GitHub or on Notion)
- [ ] Where necessary I refactored code to improve the overall quality
Codecov Report
Attention: Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.
Project coverage is 67.58%. Comparing base (`e7b4870`) to head (`92e7351`). Report is 30 commits behind head on master.
Files | Patch % | Lines |
---|---|---|
...cti-platform/opencti-graphql/src/database/redis.ts | 70.00% | 6 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #6774 +/- ##
==========================================
- Coverage 68.01% 67.58% -0.43%
==========================================
Files 538 545 +7
Lines 65711 66477 +766
Branches 5568 5583 +15
==========================================
+ Hits 44691 44931 +240
- Misses 21020 21546 +526
Very interesting. Do you have any idea if that would affect vulnerability and data-source types? We are seeing random redis out-of-memory incidents, and have narrowed the incidents to when we use a lot of these types.
Maybe; it is certainly possible. Both types you describe are entity types like the `malware` type (not the `artifact` type) where I encountered the issue. In my case, the data source doesn't store malware families/names as distinct items, so they don't each have a dedicated STIX id in the source platform. A new id is generated for them each time they're queried from that platform, because the `malware` type doesn't appear to have a deterministic STIX id generation algorithm. These STIX ids then appear to be added to the resulting entity within OpenCTI, perhaps as part of the de-duplication algorithm. The full list of `stix_ids` is written to the resulting stream entries each time a new STIX id is added, rather than just an addition action for the new alias STIX id. At least, I think that is what's going on.
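To make the deterministic vs. random distinction concrete, here is a hedged illustration using the `uuid` package (the namespace value and the properties being hashed are assumptions for illustration, not OpenCTI's actual id algorithm):

```typescript
import { v4 as uuidv4, v5 as uuidv5 } from 'uuid';

// Illustrative namespace only; the platform uses its own namespace and its own
// set of id-contributing properties per entity type.
const ILLUSTRATIVE_NAMESPACE = '1b671a64-40d5-491e-99b0-da01ff1f3341';

// Deterministic: the same malware name always yields the same STIX id,
// so re-ingesting the entity never introduces a new alias.
const deterministicMalwareId = (name: string): string =>
  `malware--${uuidv5(name.trim().toLowerCase(), ILLUSTRATIVE_NAMESPACE)}`;

// Non-deterministic: every export from the source gets a fresh random id,
// and each fresh id ends up appended to stix_ids on upsert.
const randomMalwareId = (): string => `malware--${uuidv4()}`;

console.log(deterministicMalwareId('SomeFamily') === deterministicMalwareId('SomeFamily')); // true
console.log(randomMalwareId() === randomMalwareId());                                       // false
```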
In my proposed fix, I'm assuming that the newest item(s) will be at the end of the list, and that any instance following the data stream will have consumed the earlier records (with the earlier parts of the STIX id alias list) first, and so will already be retaining the old ids, ignoring the duplicates, and adding the new ones as part of its stream ingest process.
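In other words, I'm assuming a stream consumer effectively applies each entry something like this (a hedged sketch of consumer-side behaviour, not code from the platform):

```typescript
// Merge the stix_ids from a new stream entry into what the consumer already holds:
// earlier ids are retained, duplicates are ignored, and new ids are appended.
const mergeStixIds = (existing: string[], fromStreamEntry: string[]): string[] => {
  const merged = new Set(existing);
  for (const id of fromStreamEntry) {
    merged.add(id);
  }
  return Array.from(merged);
};
```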
Attaching a screenshot to further illustrate the behavior I am seeing (viewing the HISTORY tab of a malware created from the external feed):
I've hovered over the top line to display the full message, which says it added a whole list of STIX IDs to the malware entity. Looking down that list, you can see that the first and second STIX IDs are always the same (so the order of the list appears preserved). The entry I am hovering over lists 3 STIX IDs that were added, plus another 44,558 STIX IDs that were also added. Hovering over lower (older) items in this list shows the same 3 items at the head of the list, with the count of additional items decreasing by one each line down (44,557, 44,556, 44,555, and so on). The un-shortened list is never displayed in the UI. Every newer line contains the entire list from the prior line, plus the one new STIX ID added by the event that the new line represents.
Normally, redis contains this entire list of STIX IDs, so assuming about 40 bytes per ID, each line can take over 1.5MB apiece. Multiply that by 10,000 to estimate the space needed to store entries 40,000 through 50,000 of my sequence, and it could require around 15GB just to hold those 10,000 modification entries, not even counting the 40,000 that precede them.
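The back-of-the-envelope arithmetic behind that estimate (the 40 bytes per id and the 10,000-entry window are my own assumptions):

```typescript
const bytesPerId = 40;          // rough size of one "malware--<uuid>" string
const idsPerEntry = 40_000;     // low end of the 40,000-50,000 range above
const entriesInWindow = 10_000; // stream entries 40,000 through 50,000

const bytesPerEntry = bytesPerId * idsPerEntry;     // 1,600,000 bytes, about 1.6 MB per entry
const totalBytes = bytesPerEntry * entriesInWindow; // 16,000,000,000 bytes, about 16 GB

console.log(`${(bytesPerEntry / 1e6).toFixed(1)} MB per entry, ~${(totalBytes / 1e9).toFixed(0)} GB for the window`);
```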
Also, another observation: I am curious whether this behavior might be contributing to the pattern some have observed of ingest speed (bundles/second) gradually falling over the course of long ingests of large data sets, since the stream is reflecting data that's being sent to and read from the elasticsearch back-end too.
Thanks for your explanation. I've asked our team to create only v5 UUIDs, so at least our own bundles will not add to the problem. I suppose third parties will still create a random UUID for the same objects, so those will accumulate and create the situation you explained.
On your last observation: I think merging bundles is a time-consuming operation, so you are probably right. If you receive the same object many times and it is merged because its UUID is different each time, that most likely makes ingestion slower.
@AlexSanchezN Another issue is that tons of updates to the same object(s) in the queue mean those STIX objects all get locked, forcing the platform to single-thread the updates regardless of the number of platform or worker instances running. The other workers have to wait until the worker holding the lock completes the time-consuming update that writes that long list of ids to the entity.
I wonder if some sort of background consolidation process would be possible. Once a bundle has been correctly ingested, the alternative ids (I mean the random v4 UUIDs) should not be needed anymore by other bundles. That is the main reason new alternative ids are generated, right? Because they change in other/different bundles each time the same object is created; they are not deterministic. So maybe a background process could replace the alternative ids on the other objects that were in the bundle with the main unique v5 UUID id. That would remove >40,000 alternative ids from the object and make merging/upserting/updating faster.
I don't know if I've explained myself! :-)
I think this pull request cannot be integrated like this.
We have different use cases where `stix_ids` must be correctly maintained:
- After a merge => consolidation of the ids is mandatory.
- Upsert at integration => to ensure correct data continuity => this consolidation can lead to a `stix_ids` explosion if the connector is not correctly developed or if the remote source is doing really bad stuff. This is the main problem this PR tries to resolve.
- When the `stix_ids` have real value from the remote source => some users are really attached to keeping this consolidation and will not accept any purge or removal.
With these 3 use cases, I have no idea how we can differentiate between them.
My 2 cents on what could be a good solution:
We need to generate the standard id in the Python client and dynamically replace all the ids with the predicted ones.
By default, drop the source's original id, but add a connector option to configure whether you want to consolidate them.
Plus a cleanup script for existing platforms to remove all the bad `stix_ids` already written.
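A hedged sketch of that id-replacement idea (purely illustrative; the real work is planned on the Python client side, and `predictStandardId` stands in for whatever deterministic, UUIDv5-based generation would be implemented; it is not an existing pycti/OpenCTI API):

```typescript
type StixObject = { id: string; type: string; [key: string]: unknown };

const rewriteBundleIds = (
  objects: StixObject[],
  predictStandardId: (obj: StixObject) => string, // hypothetical deterministic-id predictor
  keepOriginalIds = false, // the connector option described above
): StixObject[] => {
  // Map every incoming (possibly random) id to its predicted deterministic id.
  const mapping = new Map<string, string>();
  for (const obj of objects) {
    mapping.set(obj.id, predictStandardId(obj));
  }
  return objects.map((obj) => {
    // Rewrite the object's own id and every reference to other rewritten ids
    // (source_ref/target_ref, object_refs, created_by_ref, ...) by walking all string values.
    const rewritten = JSON.parse(
      JSON.stringify(obj),
      (_key, value) => (typeof value === 'string' && mapping.has(value) ? mapping.get(value) : value),
    ) as StixObject;
    // Optionally carry the original source id for consolidation; the property name
    // x_opencti_stix_ids is an assumption about where that would live.
    return keepOriginalIds ? { ...rewritten, x_opencti_stix_ids: [obj.id] } : rewritten;
  });
};
```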
Thanks. I like this idea; the current behavior could then be opt-in for any connectors that are incompatible with overwriting the external STIX IDs, and those connectors can be evaluated to ensure they aren't blowing up the STIX ID aliases. One thing to keep in mind for cases where the ID replacement occurs: any relationships or other references also need to be updated with the new STIX ID. It makes sense that there are use cases out there where the STIX ID we receive from an external source is unique and persistent, and we want to keep track of it in the platform too, so that it can be used for future reference in the external system (which may not be using the same STIX ID generation algorithm we are).
@SouadHadjiat fixed
Closing this PR, as work has started on https://github.com/OpenCTI-Platform/opencti/issues/7158, following the solution:
> We need to generate the standard id in the Python client and dynamically replace all the ids with the predicted ones.
> By default, drop the source's original id, but add a connector option to configure whether you want to consolidate them.
> Plus a cleanup script for existing platforms to remove all the bad `stix_ids` already written.