Events only getting halfway through ingest causing ML grouping weirdness
Documenting for posterity:
We recently ran into a situation where it seemed as if a) we were ignoring Seer results and creating a new group even when Seer found a match, and b) sending the same hash to Seer multiple times.
But TL;DR, here is what actually happened:
- event A (with hash 1) put hash 1 into Seer, but disappeared before it could create a group for hash 1
- event B (with hash 2) matched hash 1 in Seer, but disappeared before it could create a group for hash 2 or be put into a group for hash 1
- event C (with hash 2) matched hash 1 in Seer, but no hash 1 group existed, so it created a group for hash 2
- event D (with hash 1) had no Seer match, so it created a group for hash 1
The more detailed explanation, with links below:
The sequence of events appeared to be this (where the starred lines seemed to indicate bugs):
- Event A with hash 1
- sent to seer
- seer found no match
- hash 1 added to seer DB
- sentry created group 1 for hash 1 events
- Event B with hash 2
- sent to seer
- seer matched hash 2 to hash 1
- based on seer result, sentry should have put event into group 1, but instead created group 2 for hash 2 events *****
- Event C with hash 2
- based on hash, sentry should have put event into group 2, but instead sent it to seer *****
- seer again matched hash 2 to hash 1
- based on seer result, sentry should have put event into group 1, but instead put it in group 2 *****
- Event D with hash 1
- based on hash, sentry should have put event into group 1, but instead sent it to seer *****
- seer found no match (seems you can't match to yourself?)
- sentry put event into group 1
But it turned out the sequence was actually this (where the starred lines indicate the actual problem):
- Event A with hash 1
- sent to seer
- seer found no match
- hash 1 added to seer DB
- event got lost partway through ingest, so group 1 wasn't created *****
- Event B with hash 2
- sent to seer
- seer matched hash 2 to hash 1
- event got lost partway through ingest, so it was neither assigned to group 1 (which would have been hard in any case, since group 1 didn't exist at that point) nor used to create group 2 *****
- Event C with hash 2
- group 2 didn't yet exist, so event sent to seer
- seer again matched hash 2 to hash 1
- based on seer result, sentry should have put event into group 1, but group 1 still didn't yet exist, so event was used to create group 2
- Event D with hash 1
- group 1 didn't yet exist, so event sent to seer
- seer found no match (seems you can't match to yourself?)
- sentry used event to create group 1
We know what was sent to seer, and what seer sent back, based on querying the logs for requests relating to hashes 1 and 2. We know events A and B got lost because searching for them in the issue stream yields no results. And we know that events C and D created groups 2 and 1, respectively, because they are each the oldest event in their issue, even though events A and B came in first.
A word of caution: Just to make this all more confusing, if you click on "oldest" in group 2, you actually get the second-oldest event, not event C, which is the true oldest event. Once we start recording timestamps with millisecond precision, presumably this problem will go away.