fix(grouping): Only run grouping calculation once

Open lobsterkatie opened this issue 1 year ago • 0 comments

During grouping, the slowest part of the process is calculating the variants. Before Seer and grouphash metadata, we only needed them once, but now we use them in two places in the Seer flow and are about to use them in another place for grouphash metadata. To avoid calculating them up to four times, this PR refactors things so that they're passed from the place where they're initially calculated (in event.get_hashes) through the various intermediate functions to the spots in the Seer flow where they're currently used. The place where we'll need them for grouphash metadata also lies along this path.
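
As a rough illustration of the idea (not the actual Sentry code), the sketch below calculates the variants once and hands them to each consumer as an argument. Every name here (`calculate_variants`, `should_call_seer`, `build_seer_request`, `ingest`) is hypothetical, and the call counter exists only to show that the expensive step runs a single time.

```python
from typing import Dict, List, Optional, Tuple

# Counts how often the expensive step runs (illustration only).
CALLS = {"count": 0}


def calculate_variants(event: dict) -> Dict[str, str]:
    """Hypothetical stand-in for the slow variant calculation."""
    CALLS["count"] += 1
    message = event["message"]
    return {"app": f"app:{message}", "system": f"system:{message}"}


def hashes_from_variants(variants: Dict[str, str]) -> List[str]:
    """Derive hashes from variants that were already calculated."""
    return sorted(variants.values())


def should_call_seer(event: dict, variants: Dict[str, str]) -> bool:
    """First hypothetical consumer in the Seer flow: reuses the variants."""
    return "app" in variants


def build_seer_request(event: dict, variants: Dict[str, str]) -> dict:
    """Second hypothetical consumer in the Seer flow: also reuses them."""
    return {"stacktrace": variants["app"]}


def ingest(event: dict) -> Tuple[List[str], Optional[dict]]:
    # Calculate once, then pass the result down instead of recalculating.
    variants = calculate_variants(event)
    hashes = hashes_from_variants(variants)
    request = None
    if should_call_seer(event, variants):
        request = build_seer_request(event, variants)
    return hashes, request
```

Calling `ingest({"message": "boom"})` leaves `CALLS["count"]` at 1 even though three functions consume the variants.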

Notes:

  • To avoid changing unrelated uses of get_hashes, I left its return value alone and instead extracted most of its inner logic into a separate get_hashes_and_variants method. Now get_hashes calls get_hashes_and_variants (and just ignores the variants), and the spot in ingest where we used to call get_hashes now calls get_hashes_and_variants.

  • We have a few pairs of helpers, for calculating primary and secondary hashes respectively, which need to have matching signatures, meaning that if the primary-hash version returns variants, the secondary-hash version must, too. That said, we never actually want to use the secondary variants, so rather than having the secondary version of each helper return the real variants, I chose to have it return an empty dictionary. Since we ignore that part of the result it doesn't really matter, but I figured that for debugging, it's easier to keep track of "this one I want, this one I don't" if one has real data and one is empty.
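
The two notes above can be sketched roughly as follows. This is a simplified stand-in, not the real implementation: the `Event` class, the variant contents, the MD5 hashing, and the helper names `calculate_primary_hashes` and `calculate_secondary_hashes` are all assumptions made for illustration. Only `get_hashes` and `get_hashes_and_variants` are names taken from the description.

```python
import hashlib
from typing import Dict, List, Tuple


class Event:
    """Hypothetical, heavily simplified stand-in for the real event class."""

    def __init__(self, data: dict) -> None:
        self.data = data

    def get_hashes_and_variants(self) -> Tuple[List[str], Dict[str, str]]:
        # The logic that used to live in get_hashes, now returning both values.
        message = self.data["message"]
        variants = {"app": f"app:{message}", "system": f"system:{message}"}
        hashes = [hashlib.md5(v.encode()).hexdigest() for v in variants.values()]
        return hashes, variants

    def get_hashes(self) -> List[str]:
        # Same return value as before, so unrelated callers are unaffected.
        hashes, _variants = self.get_hashes_and_variants()
        return hashes


def calculate_primary_hashes(event: Event) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical primary helper: returns the real variants."""
    return event.get_hashes_and_variants()


def calculate_secondary_hashes(event: Event) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical secondary helper: same signature, but the variants slot
    is deliberately an empty dict because nothing downstream uses it."""
    return event.get_hashes(), {}
```

Keeping the second element of the secondary result empty means that anyone inspecting the two results while debugging can tell at a glance which set of variants is meant to be used.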

lobsterkatie, Oct 04 '24 18:10