Improve Efficiency of Analysis for All Beacon Modules
# Current behavior

For all beacon modules, we analyze 24 hours' worth of data: the current hour's data plus the past 23 hours. We do this for each new hour that comes in, which means we are effectively analyzing each hour of data 24 times before it rolls out of the logs. Additionally, this requires MongoDB queries to retrieve the previous 23 hours of data on every run.
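For illustration, here is a minimal Go sketch of the current flow. The helpers `queryHours` and `analyzeBeacons` are hypothetical stand-ins for the MongoDB query and the per-module analysis, not RITA's actual API:

```go
package main

import "fmt"

// Conn is a minimal stand-in for a logged connection record.
type Conn struct {
	Src, Dst string
	TS       int64
}

// queryHours stands in for the MongoDB query that pulls connection records
// for hours start through end inclusive (hypothetical helper).
func queryHours(start, end int) []Conn { return nil }

// analyzeBeacons stands in for a full beacon-analysis pass over a window.
func analyzeBeacons(conns []Conn) {}

func main() {
	// Each new hour re-queries and re-analyzes the previous 23 hours along
	// with the current one, so every hour of data is processed 24 times
	// before it rolls out of the window.
	for hour := 23; hour <= 26; hour++ {
		window := queryHours(hour-23, hour)
		analyzeBeacons(window)
		fmt.Printf("hour %d: analyzed hours %d through %d\n", hour, hour-23, hour)
	}
}
```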
# Desired behavior

Rather than re-analyze data we have already analyzed, we should perform analysis on a single hour of data at a time and store the key metrics (e.g., the histogram info and other metrics). When calculating the overall scores and metrics for each beacon module, we can combine the 24 hours' worth of stored metrics for each src-dst pair and calculate our final beacon scores (see the sketch after the list below). This approach has several benefits:
- We aren't repeating calculations that have already been done.
- We are performing calculations on a much smaller set of data at any given time.
- We don't have to make MongoDB queries to pull large amounts of connection information for every host for each beacon module.

One potential issue that came up in discussion is that we might miss beacons with an interval of 30 minutes or longer, since that would essentially be our Nyquist frequency when calculating time deltas over just one hour at a time rather than 24. Logan, however, pointed out that we can simply store the last timestamp we analyzed for each src-dst pair. When the next hour comes in, the first delta we calculate is between the first timestamp of the current hour and the last stored timestamp for that src-dst pair (whenever it occurred). This would allow us to continue catching beacons with long interval times.
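A minimal Go sketch of what the per-hour storage and scoring could look like. The names here (`HourlyMetrics`, `analyzeHour`, `combineWindow`) and the fields are illustrative assumptions, not RITA's actual schema or API:

```go
package main

import (
	"fmt"
	"sort"
)

// HourlyMetrics holds the per-hour results we would persist for a src-dst
// pair instead of re-querying the raw connections (hypothetical struct).
type HourlyMetrics struct {
	DeltaHistogram map[int64]int // histogram of inter-connection intervals (seconds)
	ConnCount      int           // number of connections seen in this hour
	LastTS         int64         // last timestamp seen, carried into the next hour
}

// analyzeHour computes metrics for one hour of timestamps for one src-dst
// pair. prevLastTS is the stored last timestamp from the previous hour (0 if
// none); the first delta bridges the hour boundary so intervals of 30 minutes
// or longer are still captured.
func analyzeHour(timestamps []int64, prevLastTS int64) HourlyMetrics {
	sort.Slice(timestamps, func(i, j int) bool { return timestamps[i] < timestamps[j] })
	m := HourlyMetrics{DeltaHistogram: map[int64]int{}, ConnCount: len(timestamps)}
	last := prevLastTS
	for _, ts := range timestamps {
		if last != 0 {
			m.DeltaHistogram[ts-last]++
		}
		last = ts
	}
	m.LastTS = last
	return m
}

// combineWindow merges the stored metrics for the most recent 24 hours into a
// single histogram and count, which the scorers would consume instead of raw
// connection data.
func combineWindow(hours []HourlyMetrics) (map[int64]int, int) {
	total, count := map[int64]int{}, 0
	for _, h := range hours {
		for delta, n := range h.DeltaHistogram {
			total[delta] += n
		}
		count += h.ConnCount
	}
	return total, count
}

func main() {
	base := int64(1700000000)
	h1 := analyzeHour([]int64{base, base + 1800, base + 3599}, 0)
	h2 := analyzeHour([]int64{base + 5400, base + 7200}, h1.LastTS)
	hist, n := combineWindow([]HourlyMetrics{h1, h2})
	fmt.Println(hist, n) // the ~30 min delta across the hour boundary is preserved
}
```

The stored last timestamp is what lets the single-hour analysis still observe long intervals without re-reading the previous hours of raw data.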
The hope is that this approach will give a speed boost to all of the beacon modules, especially with larger data sets.