etl
Revise pcap parser file selection algorithm to eventually process 100% of the data
Revise the archive file selection algorithm for the pcap parser to rotate through all of the data in 10% batches.
Consider a hash-based selection: if (HASH(filename) + epoch) % 10 == 0 { process file }, where epoch is incremented every time the pcap gardener reaches the end of the data.
I don't think there is any particular reason we shouldn't just let this parse all the data. It should only take a few days. Then we should probably shut it off rather than reprocessing it regularly.
A more useful bug fix would be to change the processing location, so that we aren't moving data between regions. This is the biggest concern when processing 100% of the pcaps.
We could instead consider copying the table from staging.
(In reply to the issue description sent by 'Matt Mathis' via code-reviews on Fri, Sep 24, 2021 at 4:04 PM.)
-- Greg Russell / Measurement-Lab https://memegen.googleplex.com/4558349824688128
We are now processing 10% of the pcaps every 16 days. Please update to process all current and historical files.
SELECT COUNT(DISTINCT date) AS days, MIN(parser.Time) AS OldestParse
FROM `mlab-oti.ndt_raw.pcap`
Yields (run on 2022-03-22): days = 838, OldestParse = 2022-03-06 02:31:10.345666 UTC