web-monitoring-processing
Import script should follow more of a pipeline style
The import script has gotten pretty crazy and messy over time, and we could remove a lot of the complexity. Some of it is just because it's taken us a while to learn the ins and outs of the Wayback APIs and their peculiarities, but some of it is just because something was expedient at the time.
A while back, I had a bunch of ideas about how this script could be clearer and more pipeline-y, with a series of generator-based tasks that run on threads connected by FiniteQueue. I've played that out somewhat in the task sheets script. We sort of do that here, but various filtering, summarization, and error-handling bits that should be separate workflow items are mixed together, and what's actually happening isn't always clear.
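To make the idea concrete, here's a rough sketch of the shape I'm imagining: each stage is a plain generator over its input stream, and a small runner wires the stages together on threads with queues in between. This uses the standard library's `queue.Queue` as a stand-in for FiniteQueue, and the stage names (`filter_errors`, `summarize`) and record fields are hypothetical placeholders, not what the import script actually does.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream


def run_stage(task, source, sink):
    """Run a generator-based task on a thread: read items from `source`,
    feed them through `task`, push results into `sink`, and propagate
    the end-of-stream sentinel."""
    def items():
        while (item := source.get()) is not END:
            yield item

    def worker():
        for result in task(items()):
            sink.put(result)
        sink.put(END)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Hypothetical stages -- each is just a generator over its input stream.
def filter_errors(versions):
    for version in versions:
        if version.get('status') != 'error':
            yield version


def summarize(versions):
    for version in versions:
        version['summary'] = f"{version['url']} @ {version['capture_time']}"
        yield version


if __name__ == '__main__':
    raw = queue.Queue()
    filtered = queue.Queue()
    summarized = queue.Queue()

    threads = [
        run_stage(filter_errors, raw, filtered),
        run_stage(summarize, filtered, summarized),
    ]

    # Feed some fake records in and mark the end of the stream.
    raw.put({'url': 'https://example.gov', 'capture_time': '2020-01-01', 'status': 'ok'})
    raw.put({'url': 'https://example.gov/x', 'capture_time': '2020-01-02', 'status': 'error'})
    raw.put(END)

    while (item := summarized.get()) is not END:
        print(item['summary'])

    for thread in threads:
        thread.join()
```

The nice thing about this shape is that each workflow step (filtering, summarization, error handling) lives in its own generator, so it can be tested on a plain list without any threads or queues involved.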
(There might also be some better tools for this now. Things like Databay and Prefect either didn’t exist or I didn’t know about them at the time. Bonobo looked to be in a messy total rewrite and didn’t have some of the facilities we needed, but may be better now.)
This probably isn't high-priority enough to fit on the 2020 roadmap, but it would be some nice cleanup to do if there's time.
Here's a potentially useful sketch I did of this a while back (left is the current flow; right is broken up into more pipeline-y bits):