cdp-backend
Add params to event gather pipeline to allow long-running runs and to log errored / skipped events
Feature Description
A clear and concise description of the feature you're requesting.
Add parameters:
- batch-size: an optional integer used to iteratively slice the gathered events and run the pipeline on that many events at a time. For example, if the gather for the specified time range finds 50 events and the batch size is 10, the pipeline will run 5 independent times, each with 10 events to process.
- skip-errored-events-during-processing: ignores events that raise an error during processing. Enough debug info should be gathered / kept that the log printed out after the pipeline finishes contains the event details and "the thing that errored".
- skip-errored-events-during-gather: ignores events that fail to scrape / gather. Similar to the above parameter, enough debug info should be printed after scraping, for example "Found 20 events, skipping 2 due to errors".
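To make the batching and skip-on-error behavior concrete, here is a rough sketch in plain Python. It is not the cdp-backend pipeline: `iter_batches`, `run_in_batches`, and `process_event` are hypothetical names, and each "batch" here is just a loop where the real thing would be an independent pipeline run.

```python
import logging
from typing import Callable, Iterator, List, Optional, Sequence, Tuple, TypeVar

log = logging.getLogger(__name__)
T = TypeVar("T")


def iter_batches(
    events: Sequence[T], batch_size: Optional[int] = None
) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size events (everything if None)."""
    if batch_size is None:
        yield events
        return
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def run_in_batches(
    events: Sequence[T],
    process_event: Callable[[T], None],  # hypothetical per-event processing step
    batch_size: Optional[int] = None,
    skip_errored_events: bool = False,
) -> List[Tuple[T, Exception]]:
    """Process events batch by batch and return (event, error) pairs that were skipped."""
    skipped: List[Tuple[T, Exception]] = []
    for batch in iter_batches(events, batch_size):
        for event in batch:
            try:
                process_event(event)
            except Exception as e:
                if not skip_errored_events:
                    raise
                # Keep the event and "the thing that errored" for the final log.
                skipped.append((event, e))

    if skipped:
        log.warning(
            "Found %d events, skipped %d due to errors", len(events), len(skipped)
        )
        for event, error in skipped:
            log.warning("Skipped event %r: %r", event, error)
    return skipped
```

With 50 events and `batch_size=10`, `iter_batches` yields 5 slices of 10 events each. Collecting the skipped events and logging them once at the end keeps the summary in one place instead of scattered through the run output.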
It would also be really interesting to see if I can allow certain errors to be retried, e.g. retry-errors=[ConnectionError].
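A possible shape for the retry idea, again only a sketch: `call_with_retries`, `max_attempts`, and `delay_seconds` are hypothetical names and defaults, not existing cdp-backend options. Only the listed exception types are retried; anything else propagates immediately.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_errors: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,       # assumed default, not from the pipeline
    delay_seconds: float = 1.0,  # assumed fixed delay between attempts
) -> T:
    """Call func, retrying only when it raises one of retry_errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be a positive integer")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_errors:
            # Out of attempts: re-raise the last allowed error to the caller.
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

This would pair with the skip parameters: a random connection error gets a few more tries, while a parsing error is raised right away and can then be skipped and logged.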
Use Case
Please provide a use case to help us understand your request in context.
I am backfilling a lot of data for certain instances and it is becoming annoying to process week by week. This is generally required for a couple of reasons:
- storage space on machine (GHA runners only have 16 GB of disk so can't download and process more than ~4 meeting videos at a time) -- hence batch size
- there are errors in less than 1% of events that aren't random connection errors. These are things like the video page being parsed incorrectly and such.