
Add params to event gather pipeline to allow long-running batches and to log errored / skipped events

evamaxfield opened this issue 2 years ago • 0 comments

Feature Description

A clear and concise description of the feature you're requesting.

Add parameters:

  • batch-size: an optional integer used to iteratively slice the gathered events and run the pipeline on that many events at a time. E.g. if the gather for the specified time range finds 50 events and the batch size is 10, the pipeline will run 5 independent times, each processing 10 events (see the sketch after this list).
  • skip-errored-events-during-processing: ignore events that raise an error during processing. Enough debug info should be gathered / kept that the log printed after the pipeline finishes contains the event details and "the thing that errored".
  • skip-errored-events-during-gather: ignore events that fail to scrape / gather. Similar to the above parameter, enough debug info should be printed after scraping, e.g. "Found 20 events, skipping 2 due to errors".
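
A rough sketch of how the batching and skip-on-error behavior could look; run_in_batches, process_event, and the parameter names are hypothetical illustrations, not the actual cdp-backend pipeline API:

```python
import logging
from typing import Any, Callable, List, Optional, Tuple

log = logging.getLogger(__name__)


def run_in_batches(
    events: List[Any],
    process_event: Callable[[Any], None],
    batch_size: Optional[int] = None,
    skip_errored_events_during_processing: bool = False,
) -> None:
    """Slice gathered events into batches and process each batch independently."""
    batch_size = batch_size or len(events)
    errored: List[Tuple[Any, Exception]] = []

    # e.g. 50 gathered events with batch_size=10 -> 5 independent runs of 10 events each
    for start in range(0, len(events), batch_size):
        batch = events[start : start + batch_size]
        for event in batch:
            try:
                process_event(event)
            except Exception as err:
                if not skip_errored_events_during_processing:
                    raise
                # keep enough debug info to report after the pipeline finishes
                errored.append((event, err))

    if errored:
        log.warning(
            "Processed %d events, skipped %d due to errors",
            len(events) - len(errored),
            len(errored),
        )
        for event, err in errored:
            log.warning("  %r -> %r", event, err)
```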

It would also be really interesting to see if certain errors could be allowed to retry, e.g. retry-errors=[ConnectionError].
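
For example, a minimal helper that retries only on a configured set of error types (call_with_retries and its parameters are hypothetical, just to illustrate the retry-errors=[ConnectionError] idea):

```python
import time
from typing import Any, Callable, Tuple, Type


def call_with_retries(
    func: Callable[..., Any],
    *args: Any,
    retry_errors: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,
    backoff_seconds: float = 5.0,
    **kwargs: Any,
) -> Any:
    """Retry func only when it raises one of the allowed error types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_errors:
            if attempt == max_attempts:
                raise
            # transient errors (e.g. ConnectionError) are retried with a simple backoff;
            # any other exception propagates immediately
            time.sleep(backoff_seconds * attempt)
```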

Use Case

Please provide a use case to help us understand your request in context.

I am backfilling a lot of data for certain instances and it is becoming annoying to process it week by week. Batching is generally required for a couple of reasons:

  • storage space on the machine (GHA runners only have 16 GB of disk, so I can't download and process more than ~4 meeting videos at a time) -- hence the batch-size parameter
  • less than 1% of events error for reasons that aren't random connection errors, e.g. the video page being parsed incorrectly.

evamaxfield • Jun 27 '22 18:06