feat(batch-exports): Backfill all persons initially in S3
Problem
When backfilling a batch export, we go through all previous batches until we get to the desired end date. However, running a batch has a lot of overhead costs in network transmission and database updates. Backfills could run more efficiently if we were to do a single initial "big" backfill with all historical data up to the desired end date.
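To give a rough sense of scale, the sketch below counts how many batch runs a backfill triggers when it walks one interval at a time. The helper name `count_batch_runs`, the 30-day range, and the chosen intervals are illustrative assumptions, not values taken from this PR.

```python
import math
from datetime import datetime, timedelta, timezone


def count_batch_runs(start: datetime, end: datetime, interval: timedelta) -> int:
    """Return how many interval-sized batches are needed to cover [start, end).

    Hypothetical helper for illustration only; the real backfill logic
    lives in the batch exports code and may differ.
    """
    if end <= start:
        return 0
    # Dividing two timedeltas yields a float; round up to cover a partial batch.
    return math.ceil((end - start) / interval)


# Illustrative 30-day backfill range.
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 31, tzinfo=timezone.utc)

hourly_runs = count_batch_runs(start, end, timedelta(hours=1))
five_minute_runs = count_batch_runs(start, end, timedelta(minutes=5))

print(f"hourly: {hourly_runs} runs, every 5 minutes: {five_minute_runs} runs")
```

Each of those runs pays the per-batch overhead, whereas a single initial backfill up to the desired end date pays it once.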
This is particularly relevant for persons batch export, as updates to persons will be overwritten by later updates to the same person, making them pointless to backfill in the first place. It is also particularly relevant for high-frequency batch exports, as those generate the largest amount of batch runs when backfilling, and thus pay the most in overhead costs.
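A minimal sketch of why per-batch person backfills are redundant: if only the latest update per person survives, exporting every intermediate update adds nothing. The `PersonUpdate` type and its `person_id`, `version`, and `properties` fields are assumptions made for this example and do not mirror the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class PersonUpdate:
    # Hypothetical shape of a person update, for illustration only.
    person_id: str
    version: int
    properties: dict = field(default_factory=dict)


def latest_per_person(updates: list[PersonUpdate]) -> dict[str, PersonUpdate]:
    """Keep only the highest-version update seen for each person."""
    latest: dict[str, PersonUpdate] = {}
    for update in updates:
        current = latest.get(update.person_id)
        if current is None or update.version > current.version:
            latest[update.person_id] = update
    return latest


updates = [
    PersonUpdate("a", 1, {"plan": "free"}),
    PersonUpdate("b", 1, {"plan": "free"}),
    PersonUpdate("a", 2, {"plan": "paid"}),
    PersonUpdate("a", 3, {"plan": "enterprise"}),
]

final_state = latest_per_person(updates)
print(f"{len(updates)} updates collapse to {len(final_state)} persons")
```

Exporting the two final rows in one pass leaves the destination in the same state as replaying all four updates batch by batch.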
Changes
- Add support for running backfills without an `interval_start`. Currently, this is only supported for S3 batch exports. To do this, we are:
  - Adding a new `persons_batch_export_backfill` view. Unfortunately, ClickHouse doesn't quite yet support optional parameters, which would have allowed us to re-use the previous view.
  - Updating code to handle `interval_start` being `None`.
  - Updating Django models to support `interval_start=None`.
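The handling of a missing `interval_start` could be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: the function `select_persons_view` is hypothetical, and `persons_batch_export` is only a placeholder name for the previous view; only `persons_batch_export_backfill` is named in the description above.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def select_persons_view(
    interval_start: Optional[datetime],
    interval_end: datetime,
) -> tuple[str, dict[str, Any]]:
    """Pick a view name and its query parameters for a persons export.

    Hypothetical helper: with no interval_start, fall back to the backfill
    view, which only needs an upper bound.
    """
    if interval_start is None:
        # Initial "big" backfill: everything up to interval_end.
        return "persons_batch_export_backfill", {"interval_end": interval_end}

    if interval_start >= interval_end:
        raise ValueError("interval_start must be earlier than interval_end")

    # Regular batch: bounded on both sides. The view name is a placeholder.
    return "persons_batch_export", {
        "interval_start": interval_start,
        "interval_end": interval_end,
    }


end = datetime(2024, 1, 1, tzinfo=timezone.utc)
view, params = select_persons_view(None, end)
print(view, sorted(params))
```

Using a separate view for the unbounded case follows from the ClickHouse limitation noted above: without optional parameters, one view cannot serve both the bounded and unbounded queries.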
Does this work well for both Cloud and self-hosted?
Both.
How did you test this code?
Added unit tests where relevant.
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label; otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.