snowplow-rdb-loader
Streaming transformer's batch emit times should be more flexible
Currently, if the streaming transformer is configured with 5 minute windows, it emits batches at exactly 12:00, 12:05, 12:10, and so on. If there are, say, 50 instances of the streaming transformer running in parallel, then all 50 batches are emitted at exactly the same time. This creates a backlog for the loader, which it works through slowly over the course of a few minutes.
It would be slightly better if the 50 instances emitted batches at slight offsets from each other. For example, instance 1 emits batches at 12:01, 12:06, 12:11, and instance 2 emits batches at 12:02, 12:07, 12:12. This way, the loader receives a steadier stream of batches to load, which could reduce the overall latency of events reaching the warehouse.
This is best implemented by letting the transformer randomly choose the time of its first window when it starts up.
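A minimal sketch of the idea, written in Python purely for illustration and not taken from the transformer's code base. The function names `choose_offset` and `next_emit_time` are hypothetical, and the sketch assumes windows are measured in whole seconds since the epoch: each instance picks one random offset at startup, and every later window boundary is aligned to that offset instead of to the top of the window.

```python
import random


def choose_offset(window_seconds: int, rng: random.Random) -> int:
    """Pick a random offset in [0, window_seconds), once at startup.

    Hypothetical helper: each transformer instance would call this a
    single time and keep the result for its whole lifetime.
    """
    return rng.randrange(window_seconds)


def next_emit_time(now: int, window_seconds: int, offset: int) -> int:
    """Return the first emit time strictly after `now` (epoch seconds).

    Window boundaries fall at offset, offset + window_seconds,
    offset + 2 * window_seconds, and so on.
    """
    # Seconds elapsed since the most recent boundary for this offset.
    elapsed = (now - offset) % window_seconds
    return now + (window_seconds - elapsed)


# Example: 5 minute windows, with one offset chosen at startup.
window = 300
offset = choose_offset(window, random.Random())
first_emit = next_emit_time(0, window, offset)
second_emit = next_emit_time(first_emit, window, offset)
```

With an offset of 0 this reproduces the current behaviour (batches on exact window boundaries). With independent random offsets, 50 instances would spread their emits across the 5 minute window on average, although nothing in this sketch guarantees they are evenly spaced.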
See also #1197, which is the main reason we're going to need flexible emit times.