Same source/different configurations. Parallel pipelines for extraction, common pipeline for normalization/load. Call for ideas/hints.
Please let me know if this fits Slack better than GitHub, in which case I'll post it there instead, though I'm afraid it won't persist long enough to collect valuable answers.
What I'd like to do:
- Extract all the sources (actually, different auth configurations for the same source) independently (parallel pipelines)
- Run normalization/loading once, for all the sources together (a concrete sketch of this is at the bottom of the post)
Here's the setup:
- I pull from the same source type (e.g. Meta Ads) for hundreds of different configurations (e.g. ad accounts/clients)
- The data for each configuration ends up in the same schema/tables (the source looks roughly like the sketch below)
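For reference, the source side looks roughly like this (`meta_ads_source`, `make_client` and `iter_ad_pages` are placeholders for my real implementation):

```python
import dlt

@dlt.source
def meta_ads_source(account_config):
    # one configuration = one ad account's auth/credentials
    @dlt.resource(name="ads", write_disposition="append")
    def ads():
        client = make_client(account_config)  # placeholder: builds an authed API client
        for page in client.iter_ad_pages():   # placeholder: yields data page by page
            yield page

    return ads
```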
Why I won't just use a single pipeline with many resources
- I need a "best-effort" approach resistant to both timeouts and errors, i.e. if any configuration fails with an error or times out, I'd still like to load all those that finished successfully.
- In a single pipeline, if any of the resources fails, the entire pipeline fails.
- I can't silently catch the error inside every resource either: dlt would interpret that as a normal end of the resource, and partial data would get loaded for the failed configurations (illustrated below). I can't have that. I could theoretically exhaust each resource fully before yielding, but that would cause significant performance degradation (no yielding per item/batch possible).
- On top of that, my timeout signal is external, so it kills the entire pipeline and prevents even the successfully extracted resources from being loaded.
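To illustrate the catching problem, with `fetch_pages` and `ApiError` standing in for my real fetching code:

```python
import dlt

@dlt.resource
def ads(account_config):
    try:
        for page in fetch_pages(account_config):  # placeholder fetcher
            yield page
    except ApiError:
        # to dlt this looks like a normally exhausted generator, so
        # everything yielded before the error still gets loaded: partial data
        return

# the only in-resource alternative I see: exhaust first, yield later
@dlt.resource
def ads_all_or_nothing(account_config):
    pages = list(fetch_pages(account_config))  # buffers everything in memory,
    yield from pages                           # no per-item/batch streaming
```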
Why I won't just run as many parallel pipelines as I have configurations
- I actually do this right now (roughly the sketch below), but:
- It's very wasteful in the normalization phase: spinning up the normalization process pool and recomputing schema evolution for lots of very small data chunks. It becomes far too inefficient once the number of pipelines reaches a couple of hundred.
- It's equally inefficient and resource-draining for my destination: I'd be much better off loading a few big chunks of data rather than hundreds of tiny ones.
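Roughly what this looks like today (one complete pipeline per configuration; the names, destination, and `configs` list are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

import dlt

def run_one(config):
    # a fully separate pipeline per ad account: extract + normalize + load
    pipeline = dlt.pipeline(
        pipeline_name=f"meta_ads_{config['account_id']}",
        destination="bigquery",  # placeholder destination
        dataset_name="meta_ads",
    )
    pipeline.run(meta_ads_source(config))

with ProcessPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(run_one, cfg) for cfg in configs]
    for future in futures:
        try:
            future.result()
        except Exception:
            pass  # best effort: a failed or timed-out config is simply skipped
```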
I was wondering whether something similar to this would be easily hackable in dlt (I'm pretty sure it's not possible by default), and I'm looking for ideas and hints. If there are any upcoming features planned that would make this easier, I'd love to hear about those too.
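To make "something similar to this" concrete, here's the kind of hack I'm imagining: every extract-only pipeline shares the same `pipeline_name` and `pipelines_dir`, so the extracted load packages land in one working folder, and a single pipeline then normalizes and loads them all in one go. I have no idea whether concurrent extracts into a shared working directory are safe, or whether a failed extract leaves a partial package behind; that's exactly the kind of thing I'd like hints on.

```python
import dlt

SHARED_DIR = "/tmp/meta_ads_work"  # made-up path for the shared working directory

def extract_one(config):
    pipeline = dlt.pipeline(
        pipeline_name="meta_ads",    # same name on purpose: shared working dir
        pipelines_dir=SHARED_DIR,
        destination="bigquery",      # placeholder destination
        dataset_name="meta_ads",
    )
    try:
        pipeline.extract(meta_ads_source(config))  # extract step only
    except Exception:
        pass  # best effort: drop this config (assuming no partial package remains)

# run extract_one for every config, ideally in parallel processes, then:

pipeline = dlt.pipeline(
    pipeline_name="meta_ads",
    pipelines_dir=SHARED_DIR,
    destination="bigquery",
    dataset_name="meta_ads",
)
pipeline.normalize()  # one normalization pass over all extracted packages
pipeline.load()       # one load of a few big chunks instead of hundreds of tiny ones
```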