snowplow-rdb-loader
RDB Loader: add possibility to enable transit load by default
There is a possible race condition that breaks the load when two pipelines are involved. With the current default behavior:
- Two pipelines, Big and Small, are loading data to the same DB
- Big starts at 0:00, Small starts at 0:15, and both have corresponding etl_tstamps
- Small finishes first and adds its etl_tstamp to the Load Manifest
- Then the Big load starts and Loader checks the last etl_tstamp in events and in the manifest. It finds that they're similar (but neither of them is correct) and aborts the job
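The sequence above can be reproduced with a toy model. This is only a sketch, not the Loader's actual code: `load`, `simulate`, and `LoadAborted` are hypothetical names, the dates are arbitrary, and the check is assumed to compare the latest etl_tstamp in events (after the new batch is staged) with the latest etl_tstamp in the manifest, treating "similar" as "equal".

```python
from datetime import datetime


class LoadAborted(Exception):
    """Raised when the simplified manifest check rejects a load."""


def load(events, manifest, batch_tstamp):
    # Hypothetical, simplified load: stage the batch, then compare the
    # latest etl_tstamp in events with the latest one in the manifest.
    staged = events + [batch_tstamp]
    if manifest and max(staged) == max(manifest):
        # The newest timestamp in events is already recorded in the
        # manifest, so the model assumes nothing new arrived and aborts.
        raise LoadAborted(f"etl_tstamp {max(staged)} is already in the manifest")
    events.append(batch_tstamp)
    manifest.append(batch_tstamp)


def simulate():
    events, manifest = [], []
    big = datetime(2018, 1, 1, 0, 0)     # Big pipeline started at 0:00
    small = datetime(2018, 1, 1, 0, 15)  # Small pipeline started at 0:15

    load(events, manifest, small)        # Small finishes first
    try:
        load(events, manifest, big)      # Big loads second, with an older tstamp
        return "loaded"
    except LoadAborted:
        return "aborted"


if __name__ == "__main__":
    print(simulate())
```

In this model the Big load is rejected only because its etl_tstamp is older than the one Small already recorded. Loading the same two batches in timestamp order passes the check.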
A quick workaround is to skip manifest_check. The correct fix would be to always enable transit load when two pipelines are involved (right now it is enabled automatically only when --folder is passed).
/cc @stdfalse