tremor-runtime
Support GCS resumable uploads
Describe the problem you are trying to solve
When backing up events to GCS it is common practice to store files that contain data aggregated by some unit of time, be it a day or an hour. A common object name format, which is also supported by big data tools like Hive or Spark, looks something like:
some_prefix/year=2021/month=6/day=9/hour=11/data.json
or similar schemes.
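For illustration, such a name can be derived from an event's ingest timestamp (nanoseconds since the epoch). This is a hypothetical helper sketch, not offramp code; the function name and the use of chrono are assumptions:

```rust
use chrono::{DateTime, Datelike, Timelike, Utc};

/// Hypothetical helper: build a Hive/Spark-style partitioned object name
/// from an event timestamp given in nanoseconds since the epoch.
fn object_name(prefix: &str, ingest_ns: u64) -> String {
    let ts: DateTime<Utc> = DateTime::from_timestamp(
        (ingest_ns / 1_000_000_000) as i64,
        (ingest_ns % 1_000_000_000) as u32,
    )
    .expect("valid timestamp");
    format!(
        "{}/year={}/month={}/day={}/hour={}/data.json",
        prefix,
        ts.year(),
        ts.month(),
        ts.day(),
        ts.hour()
    )
}
```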
To upload events to a GCS object we currently need to use the upload_object offramp command:
https://docs.tremor.rs/artefacts/offramps/#upload_object
the "normal" GCS upload_object API requires the whole object to be uploaded with 1 request. To support aggregated files, this would require to buffer all events inside a tumbling window or a batch, which is dangerous as it could blow up memory depending on the volume of events for the aggregation period.
The problem we are trying to solve is to stream data to GCS objects in the aforementioned format without locally buffering the whole object.
Describe the solution you'd like
Luckily the GCS API supports resumable uploads where data is distributed across multiple requests:
https://cloud.google.com/storage/docs/resumable-uploads
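For context, the protocol behind resumable uploads looks roughly like this, sketched here with reqwest's blocking client purely for illustration (bucket, object name, token handling and chunking are placeholder assumptions, not how the offramp is or would be implemented):

```rust
use reqwest::blocking::Client;

/// Minimal sketch of the GCS resumable upload flow:
/// 1. initiate a session, 2. upload chunks with Content-Range, 3. finalize.
fn resumable_upload(
    client: &Client,
    token: &str,
    bucket: &str,
    name: &str,
    chunks: &[Vec<u8>],
) -> reqwest::Result<()> {
    // 1. Initiate the session; GCS returns the session URI in the Location header.
    let init = client
        .post(format!(
            "https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o?uploadType=resumable&name={name}"
        ))
        .bearer_auth(token)
        .header("X-Upload-Content-Type", "application/json")
        .send()?;
    let session_uri = init.headers()["location"].to_str().unwrap().to_string();

    // 2./3. Stream the data across several PUTs. Intermediate chunks must be
    // multiples of 256 KiB; the total size stays unknown ("*") until the last one.
    let mut offset: u64 = 0;
    let last = chunks.len() - 1;
    for (i, chunk) in chunks.iter().enumerate() {
        let end = offset + chunk.len() as u64 - 1;
        let total = if i == last { (end + 1).to_string() } else { "*".to_string() };
        client
            .put(&session_uri)
            .header("Content-Range", format!("bytes {offset}-{end}/{total}"))
            .body(chunk.clone())
            .send()?; // 308 for intermediate chunks, 200/201 on the final one
        offset = end + 1;
    }
    Ok(())
}
```

The important property for us is the chunked part: data can be sent as events arrive, with the total size left as `*` until the finalizing request.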
We would like to support resumable uploads in the GCS offramp.
For this we would need to add a marker to the upload_object command indicating that this is the last upload for the object. For cases where the size is not known on the last payload (e.g. time-wise aggregation where the name of the object changes because of a timestamp), we could also add such a marker on the next payload, i.e. the first payload for the new object would finalize the previous one.
If possible, such uploads should work across tremor restarts, so the state the offramp keeps around for resumable uploads needs to be persisted.
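To make that last point concrete, here is a rough sketch of the per-object state that would have to survive a restart; the struct and its field names are purely illustrative, not an existing tremor type:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative sketch of the per-object state the offramp would have to
/// persist so an in-flight resumable upload can be continued after a restart.
#[derive(Serialize, Deserialize)]
struct ResumableUploadState {
    /// Target object, e.g. "some_prefix/year=2021/month=6/day=9/hour=11/data.json"
    object_name: String,
    /// Session URI returned by GCS when the resumable upload was initiated.
    session_uri: String,
    /// Number of bytes already acknowledged by GCS (the next chunk starts here).
    bytes_committed: u64,
}
```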
cc @jigyasak05 FYI 😎
Nice, I can look into this! :D
Though I am happy that you wanna look into this, don't feel like you have to. I just wanted to show you this issue as you're the author of this offramp. This can get hairy to test and stuff. Are you sure? ;-)
Haha! I can try :-D