gargantuan-takeout-rocket

S3 Targets, e.g. Cloudflare R2 for staging, AWS S3 for Deep Archive, Backblaze or Wasabi for lukewarm, etc.

Open • nelsonjchen opened this issue 2 years ago • 12 comments

General Issues to tackle:

Targets:

  • Hot
    • R2
      • For staging ahead of a local backup. No archive tier available, unfortunately. Waiting on lifecycle rules. Will personally be using it for local-backup staging if available. $15/mo/TB
      • Genuinely free download/upload (no egress fees)
  • Lukewarm
    • Backblaze B2
    • Wasabi
  • Cold
    • S3
      • For Deep Archive Tier (Equivalent Pricing to Azure)
      • Probably more comfy for AWS-natives
      • $1/mo/TB

nelsonjchen • Feb 18 '23

What is the reason for waiting for lifecycle rules? These destinations are quite cheap as they are, no? Also, uploading to R2 seems like a small step since the CF proxy is used anyway, no?

mderazon • Mar 05 '23

The biggest concern there for me is that it'll cost $15/month to host 1 TB of data on R2. That blows my budget by quite a lot. I want to make sure Cloudflare has some safeguards that a guide can walk people through setting up, to prevent that in case someone forgets to delete their staging area.

nelsonjchen • Mar 05 '23

I fleshed out the issue description a lot, @mderazon.

nelsonjchen • Mar 05 '23

fwiw, this is what I was trying to do with Workers: https://community.cloudflare.com/t/backup-directly-from-google-drive-to-r2/440132/5

mderazon • Mar 05 '23

> fwiw, this is what I was trying to do with Workers: https://community.cloudflare.com/t/backup-directly-from-google-drive-to-r2/440132/5

Hmm, that's such a weird usage of some APIs. You pass in a body which is just a ReadableStream, but then there's also queue size and part size. Doesn't that require some sort of seekable buffer or something? Maybe it blew up because those aren't compatible things you can do with a simple byte stream or a representation of a byte stream.
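
For context, the queue size / part size options line up with @aws-sdk/lib-storage's Upload helper (the library linked further down the thread). It doesn't need a seekable body: it reads the ReadableStream into in-memory buffers of partSize bytes and keeps up to queueSize part uploads in flight, which is exactly the kind of buffering and bookkeeping a tight CPU budget dislikes. A rough sketch of that usage, with placeholder bucket, key, and credential names:

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

declare const R2_KEY_ID: string, R2_SECRET: string; // assumed R2 API token credentials
declare const sourceResponse: Response;              // the Takeout download response

// Upload buffers the stream into partSize chunks in memory and runs up to
// queueSize multipart part uploads concurrently; no seeking is ever needed.
const upload = new Upload({
  client: new S3Client({
    region: "auto",
    endpoint: "https://<account-id>.r2.cloudflarestorage.com", // R2's S3 endpoint
    credentials: { accessKeyId: R2_KEY_ID, secretAccessKey: R2_SECRET },
  }),
  params: { Bucket: "takeout-staging", Key: "takeout-001.zip", Body: sourceResponse.body! },
  partSize: 10 * 1024 * 1024, // 10 MiB parts (5 MiB is the multipart minimum)
  queueSize: 4,               // parts uploaded in parallel
});
await upload.done();
```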

nelsonjchen • Mar 05 '23

You're doing a lot more orchestration in the worker than I did in my approach as well. In the prototype GTR Azure Transload from Cloudflare Workers, where the worker itself does the transloading, a lot of the orchestration happens in the extension, where it isn't bound by the silly 10ms CPU limit. The worker (or the many worker instances) is really just given two fetches, with the response body from one stuck into the other; no fat libraries doing things like part sizing and queueing are used, so the worker stays very dumb.
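
For contrast, a minimal sketch of that "dumb worker" shape, assuming the extension hands the worker a source URL and a pre-signed/SAS destination URL (parameter names here are illustrative, not GTR's actual protocol):

```ts
export default {
  async fetch(request: Request): Promise<Response> {
    // The extension does all the orchestration and just tells this worker
    // which bytes to copy where (query parameter names are made up).
    const params = new URL(request.url).searchParams;
    const sourceUrl = params.get("source"); // e.g. a Takeout download URL
    const destUrl = params.get("dest");     // e.g. a pre-signed or SAS upload URL
    if (!sourceUrl || !destUrl) return new Response("missing params", { status: 400 });

    const source = await fetch(sourceUrl);
    if (!source.ok || !source.body) {
      return new Response(`source fetch failed: ${source.status}`, { status: 502 });
    }

    // Piping the source response body straight into the outgoing PUT keeps CPU
    // time near zero; the runtime just shovels bytes between the two connections.
    const dest = await fetch(destUrl, { method: "PUT", body: source.body });
    return new Response(null, { status: dest.status });
  },
};
```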

nelsonjchen • Mar 06 '23

On that note about fat libraries, if I do try to tackle this, I'll probably be using https://github.com/mhart/aws4fetch and maybe just the raw stuff in there.
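
For reference, the raw aws4fetch shape would look roughly like this, with placeholder bucket, key, and endpoint names; note that a streaming body would likely need an explicit `x-amz-content-sha256: UNSIGNED-PAYLOAD` header, since SigV4 otherwise wants the whole payload hashed up front:

```ts
import { AwsClient } from "aws4fetch";

declare const R2_KEY_ID: string, R2_SECRET: string, ACCOUNT_ID: string; // assumed credentials

const r2 = new AwsClient({
  accessKeyId: R2_KEY_ID,
  secretAccessKey: R2_SECRET,
  service: "s3",
  region: "auto",
});

// Sign and send a plain S3 PutObject against R2's S3-compatible endpoint.
// `chunk` is an ArrayBuffer already fetched from the source, so aws4fetch can
// hash it for the SigV4 signature without any extra buffering machinery.
async function putChunk(key: string, chunk: ArrayBuffer): Promise<Response> {
  return r2.fetch(
    `https://${ACCOUNT_ID}.r2.cloudflarestorage.com/takeout-staging/${key}`,
    { method: "PUT", body: chunk },
  );
}
```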

nelsonjchen • Mar 06 '23

I don't think the size of the library makes any difference, as it could be one line in the library that does some CPU and that would be it. In the case of the library I used, the culprit might be somewhere around these lines of code https://github.com/aws/aws-sdk-js-v3/blob/ce7cc58b15fd7ba0bd2b10c7a471b4c8ce95b7d9/lib/lib-storage/src/Upload.ts#L309-L355

There's also this: https://community.cloudflare.com/t/streaming-large-remote-files/14501/3

I will try the lib you mentioned in my code to see if it makes a difference.

mderazon • Mar 06 '23

Just noting this down here: https://developers.cloudflare.com/workers/platform/limits/#simultaneous-open-connections

There is a limit of 6 simultaneous open connections. Theoretically, one worker call can manage about 3/10ths the speed of the current Azure transloading.
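
A hypothetical way to stay under that cap from inside one invocation, assuming each part is one source fetch piped into one upload fetch (two connections per part, so three parts in flight uses all six):

```ts
// Hypothetical helper, not GTR code: run at most `limit` transloads at once so
// a single invocation never exceeds the simultaneous-connection ceiling.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from({ length: limit }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}

// e.g. await mapWithConcurrency(parts, 3, transloadPart);
```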

nelsonjchen • Mar 17 '23

https://developers.cloudflare.com/r2/buckets/object-lifecycles/

Lifecycle rules have been added.
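
Now that they exist, the cost-safeguard rule discussed earlier could be set up in the dashboard, or programmatically along these lines (a sketch assuming R2's S3 API accepts PutBucketLifecycleConfiguration; the bucket name and 7-day window are placeholders):

```ts
import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

declare const R2_KEY_ID: string, R2_SECRET: string; // assumed R2 API token credentials

const s3 = new S3Client({
  region: "auto",
  endpoint: "https://<account-id>.r2.cloudflarestorage.com",
  credentials: { accessKeyId: R2_KEY_ID, secretAccessKey: R2_SECRET },
});

// Expire anything left in the staging bucket after a week, and clean up
// multipart uploads that were never completed.
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "takeout-staging",
  LifecycleConfiguration: {
    Rules: [
      {
        ID: "expire-staging",
        Status: "Enabled",
        Filter: { Prefix: "" },
        Expiration: { Days: 7 },
        AbortIncompleteMultipartUpload: { DaysAfterInitiation: 7 },
      },
    ],
  },
}));
```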

nelsonjchen • May 05 '23

I'm keeping an eye on this project and wanted to ask: now that lifecycle rules have been added, is the last missing piece for sending to any S3-compatible storage the remote-fetch feature that Azure Storage has?

mderazon • Sep 20 '23

The last missing piece is acceptable performance. The 100 MB POST limit inside Workers was extremely annoying. Is it still there? It caps the speed at roughly 3/10ths of Azure's and makes the request count spike to the point where it smashes into the free plan's request ceiling.
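
Rough arithmetic as an illustration (the sizes are assumptions, not measurements): Takeout archives top out at 50 GB each, so if every 100 MB piece ends up as its own worker call, one archive is on the order of 500 invocations; a multi-terabyte Takeout pushes that into the tens of thousands, which is where the free plan's 100,000-requests-per-day ceiling starts to matter.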

I haven't touched this issue in some time; I might resurrect it now that I've got a new 8 TB drive to back up to.

nelsonjchen • Sep 20 '23