S3: scrounging bandwidth for uploads
In pull #2086, one thing that worries me a little is that uploading projects will initially be very slow. At present, the upload process would have to compete for bandwidth with all of the clients currently downloading data.
I can think of a couple of workarounds:
- We could do the uploads from the backup (physionet-production) server. One advantage is that it's in a completely different physical location. However, it would be a messy manual process and we would probably need to manually update the database on physionet-live.
- We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately.
One thing I don't want to do is to prioritize uploads over client requests.
It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.
> It'd also be super nifty if we had a way to track the progress of background tasks in the admin console.
Good idea, this would be helpful!
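To sketch what that might look like (purely illustrative; the model name, the fields, and the idea of having the upload task update a row as it goes are all assumptions, not anything in the PR):

```python
# Hypothetical sketch: a row that a long-running task (e.g. the S3 uploads)
# could update as it goes, surfaced read-only in the Django admin.
from django.contrib import admin
from django.db import models


class BackgroundTaskProgress(models.Model):
    name = models.CharField(max_length=255)
    total_items = models.PositiveIntegerField(default=0)
    completed_items = models.PositiveIntegerField(default=0)
    updated_at = models.DateTimeField(auto_now=True)

    def percent_done(self):
        # Avoid division by zero before the task knows its total
        if not self.total_items:
            return 0
        return round(100 * self.completed_items / self.total_items)


@admin.register(BackgroundTaskProgress)
class BackgroundTaskProgressAdmin(admin.ModelAdmin):
    list_display = ('name', 'completed_items', 'total_items',
                    'percent_done', 'updated_at')

    def has_add_permission(self, request):
        # Progress rows are created by the task itself, not by admins
        return False
```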
Any thoughts on how we should manage the network traffic?
Uploading hundreds of projects will require some automation in any event. But I'd prefer to do so using Chrystinne's code rather than trying to script it in some other way.
Sorry, not my area of expertise! Personally I think I'd just take a short term hit on our network, perhaps alongside a news item explaining why downloads are slow.
Thinking about this a little more, my preference would be:
> We could configure the S3 client to use an HTTP proxy (via a separate network link, albeit from the same building). In fact, we could set a single proxy server for everything (GCP, DataCite, ORCID, as well as AWS), but I think it might be preferable to configure S3 separately.
This seems like an approach that may be useful in the longer term (rather than a one-off, just for the initial batch of uploads to AWS).
> Personally I think I'd just take a short term hit on our network, perhaps alongside a news item explaining why downloads are slow.
I don't think that's practical, though. When I say "competing for bandwidth", I mean that uploading to Amazon would be limited to the same speed as everyone else; uploading 30 TB would take months.
It's true that in theory we could monkey with the traffic control settings to prioritize certain connections over others, but that's difficult and finicky and I don't want to try to deal with it.
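For a rough sense of scale (the throughput figure here is just an assumed number for illustration): at around 30 Mbit/s of sustained upload, 30 TB works out to roughly three months.

```python
# Back-of-the-envelope transfer time; the 30 Mbit/s rate is an assumption.
data_bits = 30e12 * 8       # 30 TB expressed in bits
rate_bps = 30e6             # assumed sustained upload rate, 30 Mbit/s
days = data_bits / rate_bps / 86400
print(days)                 # ~93 days
```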
To set a custom proxy, something like this should work (assuming `botocore.config` is imported at the top of the module):
```diff
--- a/physionet-django/project/cloud/s3.py
+++ b/physionet-django/project/cloud/s3.py
@@ -58,7 +58,13 @@ def create_s3_client():
         session = boto3.Session(
             profile_name=settings.AWS_PROFILE
         )
-        s3 = session.client("s3")
+        config = botocore.config.Config()
+        if settings.AWS_HTTP_PROXY:
+            config.proxies = {
+                'http': settings.AWS_HTTP_PROXY,
+                'https': settings.AWS_HTTP_PROXY,
+            }
+        s3 = session.client("s3", config=config)
         return s3
     else:
         return None
```
https://stackoverflow.com/a/45492119
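For reference, the `AWS_HTTP_PROXY` setting used in the diff above might be wired up roughly like this; reading it from an environment variable (and the variable name itself) is just an assumption about how we'd configure it.

```python
# Hypothetical settings.py addition; only the AWS_HTTP_PROXY name is taken
# from the diff above, the environment variable is an assumption.
import os

# e.g. "http://proxy.example.org:3128"; leave unset to connect directly
AWS_HTTP_PROXY = os.environ.get('AWS_HTTP_PROXY')
```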