
Consider moving our storage from AWS to Digital Alliance

Open jcohenadad opened this issue 1 year ago • 20 comments

The spine-generic dataset is being increasingly downloaded, which comes at a cost. For example, the cost for 2023 was $478, which is not negligible (cost per month below):

>>> 34+33+26+24+37+37+40+90+32+24+45+56
478

I'm wondering how feasible/difficult it would be to move the git-annex server to a Digital Alliance cloud?

jcohenadad avatar Apr 03 '24 16:04 jcohenadad

I can't remember off the top of my head what Digital Alliance's policy is for this but I know I'd talked to Nick about it before. My memory is that they're happy to supply the bandwidth but I could be mistaken.

I'll submit a ticket about it now.

namgo avatar Apr 03 '24 16:04 namgo

It should be possible, but I'll just point out a technical difference:

  • For our internal dataset server, and for spineimage.ca, we're running NeuroGitea, which has both the repository metadata (file names, directories, commits, etc.) and the large data files (the actual .nii.gz files).
  • For spine-generic, we still want to host the repository metadata on Github. It's just the large data files that we want to host on the Digital Alliance cloud, instead of on Amazon S3.

So, it's not exactly the same interface/setup. But, I'm sure there exists open source software which can present the same interface as S3. Or, we can also look at the other types of special remotes that git-annex supports.

mguaypaq avatar Apr 03 '24 18:04 mguaypaq

Thank you for the clarification @mguaypaq. Would it make sense to move spine-generic to a NeuroGitea server?

jcohenadad avatar Apr 03 '24 18:04 jcohenadad

I think Github still gives us a lot of value:

  • All the existing links keep working.
  • All the issues/pull requests are there.
  • We don't have to manage user accounts for every collaborator.

Probably we just want to change the storage backend for the large data files; that should be a much smaller change from the point of view of our users. It should be very doable.

mguaypaq avatar Apr 03 '24 19:04 mguaypaq

In particular, it looks like the Digital Alliance already has a service that's compatible with S3, so maybe this can be easy: https://docs.alliancecan.ca/wiki/Arbutus_object_storage

mguaypaq avatar Apr 03 '24 20:04 mguaypaq

I heard back from Digital Alliance on Arbutus (way quicker than I expected... 30 minute turnaround time!).

There's no specific policy about using Arbutus for public datasets, but the bandwidth they can provide is pretty limited - unless it's primarily our own users who are downloading the dataset? They have fast uplinks to Canadian university networks, but relatively slow uplinks to everyone else, from the sounds of things.

Do we know how many external people are using the dataset?

namgo avatar Apr 04 '24 01:04 namgo

Do we know how many external people are using the dataset?

I'd say ~10 ppl/month? But uplinks don't matter too much. What matters is downlinks. And as long as it is not insanely slow (which it is not), we should be fine.

jcohenadad avatar Apr 04 '24 02:04 jcohenadad

I mean, uplink in this case would be their upload speed to external networks, which would affect download speed, but:

Well the CANARIE research backbone that connects all of Canada together has some pretty good interconnections with Internet2 and GEANT at the very least, so that should cover the USA and most of Europe.

I think that fits our needs, very cool!

@mguaypaq do you have a vision for how you might switch backends in git-annex? I imagine we can just import the dataset from S3 to Arbutus without much trouble, but I'm not sure how this works with git-annex.

namgo avatar Apr 04 '24 16:04 namgo

Just like git, git-annex supports having multiple remotes. So, I imagine we would:

  1. Figure out the right configuration and permissions for Arbutus object storage. (Probably you @namgo? Although maybe I can help since I have admin access to the existing Amazon S3 stuff.)
  2. Configure Arbutus as a new special remote for git-annex, alongside the existing Amazon special remote. (Probably me, since I'm most familiar with git-annex.)
  3. Copy over the files through git-annex, which will double as a test that new files can be uploaded by people with write access.
  4. Test that a new clone can get the files from the Arbutus special remote.
  5. Deconfigure the Amazon special remote from git-annex (but keep the files in place for a little while).
  6. Once everything works, delete the Amazon buckets.

So, a nice gradual transition, with plenty of opportunity to roll back if there are problems.
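
For concreteness, steps 3 to 5 would look roughly like the sketch below (the remote names arbutus and amazon are placeholders until steps 1 and 2 are actually done):

    # step 3: upload all annexed content to the new special remote
    git annex copy --to arbutus .
    # step 4: in a fresh clone, check that content can be fetched from the new remote
    git annex get --from arbutus .
    # step 5: stop offering the old remote; the S3 bucket itself is left untouched for now
    git annex dead amazon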

mguaypaq avatar Apr 04 '24 17:04 mguaypaq

@nullnik-0 is becoming our resident expert on ComputeCanada already! I'd be down to work alongside her since I have admin perms on our CC projects. I'll loop all three of us into a slack convo and we could talk about permissions.

namgo avatar Apr 04 '24 19:04 namgo

Preliminary tests seem to work! We should be able to migrate to Arbutus object storage fairly quickly and reduce our Amazon bandwidth costs.

Steps:

  • Following the Alliance docs for Arbutus object storage, I downloaded and sourced my OpenStack RC from the API access dashboard. Then I ran

    openstack ec2 credentials create
    

    and saved the resulting access key and secret key in my password manager.

  • In a fresh clone of spine-generic/data-single-subject, I created two special remotes:

    • arbutus-read: this one is world-readable and auto-enabled.

    • arbutus-write: this one has to be enabled manually, and can be used by people with ec2 credentials (like me from the previous point) to upload image files. Git-annex knows that it refers to the same bucket as arbutus-read.

    # copy-paste the access key and secret key, separated by a space, then press enter
    read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    env | grep AWS_
    
    git annex initremote arbutus-read type=S3 \
      autoenable=true \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      encryption=none \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=yes \
      publicurl=https://object-arbutus.cloud.computecanada.ca/def-jcohen-data-single-subject/ \
      requeststyle=path
    
    git annex initremote --sameas=arbutus-read arbutus-write type=S3 \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=no \
      requeststyle=path
    
  • Get some data files and push them to Arbutus:

    git annex get sub-douglas
    git annex copy --to arbutus-write sub-douglas
    
  • In a clone-of-my-clone, try to get the files, without having ec2 credentials:

    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    git annex get --from arbutus-read sub-douglas
    

I'm out of time for this week, but next week I'll try to migrate both data-single-subject and data-multi-subject to Arbutus.
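
If it helps anyone reproducing this, git-annex can also report where it thinks each copy lives; a quick check to run after the copy (file and remote names as in the test above):

    # for each file under sub-douglas, list which repositories have a copy
    git annex whereis sub-douglas
    # details about the new special remote (uuid, type, cost, ...)
    git annex info arbutus-read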

mguaypaq avatar Apr 04 '24 22:04 mguaypaq

Stumbling on that convo... You do not necessarily need to decommission the AWS storage: you can manage which special remote gets chosen by default (while keeping the others as a fallback) by setting a cost value on the special remote when initializing it (or when adding it with enableremote afterward).

Also, I don't think you need to set up 2 special remotes for read/write respectively; the first one should work for both (write with credentials only).
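
For example, something like this (a sketch: the remote name arbutus-read matches the earlier test, the cost values are arbitrary, and a lower cost means the remote is preferred):

    # pass cost= when enabling (or initializing) the special remote
    git annex enableremote arbutus-read cost=100
    # or set it per clone via the remote.<name>.annex-cost git config
    git config remote.arbutus-read.annex-cost 100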

bpinsard avatar May 10 '24 14:05 bpinsard

thank you for your insights @bpinsard!

a fallback is definitely useful, but don't we already have a backup of spine-generic @namgo ?

jcohenadad avatar May 11 '24 15:05 jcohenadad

@bpinsard have you gotten the single read/write remote to work in the past? I remember trying (in the past year) to use a single special remote for both, but couldn't get it to work. Possibly something to do with this interaction between the config settings.

mguaypaq avatar May 13 '24 15:05 mguaypaq

We do not use that setup in production (only authenticated access), but I just tested it with a MinIO S3 server (so not the same as Digital Alliance). There is one caveat though (which I think occurs with the split setup too): if someone has permanently set up AWS keys in their environment for another server/usage, that overrides the anonymous access and can cause 403 errors, because git-annex sends the credentials (it might depend on how each server deals with policies and anonymous access).

I think a good way to avoid that is to set up the S3 remote for read/write data management only, not autoenabled, and then add an httpalso sameas remote, crafting the https URL depending on the server, bucket name, and requeststyle.

git annex initremote https_download --sameas=s3_remote_name autoenable=true type=httpalso url=https://s3.unf-montreal.ca/test.publicurl/ cost=50 

This can save a lot of user-support headaches.
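
If I understand the httpalso approach correctly, a fresh clone would then only auto-enable the read-only https remote, so anonymous users never hit the credentials caveat at all. A rough sketch of what that looks like on the user side (reusing the https_download name from the example above and the sub-douglas example from earlier in this thread):

    # no AWS_* variables needed; the httpalso remote is read-only and auto-enabled
    git annex get --from https_download sub-douglas
    # or just let git-annex pick the cheapest available remote
    git annex get sub-douglas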

bpinsard avatar May 13 '24 16:05 bpinsard

Oh! I didn't know about the httpalso remote type, that makes a lot of sense. It's still two remotes with a sameas, but probably with fewer corner cases.

mguaypaq avatar May 13 '24 17:05 mguaypaq

a fallback is definitely useful, but don't we already have a backup of spine-generic @namgo ?

We have a backup of what's on Gitea (and what was on gitolite); however, I understand those to be git-annex archives rather than the datasets themselves.

namgo avatar May 13 '24 20:05 namgo

Whoops! I misunderstood: we don't have spine-generic backed up on restic. Mathieu helped me remember that that one's on GitHub.

namgo avatar May 13 '24 20:05 namgo

then we should probably create a backup, no?

jcohenadad avatar May 14 '24 13:05 jcohenadad

Good point. I made a ticket for getting this put in restic (with some questions for Mathieu) - should be pretty straightforward.

namgo avatar May 14 '24 14:05 namgo