
AWS identity verification

Open bemoody opened this issue 9 months ago • 11 comments

Context

Pull #2086 added the ability to mirror project content on Amazon S3. This is now working and we are in the process of uploading open-access projects from PhysioNet.

The changes here will be needed once we start uploading restricted/credentialed projects, so that we can securely grant access to authorized users. (Identity verification aside, there are also some more significant changes that are needed for handling restricted/credentialed projects; see issue #2094.)

In brief: currently (in the old system Felipe set up), people are asked to self-report their AWS account number, and any person or service within that account would be allowed to access restricted data.

With these changes, in contrast, people will be asked to verify their personal AWS identity; subsequently, we'll be able to grant access only to verified identities (the latter part is yet to be implemented).

Why

DUAs for MIMIC and other databases require that data be shared only with authorized individuals (each person must register on PhysioNet and be credentialed). We want to enable cloud access for better performance, but complying with these DUAs requires knowing who is being granted permission to use these cloud services.

Moreover, although each user is ultimately responsible for data security, we want to encourage good practices. People may be using AWS for all sorts of reasons unrelated to PhysioNet. Giving themselves permission to access MIMIC through their personal account should not also grant permission to all of those unrelated and possibly-less-trusted services.

Some people may be using organizational AWS accounts rather than personal ones. Maybe we want to discourage this, or maybe not, but we can't prevent it. One member of an organization having access shouldn't grant access to everyone in the organization.

There is a lot about AWS authentication that is still a bit mysterious to me, but my gut feeling is that the "IAM user" level is the right level of authentication for PhysioNet and MIMIC.

It has been suggested that we could ask people to self-report their AWS username (or ARN?) in addition to their account number. That would be an improvement, but usernames are variable-length and may not be stable in the long term. Asking people to self-report their AWS userid would be better, but the userid is not easy to find, so people would be more likely to make mistakes.

Finally, I can imagine that in the future there may be other reasons for wanting to associate a PhysioNet account with an AWS account, and having a strong verification process could enable more interesting forms of integration.

How identity verification works

The concept is that we would have a special-purpose S3 bucket which allows access only if the path matches the requester's AWS account and userid. To prove your identity, you generate a signature for a URL that can only be accessed by you, and paste that signed URL into a form on the site.
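As a rough illustration of the idea (this is not the policy documented in deploy/README.md), a bucket policy can embed IAM policy variables in its Resource, so that each caller can only reach a key containing their own identifiers. In the sketch below, the function name, the s3:GetObject action, and the exact key layout are assumptions made for illustration; the policy variables ${aws:userid} and ${aws:PrincipalAccount} are the ones described in the AWS documentation.

```python
import json


def build_verification_policy(bucket: str, prefix: str = "physionet.org-verification") -> dict:
    """Return a hypothetical bucket policy for a verification bucket."""
    # AWS replaces ${aws:userid} and ${aws:PrincipalAccount} with the
    # requester's own unique ID and account number when it evaluates the
    # policy. "*" matches any characters (including "/"), so the site
    # must still check the exact path layout of the submitted URL itself.
    resource = (
        f"arn:aws:s3:::{bucket}/{prefix}/*"
        "/userid=${aws:userid}/account=${aws:PrincipalAccount}/*"
    )
    return {
        "Version": "2012-10-17",  # needed for policy variables to be expanded
        "Statement": [
            {
                "Sid": "AllowOwnIdentityPathOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",  # assumed action, for illustration
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # "asdfghjk" is the placeholder bucket name from the example command.
    print(json.dumps(build_verification_policy("asdfghjk"), indent=2))
```

Because the identifiers are filled in by AWS rather than by the requester, a signed URL for such a path is only accepted when it was signed by the identity named in the path.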

The process would be:

  1. You go to your cloud settings page on PhysioNet.

  2. We tell you to run the command aws sts get-caller-identity.

  3. You copy the output into the form.

  4. We then tell you to run a command like aws s3 presign s3://asdfghjk/physionet.org-verification/[email protected]/userid=AIDAABCDEFGHIJKL/account=112233445566/username=barackobama/.

  5. You copy the output into the form.

  6. We verify the format of the URL and submit it to AWS to verify the signature.
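The format check in step 6 could be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: the function name, the regular expression, and the assumption that the URL is a SigV4 presigned URL (carrying an X-Amz-Signature parameter) are all mine, and the key layout is modelled on the example command in step 4.

```python
import re
from urllib.parse import parse_qs, unquote, urlsplit

# Assumed key layout, modelled on the example command in step 4.
KEY_PATTERN = re.compile(
    r"physionet\.org-verification/"
    r"email=(?P<email>[^/]+)/"
    r"userid=(?P<userid>[A-Z0-9]+)/"
    r"account=(?P<account>[0-9]{12})/"
    r"username=(?P<username>[^/]+)/"
)


def parse_signed_url(url: str, bucket: str) -> dict:
    """Check the shape of a presigned URL and return the identity fields it claims."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("expected an https URL")
    host = parts.hostname or ""
    if not host.endswith(".amazonaws.com"):
        raise ValueError("not an AWS endpoint")

    # Decode each path segment separately so that an encoded "/" (%2F)
    # cannot smuggle extra path components past the pattern.
    segments = [unquote(s) for s in parts.path.lstrip("/").split("/")]
    if any("/" in s for s in segments):
        raise ValueError("encoded slash in path")

    if host.startswith(bucket + ".s3."):
        pass  # virtual-hosted style: <bucket>.s3.<region>.amazonaws.com/<key>
    elif host.startswith("s3.") and segments[0] == bucket:
        segments = segments[1:]  # path style: s3.<region>.amazonaws.com/<bucket>/<key>
    else:
        raise ValueError("URL does not refer to the verification bucket")

    match = KEY_PATTERN.fullmatch("/".join(segments))
    if match is None:
        raise ValueError("unexpected key layout")
    if "X-Amz-Signature" not in parse_qs(parts.query):
        raise ValueError("URL is not signed")
    return match.groupdict()
```

This only establishes what identity the URL claims. Whether the signature is genuine can only be decided by AWS, so the site would still have to request the URL and inspect the response; that part is not shown here.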

Wait a minute, what's this "userid" thing you keep talking about?

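The userid is the unique identifier that AWS assigns to each IAM entity (for an IAM user, a string beginning with AIDA). Unlike a username, it is not reused if the user is deleted and another user is later created with the same name. See: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-unique-ids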
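To make this concrete, here is a small sketch of how the output of aws sts get-caller-identity (a JSON object with UserId, Account and Arn fields) could be parsed to pick out the userid. The function name and the sample values are hypothetical (the values reuse the placeholders from the example command above), and the sketch assumes an IAM user in the standard "aws" partition.

```python
import json

# Sample output in the shape printed by `aws sts get-caller-identity`;
# the values are placeholders, not a real identity.
SAMPLE_OUTPUT = """
{
    "UserId": "AIDAABCDEFGHIJKL",
    "Account": "112233445566",
    "Arn": "arn:aws:iam::112233445566:user/barackobama"
}
"""


def parse_caller_identity(cli_output: str) -> dict:
    """Extract userid, account and username from get-caller-identity output."""
    data = json.loads(cli_output)
    userid, account, arn = data["UserId"], data["Account"], data["Arn"]

    # Only plain IAM users are handled in this sketch; roles and root
    # identities have differently shaped ARNs.
    prefix = f"arn:aws:iam::{account}:user/"
    if not arn.startswith(prefix):
        raise ValueError("not an IAM user identity")

    # An IAM user ARN may include a path (user/division/name); the
    # username is the last component.
    username = arn[len(prefix):].split("/")[-1]
    return {"userid": userid, "account": account, "username": username}


if __name__ == "__main__":
    print(parse_caller_identity(SAMPLE_OUTPUT))
```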

Setup and testing

Using this feature requires creating a special-purpose S3 bucket (one that probably should not be used for anything else).

For the time being, you can test this by setting AWS_VERIFICATION_BUCKET_NAME to bm-uverify-test1. I will delete that bucket once we've set up a permanent replacement under the PhysioNet AWS account.

If you want to see exactly how the verification bucket is created, and test it yourself, see the instructions in deploy/README.md.

Background

Although this implementation is guided by the needs of PhysioNet, my goal has been to design a general-purpose authentication protocol that could be used by any site that needs to verify cross-account AWS identities.

This is inspired in part by the technique used by Hashicorp Vault and discussed here:

  • https://ahermosilla.com/cloud/2020/11/17/leveraging-aws-signed-requests.html
  • https://www.hashicorp.com/resources/deep-dive-vault-aws-auth-backend

and similarly: https://stackoverflow.com/a/76099155

We could use the same method, but it would require the person to download and run a small program (and that program involves some pretty hairy digging into the AWS API).

The method proposed here, in contrast, only requires the person to install the official AWS CLI and run a couple of commands. I think that this is easier to understand and therefore paradoxically more secure (see if you can spot the security flaw in the StackOverflow answer).

For information about why this works, see AWS documentation on policy variables: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html

Also see the AWS CLI documentation:

  • https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html
  • https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html

bemoody, Oct 24 '23 22:10