
Obtain S3 credentials via IAM roles

Open · jmarshall opened this issue 8 years ago · 23 comments

In addition to the use case where a session token is provided as part of a federated access model, probably the more common use case is using IAM roles in combination with EC2 instances. Briefly, this makes temporary credentials available to the instance at a local IP address, obviating the need to ever put long-term security credentials on an instance. s3cmd supports this method of key delivery and will 'look' for the availability of these keys out of the box.

[As noted by @obenshaindw on PR #303]

jmarshall avatar Mar 02 '16 11:03 jmarshall

I see botocore also falls back to contacting the IAM well-known endpoint. HTSlib could also do this, although I would have thought there would be some common script for copying these into environment variables so that such code was not multiplied across clients and SDKs…? e.g. eval something like

# Ask the instance metadata service for the role name, then for that role's
# temporary credentials, and print them as a shell export statement.
baseurl='http://169.254.169.254/latest/meta-data/iam/security-credentials'
role=$(curl -s "$baseurl")
curl -s "$baseurl/$role" | jq -r "\"export AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\""
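To pick the credentials up in the current shell, one would eval the output, e.g. (a sketch; the snippet above saved as a script whose filename is purely illustrative):

# Evaluate the printed export line so the AWS_* variables land in this shell.
eval "$(sh get-iam-creds.sh)"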

HTSlib-based tools (e.g. samtools) are used on EC2 and, especially, outside EC2. If you're not on EC2, contacting random link-local IP addresses and asking them for credentials would be rather unfortunate (to put it mildly), so I'd really rather not do this by default. I would have thought there would be a convention of only doing this when $AWS_PROFILE is set to a special :IAM value or something like that?
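As a minimal sketch of such a guard (the :IAM value is hypothetical, not an implemented convention):

#!/bin/sh
# Only contact the link-local metadata endpoint when the user has explicitly
# opted in; ":IAM" here is a hypothetical marker value, not an existing feature.
baseurl='http://169.254.169.254/latest/meta-data/iam/security-credentials'
if [ "$AWS_PROFILE" = ":IAM" ]; then
    role=$(curl -s "$baseurl")
    curl -s "$baseurl/$role"
fi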

jmarshall avatar Mar 02 '16 11:03 jmarshall

Any script that reads the instance metadata for security credentials and sets environment variables from them would need to be run on a schedule, or immediately prior to running samtools (at least every hour, according to the previously linked AWS documentation, because credentials are automatically rotated).
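For example, a crontab entry along these lines (the script and output path are hypothetical) could regenerate an environment file twice an hour, comfortably inside the rotation interval, for jobs to source before running samtools:

# Refresh the exported credentials every 30 minutes.
*/30 * * * * /usr/local/bin/get-iam-creds.sh > /run/aws-creds.env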

The advantage here is not having to leave any credentials on the instance, and if using an application that reads directly from instance metadata, not having to set anything environment-specific.

I agree that this would seem like a special case for samtools (i.e., a piece of code that is environment-specific). It makes sense for boto to have this functionality since it is an SDK for Amazon Web Services. As you say, if one is using samtools on EC2 they could write a script to read from the instance metadata endpoint and place the security credentials in a location where samtools will find them.

obenshaindw avatar Mar 02 '16 17:03 obenshaindw

Replacing the current custom request-signing code with the AWS C++ SDK would solve this. But that is a big project, and would likely be better done by pulling the hfile_s3.c plugin out of the root source tree and into its own contrib project.

In the meantime, I've built a Docker container that has S3 protocol support and also a wrapper script that queries the metadata and then calls samtools. This should work for both EC2 instances and ECS Tasks. The wrapper script contents are simply:

#!/bin/bash
# Endpoint for EC2 instance-role credentials.
B='http://169.254.169.254/latest/meta-data/iam/security-credentials/'
CREDS=''
if [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}" ]; then
    # ECS task role: credentials come from the container credentials endpoint.
    CREDS=$(curl -s --connect-timeout 0.1 "169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}" | jq -r "\"AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\"")
else
    # EC2 instance role: the first request returns the role name, the second
    # returns that role's temporary credentials.
    R=$(curl -s --connect-timeout 0.1 "$B")
    if [ -n "$R" ]; then
        CREDS=$(curl -s --connect-timeout 0.1 "$B$R" | jq -r "\"AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\"")
    fi
fi
eval ${CREDS} samtools-s3 "$@"

Docker container is at https://hub.docker.com/r/delagoya/samtools-ecs-s3/

I've tested it out on AWS Batch to pull the header from a private S3 BAM file object.
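For example, something along these lines (the wrapper filename and the bucket/object names are placeholders, not the container's actual entry point):

# Inside the container: the wrapper fetches credentials from the metadata
# endpoint and then runs samtools with them in its environment.
./samtools-wrapper.sh view -H s3://my-bucket/sample.bam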

delagoya avatar Apr 27 '18 17:04 delagoya

Are there any updates on whether S3 credentials via IAM roles will be supported? The workaround has trouble because the credentials are rotated, so any job that runs for more than a few minutes can suddenly fail in the middle.

hguturu avatar Apr 11 '20 21:04 hguturu

Can you give an example of a failure? The code above should pull fresh credentials every time the Docker container is executed.

delagoya avatar Apr 12 '20 16:04 delagoya

The curl command above (curl -s --connect-timeout 0.1 $B$R) returns an object with a field such as "Expiration" : "2020-04-14T01:20:09Z". This expiration is a fixed time from when the credentials were last refreshed. So if the command is launched 10-15 minutes before the token expires and runs for more than 15 minutes, it breaks in the middle because its token has expired. I assume the SDK handles this by refreshing tokens as they expire. It's hard to reproduce, since the command itself works fine; it's just a matter of token-refresh timing.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials seems to suggest that new credentials are made available at least 5 minutes before the old ones expire, so the window for failure is tight, but it gets longer when streaming large files from S3 (e.g. a 200 GB+ BAM file).
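For illustration, the remaining lifetime can be checked like this (a sketch assuming GNU date and jq, run on an EC2 instance with a role attached):

# How long do the current instance-role credentials have left?
B='http://169.254.169.254/latest/meta-data/iam/security-credentials/'
R=$(curl -s "$B")
EXP=$(curl -s "$B$R" | jq -r .Expiration)
echo "Credentials expire in $(( $(date -d "$EXP" +%s) - $(date +%s) )) seconds"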

hguturu avatar Apr 13 '20 20:04 hguturu

Is the problem caused by trying to do lots of index look-ups, or by having a very long delay between generating the credentials and actually trying to open the file?

The first one isn't too easy to fix, as currently htslib only reads the credentials on S3 file open (although it does know how to re-sign the request if necessary). The second problem might not be so difficult. Instead of passing the credentials in environment variables, as shown above, you could store them in one of the configuration files supported by hfile_s3. Then you could run a program that occasionally requests a new token and writes it into the configuration file (ensuring you do this safely, i.e. write to a new file and then rename it on top of the old one). That way, when hfile_s3 tries to get the file, it should always have a token with a reasonable amount of lifetime left.
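The safe-replace step might look like this (write_s3cfg is a hypothetical helper that prints the configuration):

# Write the refreshed credentials to a temporary file, then rename it over
# the old one; rename within a filesystem is atomic, so a reader never sees
# a half-written configuration file.
write_s3cfg > "$HOME/.s3cfg.tmp" && mv "$HOME/.s3cfg.tmp" "$HOME/.s3cfg"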

daviesrob avatar Apr 14 '20 16:04 daviesrob

Yeah, it's unfortunate that the first type of use case doesn't have an easy fix with this workaround. I think that use case also applies to larger streams, which can be cut off midstream. But I guess the only fix would be to update hfile_s3 to obtain those credentials itself, so that it knows when they are invalid and can refresh them, or to switch to the SDK, which can handle credentials for you. Either of those is a bigger task, which is why I was wondering whether there are any plans for such an update?

hguturu avatar Apr 15 '20 00:04 hguturu

Looking at this, I think the best solution would be to add the AWS JSON metadata format to the list of configuration files that hfile_s3 can use. It could then extract the expiration time and use it to renew the credentials when it needs to.
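For reference, the JSON document served by the instance metadata endpoint looks like this (secret values elided; the Expiration field is the one quoted above):

{
  "Code" : "Success",
  "LastUpdated" : "...",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "...",
  "SecretAccessKey" : "...",
  "Token" : "...",
  "Expiration" : "2020-04-14T01:20:09Z"
}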

I'm not sure exactly when we can get it working, but we can certainly put it on our to-do list.

daviesrob avatar Apr 15 '20 11:04 daviesrob

It would be great if it gets added to the to-do list! Thanks!

hguturu avatar Apr 15 '20 17:04 hguturu

Hi, I'm curious if you guys still consider this. It would be very useful.

markotitel avatar Feb 25 '22 10:02 markotitel

It is still on the list of things to do. Unfortunately it is not that high up on the list at the moment.

whitwham avatar Feb 25 '22 11:02 whitwham

I just want to express my interest in this as well. Using samtools in AWS Batch + Cromwell for genome data currently requires a lot of localization (copying), which could be avoided if samtools could use the IAM role when accessing S3 data.

bioIT-UZA avatar Mar 10 '22 07:03 bioIT-UZA

When handling non-public data, the value of this feature is substantial. It has been allowed to linger for more than 7 years and remains important, as demonstrated by a steady stream of comments and upvotes over the years.

I thought this solution from igv.js was notable and should be linked in this thread: https://github.com/igvteam/igv.js/issues/1709. Besides its technical value, it's a clear message that other parts of the bioinformatics community rely on S3.

I'd like to request that the "Low Priority" tag be removed from this issue.

cariaso avatar Dec 06 '23 15:12 cariaso

Hmm, I obviously forgot to tag this ticket when working on PR #1462, which added support for refreshing credentials like the ones you get from IAM. It turns out that there are a few ways of getting credentials out of AWS, which all have subtle (or not-so-subtle) differences. Rather than try to support them all in HTSlib, it's left to an external script to get the credentials and save them in a format that HTSlib can use.

A simple IAM example can be found in the htslib-s3-plugin manual page, although for production use you might want something a bit more robust. Basically you just run it in the background and it grabs the IAM credentials and saves them in .aws/credentials format for use by HTSlib. It also adds a slightly-unofficial expiry_time key, which tells HTSlib when it may need to re-read the file. The script wakes up and replaces the file well before the expiry so that the stored credentials should always be fresh. (Note also that it takes care to replace the file atomically so that HTSlib will never see a half-written copy). Hopefully it shouldn't be too difficult to integrate something like this into your workflows.
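For anyone who wants the shape of it, here is a minimal sketch in the same spirit (this is not the manual-page script itself; the paths, profile name, and timings are illustrative):

#!/bin/bash
# Periodically fetch the IAM instance-role credentials and store them in
# .aws/credentials format, including the slightly-unofficial expiry_time key
# that tells HTSlib when it may need to re-read the file.
B='http://169.254.169.254/latest/meta-data/iam/security-credentials/'
OUT="$HOME/.aws/credentials"
while true; do
    ROLE=$(curl -s "$B")
    JSON=$(curl -s "$B$ROLE")
    cat > "$OUT.tmp" <<EOF
[default]
aws_access_key_id = $(jq -r .AccessKeyId <<<"$JSON")
aws_secret_access_key = $(jq -r .SecretAccessKey <<<"$JSON")
aws_session_token = $(jq -r .Token <<<"$JSON")
expiry_time = $(jq -r .Expiration <<<"$JSON")
EOF
    mv "$OUT.tmp" "$OUT"   # atomic replace: HTSlib never sees a partial file
    sleep 600              # wake well before the ~hourly rotation
done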

daviesrob avatar Dec 07 '23 17:12 daviesrob

The simplest way for HTSlib to do this would be to use the AWS C++ SDK; that way you get all of the logic of the AWS credential chain, the SigV4 implementation, and APIs that construct the calls to the service endpoints.

It would require some wrapper code to extern the relevant calls of the C++ library, but that is probably easier (and a lot safer) than attempting to implement the credential chain yourself. Example at https://stackoverflow.com/questions/56842849/how-to-call-aws-cpp-sdk-functions-from-c

Use of a credentials file is not recommended for anything other than a local machine (a laptop etc.), and even then it should only contain temporary credentials. Use of a credentials file in a container, on an EC2 instance, etc. is not recommended, especially one with long-lived credentials. This is where roles should be used exclusively.

markjschreiber avatar Jan 16 '24 15:01 markjschreiber

When this code was originally written, I investigated using the AWS SDK rather than the AWS protocol documentation. HTSlib is a C library used largely in environments without strong sysadmin skills or abilities to manage dependencies effectively. Adopting a multi-million line dependency written in C++ was a non-starter.

jmarshall avatar Jan 16 '24 21:01 jmarshall

I can see that, it will certainly add some overhead.

The use of the library in environments without strong sysadmin skills is part of the reason that I am worried about the current requirement for credentials files. People may not be aware of the risks they are taking, especially as I know many users will probably be running with fairly privileged IAM roles that are, or are close to, Admin.

I would imagine the most user-friendly way to do it would be to install from a package manager (https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/sdk-from-pm.html), installing only the relevant service packages (to avoid the multi-million lines of code), and then check for them in your install script in the same way you do for other dependencies. Certainly requiring someone to build the library along with samtools is not something most would want to attempt.

It's certainly not zero work, but it might be easier to maintain in the long term, and it gives you easy access to the full S3 API and to all aspects of the credential chain, which would make it easier to use in more exotic scenarios such as supporting users with SSO-provided credentials, containers on Kubernetes, etc.

markjschreiber avatar Jan 16 '24 21:01 markjschreiber

…Or to get around to locally implementing IAM role parsing e.g. from a well-known endpoint, as discussed on #303. Implementing that is one of the reasons HTSlib acquired a JSON parser, and I may have a partial draft still lying around from 2016 somewhere…

jmarshall avatar Jan 16 '24 21:01 jmarshall

As the debate has been opened, here's my two pennies' worth.

My heart sinks whenever I'm attempting to install something that has a dependency on vast C++ libraries. I don't know anything about the AWS one and I'm sure it's fine, but far too many are not and I've wasted hours battling cmake and its bizarre choices of which C++ compiler and library versions it wants to use. I seem to have a disproportionate rate of failure with such things. Yes I could install binaries instead, but that's not always an option depending on the environment, and we support a wide range of platforms for which binaries may not be available.

Htslib aims to be small and lightweight, with a minimal number of dependencies. Where required due to complexity it's fine (eg we're really not going to rewrite lzma or the various encryption libraries), but still we've limited ourselves to code written in the same language as htslib itself (even to the extent of porting a little bit here and there). I think this is sensible.

I've previously explored linking C++ from C and while it's possible and not even that hard for an executable, it has knock on effects to every user of a library, and their environments may well be very different. You may even need users of htslib to start linking with C++ too, if there are static classes with constructors in use. Note some use htslib in other libraries, and some end up in other language bindings. The downwards dependency chain is vast and by switching languages we're making demands on all of that dependency chain too. So it's a "write once, test everywhere" scenario that is simply best avoided.

We could migrate the S3 code into a plugin and avoid some of the language shenanigans, but that's also a bit messy and it feels a bit like the tail wagging the dog. In this case, the extra code to support this ourselves is potentially less of a maintenance burden than attempting to use the official library, so it's worth investigating that route first.

If you wish, consider this as feedback: C is much more portable than C++, so a lightweight cut-down AWS C API would have an appreciative audience.

Regarding security, I do agree that avoiding credentials files would be preferable.

jkbonfield avatar Jan 17 '24 09:01 jkbonfield

Actually, it turns out that there is a C library for authentication, which I found two submodules down from the C++ API. Other C interfaces to AWS can be found here; the auth library depends on some of them, which unfortunately makes it non-trivial to use. The Apache license everything comes under might cause us problems too, depending on how we link everything together.

One of the reasons why I didn't go for reading the JSON directly was that there's now so many interfaces for getting credentials on AWS, which all work with different formats. This can be seen in the C library where functions can be found for all the different options. It seemed easier to delegate this complexity to an external program. I do note that there's a comment suggesting the more popular options so we might be able to get away with just concentrating on those.

It would also be fairly easy to make an external module that used the C++ API in the same manner as the ones in htslib-plugins. If given a suitably high priority, it would override the built-in S3 handler, allowing for an easy replacement.
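For example, HTSlib consults the HTS_PATH environment variable when searching for plugins, so deploying such a module could be as simple as (directory name illustrative):

# Prepend a local plugin directory to HTSlib's plugin search path; the
# trailing colon is intended to keep the default locations on the path too.
export HTS_PATH=/opt/hts-plugins:
samtools view -H s3://my-bucket/sample.bam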

daviesrob avatar Jan 17 '24 10:01 daviesrob

Apache 2 is fairly permissive, so I think you will be OK to link to it. It doesn't have the so-called "viral" properties of the GPL and the like. If you had to copy the code, that part would likely need to remain Apache 2, so it is best kept as a module/plugin.

I agree that the real goal is to find the most pain-free way of dealing with the multitude of authorization options supported by AWS credentials. Probably that means delegating to an external implementation, as long as it doesn't bring too many complexities. A plugin seems like a nice way to go and keeps the main library lightweight for people who don't need S3 support.

Speaking from personal experience, being able to use HTSlib and samtools from an EC2 instance or a container on ECS, reading directly from S3, would solve a lot of headaches with staging/unstaging objects to block storage.

markjschreiber avatar Jan 17 '24 14:01 markjschreiber