Figure out request authentication

Open dionhaefner opened this issue 6 years ago • 20 comments

dionhaefner avatar Jul 09 '18 14:07 dionhaefner

OAuth 2.0 seems to be the standard for APIs that require authentication. Something like this maybe: https://docs.authlib.org/en/latest/flask/2/

mrpgraae avatar Dec 16 '18 15:12 mrpgraae

Good point. The initial issue was mostly about Lambda deployments though (in case you want to serve sensitive data). I guess there are different options (whitelist certain hosts, sign requests somehow, or implement some session mechanism via the database), but I don't know which would be easiest.

dionhaefner avatar Dec 16 '18 16:12 dionhaefner

I think the easiest and also the most common mechanism is a session cookie, stored in the database, passed by the user as a query param on every request, and then compared with the one in the db. This of course requires https to be secure, but that should be standard anyways.

This is also simple to implement, which I think is very important for security stuff. Implementing our own security already seems a bit scary to me :smile:

mrpgraae avatar Dec 16 '18 21:12 mrpgraae

Then the actual mechanism for handing out the session cookie (which I think is a bit more tricky to get right, security-wise) can be totally independent of the given deployment and also of Terracotta, as long as the session ends up in the TC database somehow.

mrpgraae avatar Dec 16 '18 21:12 mrpgraae

I'm a bit reluctant to put runtime state into the driver DB. If we can, we should first explore other options that don't require changes to Terracotta, i.e., use the appropriate AWS and web server functionality to prevent unauthorized requests from reaching Terracotta in the first place.

dionhaefner avatar Dec 17 '18 08:12 dionhaefner

For now there is always the option of hiding Terracotta behind a web proxy that takes care of authentication.

mrpgraae avatar Dec 17 '18 09:12 mrpgraae

That's definitely part of the "figure out" I meant in the title. It was mostly about exploring a few options to the point where we can give a recommendation.

dionhaefner avatar Dec 17 '18 09:12 dionhaefner

This has become relevant again in GRAS. The way I see it, there are 2 directions we can go:

External to TC

The user can set up a web proxy for self-hosted deployments that takes care of authentication. For Lambda, they can use this https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html

pros:

  • Probably very secure
  • External to TC, so not maintained / implemented by us
  • Zappa seems to easily support Lambda authorizer https://github.com/Miserlou/Zappa#api-gateway-lambda-authorizers

cons:

  • Less user friendly. The user might have to do more work themselves to set this up.

Internal to TC

Some form of token-based auth. We could for example do something like this https://stackoverflow.com/questions/32510290/how-do-you-implement-token-authentication-in-flask

pros:

  • More user friendly, we can make a nice API for this

cons:

  • Less secure, could go wrong and then we're responsible
  • Maintained by us
  • User still has to manage user registration / deactivation

mrpgraae avatar Jan 29 '19 11:01 mrpgraae

I don't think the Flask authentication works across different workers.

More cons to the internal method:

  • We would need mandatory SSL to prevent credentials from being sniffed
  • Needs an additional data storage for credentials and / or session tokens

dionhaefner avatar Jan 29 '19 14:01 dionhaefner

I don't think the Flask authentication works across different workers.

Good point. The token would have to be stored in a db instead.

We would need mandatory SSL to prevent credentials from being sniffed

Yes, https would be mandatory, but the same is true of the external method.

That being said, I am also (I suspect you are as well) leaning heavily towards the external solution. We can maybe use some Zappa magic to make it easier for users to use authentication on Lambda.

mrpgraae avatar Jan 29 '19 16:01 mrpgraae

We can maybe use some Zappa magic to make it easier for users to use authentication on Lambda.

https://github.com/Miserlou/Zappa#api-gateway-lambda-authorizers

j08lue avatar Feb 03 '20 13:02 j08lue

Reviving this issue, as I came up with an idea for authentication that might actually make sense to implement into Terracotta itself. @dionhaefner @j08lue @nickeopti let me know what you think.

JWT Token

The client attaches a JWT bearer token to the Authorization header of every request to Terracotta

Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJwYXRoIjoiL3JlZmxlY3RhbmNlcy8yMDE4MDUyNCIsImlhdCI6MTUxNjIzOTAyMiwiZXhwIjoxNjQ1MTc1MjE2fQ.VYvqBE9sZifgZheZ7LRzB6pNdy6VhxB8tA0iUJbJ204

Taking apart this token, we see that the payload looks like this

{
  "path": "/reflectances/20180524",
  "iat": 1516239022,
  "exp": 1645175216
}
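As a quick check, the payload of the example token can be read with nothing but the standard library. This is only a sketch for inspecting a token: it decodes the base64url segments and does not verify the signature, so it must not be used to make authorization decisions. The helper name `decode_segment` is made up for this illustration.

```python
import base64
import json

# The example token from the Authorization header above, split for readability.
TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJwYXRoIjoiL3JlZmxlY3RhbmNlcy8yMDE4MDUyNCIsImlhdCI6MTUxNjIzOTAyMiwiZXhwIjoxNjQ1MTc1MjE2fQ."
    "VYvqBE9sZifgZheZ7LRzB6pNdy6VhxB8tA0iUJbJ204"
)


def decode_segment(segment: str) -> dict:
    """Decode one base64url-encoded JSON segment of a JWT (no signature check)."""
    padded = segment + "=" * (-len(segment) % 4)  # JWTs strip the base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


header_b64, payload_b64, _signature_b64 = TOKEN.split(".")
header = decode_segment(header_b64)
payload = decode_segment(payload_b64)

print(header)   # algorithm and token type
print(payload)  # "path", "iat" and "exp" as listed above
```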

If the path in the token is a prefix of the dataset that the client is trying to access (e.g. in this case the client might be querying /singleband/reflectances/20180524/B04) then Terracotta should authorize the request. If the path had been just "/reflectances" then Terracotta would authorize any dataset under the reflectances root. A path of "/" means the user should be authorized to access any dataset.
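The prefix rule can be sketched in a few lines. This is a hypothetical helper, not existing Terracotta code. It assumes the endpoint name (e.g. singleband) has already been stripped from the route so that only the dataset keys remain, and it compares whole path segments rather than raw strings, so that "/reflect" does not accidentally match "/reflectances". The segment-wise comparison is an assumption of this sketch; the proposal only says "prefix".

```python
def split_path(path: str) -> list:
    """Split "/a/b" into ["a", "b"]; "/" becomes an empty list."""
    return [part for part in path.split("/") if part]


def token_covers(token_path: str, dataset_keys: list) -> bool:
    """Return True if the token path is a segment-wise prefix of the dataset keys."""
    prefix = split_path(token_path)
    return dataset_keys[:len(prefix)] == prefix


# Keys of the dataset behind /singleband/reflectances/20180524/B04
keys = ["reflectances", "20180524", "B04"]

print(token_covers("/reflectances/20180524", keys))
print(token_covers("/reflectances", keys))
print(token_covers("/", keys))
print(token_covers("/reflectances/20180525", keys))
```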

"iat" and "exp" are the usual JWT fields. Terracotta must obey the "exp" field.

Datasets that the client is not authorized to access should effectively be invisible to the client, so the authorization must apply to all endpoints that would reveal the existence of a dataset. As far as I can tell, this means every endpoint except /keys and /colormap.

Any request to access a dataset without a matching token should result in a 403 response. Any request containing an invalid token (the signature is bad, the token has expired, or the token is missing the "path" field) should result in a 401 response.
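The 401/403 rules could look roughly like the following. This is a stdlib-only sketch under several assumptions: tokens use HS256, a token without a numeric "exp" is treated as invalid, and the names `sign_hs256` and `check_request` are invented for this example. A real deployment would more likely rely on an established JWT library than on hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Issue an HS256 token (the job of the external token service)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(mac)}"


def check_request(token: str, dataset_keys: list, secret: bytes, now=None) -> int:
    """Return the HTTP status described above: 200, 401 or 403."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return 401  # bad signature
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return 401  # malformed token (wrong segment count, bad base64 or JSON)

    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return 401  # only HS256 is accepted in this sketch
    if not isinstance(payload, dict) or not isinstance(payload.get("path"), str):
        return 401  # missing "path" field
    exp = payload.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return 401  # expired (or no usable "exp", an assumption of this sketch)

    prefix = [part for part in payload["path"].split("/") if part]
    if dataset_keys[:len(prefix)] != prefix:
        return 403  # valid token, but it does not cover this dataset
    return 200
```

Note that nothing here touches a database or the network: the check is a single HMAC computation plus a list comparison.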

Obtaining the Token

How the client obtains the token is completely outside the scope of Terracotta itself. For the typical web client situation, one could imagine that Terracotta would be embedded into a larger platform, where the actual authorization process of determining if a user should be given a particular token, would take place.

The only requirement is of course that the token should be signed with the same secret that Terracotta uses to verify the signature.

Storing the Secret

The JWT signature secret could simply be stored in the TOML config file, but we could consider supporting things like AWS Secrets Manager as that could make it easier for users to sync the secret between Terracotta and the service that grants the tokens.

Configuring Authentication

The only reasonable way to configure authentication, that I can see, would be an all-or-nothing approach. So if authentication is enabled, it will be enabled for all datasets. For such a deployment, the usual behavior can be recovered by giving the client a token with "path": "/". Whether authentication is enabled or not could be determined by the presence of the secret in the config. Or if we decide to support more advanced configurations, such as AWS Secrets Manager, we should probably add an authentication subsection in the config.

Implementation

Everything suggested here is backward compatible with the current version.

No schema changes would be necessary.

There is no I/O or expensive computation involved in the authentication process, so any overhead should be negligible.

For almost all routes, the implementation itself seems trivial to me. It's basically a matter of validating the token and then comparing the "path" field to the key part of the route being called.

The only exception that I can see is /datasets. Here, the given token would imply constraints in addition to those given by the keys query arg. For example, if the "path" in the token is "/reflectances/20180524", this would imply constraints of reflectances for the type key and 20180524 for the date key, regardless of what the client has specified for the keys query arg. A conflict between the keys query arg and the token should probably result in a 403 response.
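For /datasets, the merge of token-implied constraints and query constraints might be sketched as below. The function name and the third key name ("band") are assumptions made for illustration; only the type and date keys come from the example above. A conflict raises PermissionError, which the caller would translate into a 403.

```python
def constrain_datasets_query(key_names, token_path, query):
    """Merge the key constraints implied by the token into a /datasets query.

    Raises PermissionError (to be mapped to a 403 response) on a conflict.
    """
    prefix = [part for part in token_path.split("/") if part]
    # The token pins the first len(prefix) keys, in key order.
    implied = dict(zip(key_names, prefix))
    for key, value in implied.items():
        if key in query and query[key] != value:
            raise PermissionError(
                f"query asks for {key}={query[key]!r}, token only allows {value!r}"
            )
    return {**query, **implied}


key_names = ("type", "date", "band")  # "band" is a made-up third key
constraints = constrain_datasets_query(
    key_names, "/reflectances/20180524", {"band": "B04"}
)
print(constraints)
```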

mrpgraae avatar Feb 17 '22 11:02 mrpgraae

Actually, there is a reasonable way to implement authentication for only some datasets. We could add a config variable that looks something like

secret_paths = [
  "/analytics/private",
  "/platform/users"
]

That way, access to datasets under those paths would require tokens that authorize access. Examples of tokens that should give access to these paths

/
/analytics
/analytics/private
/analytics/private/ndvi

/
/platform
/platform/users
/platform/users/1234

But of course, a token with /platform/users/4321 does not give access to /platform/users/1234. Similarly, a path like /analytics/public would not enforce any authentication at all with this configuration.
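The interaction between secret_paths and token paths could be sketched like this. The function names are hypothetical, and the comparison is again done per path segment, which is an assumption of this sketch.

```python
SECRET_PATHS = ["/analytics/private", "/platform/users"]


def _segments(path):
    return [part for part in path.split("/") if part]


def _is_prefix(prefix, full):
    return full[:len(prefix)] == prefix


def needs_token(dataset_path, secret_paths=SECRET_PATHS):
    """True if the dataset lies under one of the configured secret paths."""
    dataset = _segments(dataset_path)
    return any(_is_prefix(_segments(secret), dataset) for secret in secret_paths)


def may_access(dataset_path, token_path=None, secret_paths=SECRET_PATHS):
    """Decide access for a dataset, given the "path" of an already verified token."""
    if not needs_token(dataset_path, secret_paths):
        return True  # public dataset, no authentication enforced
    if token_path is None:
        return False  # protected dataset, but no token was presented
    return _is_prefix(_segments(token_path), _segments(dataset_path))


print(may_access("/platform/users/1234", "/platform/users"))
print(may_access("/platform/users/1234", "/platform/users/4321"))
print(may_access("/analytics/public"))
```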

mrpgraae avatar Feb 17 '22 12:02 mrpgraae

It looks a bit icky to me because this depends on key order to enforce permissions. And is this really easier than handling it from the outside? Should be simple enough to achieve the same thing.

dionhaefner avatar Feb 21 '22 11:02 dionhaefner

It looks a bit icky to me because this depends on key order to enforce permissions

But requests already depend on key order? The fact that keys have a certain order/hierarchy is already an established fact in Terracotta and a solid design choice that I don't think is likely to ever change.

And is this really easier than handling it from the outside? Should be simple enough to achieve the same thing.

Well, it's one less thing that users would have to implement themselves. Also, I think that even though some users know that they will need authentication, they might not have any idea about how they should implement it. So just the fact that there is a scheme that users can pick up and follow could be a big help here.

I wanted to suggest this scheme because I think when it comes to the general philosophy of Terracotta, it ticks all the boxes. It follows the key API, so doesn't make any assumptions about how the user has organized their data, it's lightweight and it's simple to use (and implement). So if there would be an auth scheme that we think could make sense to implement in Terracotta, I think it would be hard to find another scheme that does it better.

But I also understand, and to some extent still feel, that this is outside the scope of Terracotta itself and should just be implemented externally. If we don't want to make this a part of Terracotta, then I'll probably still implement this exact same scheme, or something very close to it, as an external solution. I will make a tutorial which we can put in the official docs and promote in the README.

mrpgraae avatar Feb 21 '22 11:02 mrpgraae

But requests already depend on key order? The fact that keys have a certain order/hierarchy is already an established fact in Terracotta and a solid design choice that I don't think is likely to ever change.

I meant that it forces users to treat the first key(s) specially, which quickly degenerates to treating the first key as user_id. In this case both the first and last key would have special meaning. It feels like, for complicated scenarios, it would be preferable to have a mapping user_id: [allowed_datasets], which wouldn't be possible in this proposal.

But I also understand, and to some extent still feel, that this is outside the scope of Terracotta itself and should just be implemented externally. If we don't want to make this a part of Terracotta, then I'll probably still implement this exact same scheme, or something very close to it, as an external solution. I will make a tutorial which we can put in the official docs and promote in the README.

That sounds like a great compromise. We also toyed around with the idea to have a plugin system, so that might be a good time to look into that.

dionhaefner avatar Feb 21 '22 12:02 dionhaefner

I meant that it forces users to treat the first key(s) specially, which quickly degenerates to treating the first key as user_id.

Yes, I see your point. In the case where you just want per-user auth it would definitely degenerate to this. But there is still some flexibility. Since keys are just strings, I could imagine a setup where you have key names like ("scope", "sensor", "date")

Then you can still have values like

1234/s2/20220221
public/s2/20220221

Here, public marks datasets that anyone can access, and an integer indicates that only that user id can access them. So while that key now only exists to facilitate authentication, it's more flexible than just degenerating to user_id. But if we feel that it's fundamentally awkward that key order and auth logic get mixed up, then yes, you are right that it's icky.

It feels like, for complicated scenarios, it would be preferable to have a mapping user_id: [allowed_datasets], which wouldn't be possible in this proposal.

Yes but that's definitely too complex to implement into Terracotta itself, IMO.

mrpgraae avatar Feb 21 '22 12:02 mrpgraae

Using the first key as some kind of scope ID sounds familiar, right @panakouris? In an application, we combined that with general authentication of requests to Terracotta (via Auth0 tokens). We did not prohibit access to other users' datasets, but the scope IDs were UUIDs, so there was at least no guessing...

However, I agree that it feels wrong to use the key structure programmatically for dataset access limitation.

Is the user -> datasets mapping really too complex to implement? It would need to be handled via the database and be optional, of course.

In any case, I like the idea of using a JWT that is validated with a secret stored in Terracotta, because that gives virtually no performance hit.

j08lue avatar Feb 26 '22 19:02 j08lue

Is the user -> datasets mapping really too complex to implement? It would need to be handled via the database and be optional, of course.

In any case, I like the idea of using a JWT that is validated with a secret stored in Terracotta, because that gives virtually no performance hit.

What I mean by too complex, is that I can't see how you can have both of these things that you mention. An arbitrary user -> dataset mapping implies a DB lookup.

I like the pure JWT solution because it removes a lot of complexity from the implementation (no DB support / schema change) and it's as low-overhead as it gets.

mrpgraae avatar Mar 07 '22 08:03 mrpgraae

But it's entirely possible to implement externally or as a plugin, so that's what we will do I think :slightly_smiling_face:

mrpgraae avatar Mar 07 '22 08:03 mrpgraae