warehouse
warehouse copied to clipboard
Add a Permission caveat to restrict Macaroons to permissions
This PR removes (mostly) the hardcoded check for the upload permission, and instead moves it into a "Permission" caveat, the semantics of which is that the permission that is being asked for must be one of the ones mentioned in that caveat.
To ensure that we don't create tokens that can expand in scope as we add new API endpoints, the caveat does not have a way to say "all" permissions or to use any sort of wild cards, each permission must be explicitly enumerated AND we require that all new macaroons include a permission caveat (though we don't make any requirements on what those permissions it limits the macaroon to are).
To support the existing Macaroons, we add a column to the Macaroon database table which records whether this particular Macaroon predates the permission caveat or not, and if it does predate it we do a manual check that the permission is upload. If we ever force everyone to roll new tokens, we could eliminate this manual check, but for now it is easy to have to stay [^1].
The biggest question left I think is how do we want to serialize permissions? Right now this PR just serializes them as their string value, which means that our internal permission strings become part of our public API and we can't refactor them without ensuring that the old permission strings still work. This also is one of the least space efficient way of serializing these, since we waste a minimum 2 bytes, and by their nature strings are longer.
We could serialize them to integers, and we just assign an integer per scope, which means that only the integer value becomes part of our public API and we can pack a lot more permissions into a smaller serialized structure.
Throwing this up here now though to get some eyes on it, overall this should be working, though I want to do some more manual testing as well (on top of the open question).
[^1]: We do this rather than detect the presence of a Permission caveat, because anyone with a token can add a Permission caveat, which means in theory they could expand the scope of a legacy token from upload only to something else, but the same is not true for a database column.
Note: I'm thinking we may want to use integers for permissions here, and an explicit registry of permission <-> integer which will also allow us to limit which permissions are even valid in a macaroon.
Ok, after thinking about this more, I'm pretty sold on the idea of serializing our permissions to integers instead of as strings, which gives us some benefits:
- Much more compact encoding of permissions into Macaroons, helping to keep the size of our Macaroons down.
- Decouples the publicly available value for a permission from our internal value, allowing us to continue to evolve them over time as well as decide which permissions are allowed to be used within a Macaroon at all [^1] .
- This also means that any library that allows end users to introspect and/or add their own caveats can choose their own meaningful names for these permissions.
This does have two downsides (one of which is the same downside we've chosen to live with for all of our caveat serialization decisions):
- The serialized structure of the caveat becomes much more difficult to understand without tools to convert the "machine" readable format into a human readable format.
- We need to add explicit support for (de)serialization of a list of permissions (or at least, a list of strings) into a list of integers or similar.
I think that tradeoff makes sense (particularly the first one, since it's the trade off we've made repeatedly for Macaroon serialization).
An interesting open question still is how exactly do we want to serialize permissions as integer(s).
The simplest option is to just map a permission string to an integer value, and when you get a list of permission strings you just turn that into a list of integers and serialize that (or reverse it for deserialization). That works, but it has some overhead, you need at least 2 bytes for the [] and then an extra byte for each additional permission in the list for the ,. For the actual values themselves, the first 10 permissions (specifically, the first 10 permissions exposed via Macaroons) can be represented with 1 byte (0-9), the next ~89 with 2 bytes (10-99), and the next ~899 with 3 bytes (100-999).
The other option, is we don't have to model our list of permissions as actually a list of permissions, but rather we could model it as a series of boolean flags, which we can compactly encode into a single integer using bitwise math, which lets us encode up to 100 individual permissions in 31 bytes of JSON (vs 291 bytes for list of integers).
Immediately we can see that treating our permissions as bitflags minimizes the maximum amount of storage space by quite a lot (list of integers has almost 10x as much space used), but the trick is that is for a "worst case" scenario where we have 100 permissions exposed and all 100 are being included on this token.
If we only want to encode a single permission, things get a lot murkier.
For the list of integers case it will consume 2 bytes for the [] and then 1-2 bytes for the actual value, so a minimum of 3 bytes and a maximum of 4 bytes, depending on whether the permission is a single digit integer or a double double integer.
For the bitflags case, it depends entirely on which bit corresponds to the permission we want to set. If we have 100 permissions, but the permission we want to set is in the first 3 bits, then we only require a single byte, on the flip side if the permission we want to set is on the 100th bit, then we will require 31 bytes, even for a single permission.
I'm leaning towards using bitflags being the right tradeoff to represent our permissions when serialized in a Macaroon caveat, for the following reasons:
- The most popular permission is almost certainly going to be upload, and it's likely that upload will also be the permission that is most likely to be the sole permission on a token. If we make sure that the upload permission is within the first 3, then our common case will be limited to 1 byte.
- It may even make sense to reserve a few early bits for "likely to be used a lot" permissions like upload.
- The best case for list of integers is when you have a large number of permissions exposed and you want your token to only be valid for 1 of them, which has a best case of 3-4 bytes. That same 3-4 bytes could represent up to 13 permissions in a bitflag, and I think that it's unlikely we end up exposing a lot of permissions via Macaroons.
- The worst case for list of integers is when you have a large number of permissions exposed and you want your token to be valid for all of them, using the 13 number from above, the worst case for list of integers would be 30 bytes (vs the 4 bytes for bitwise).
Essentially the choice between bitflags and list of integers is that by choosing bitflags we drastically limit how bad the worst case is, but we also make it more likely that we'll hit that worst case (which can be mitigated by careful planning), and I think minimizing the worst case, paired with the mitigations we can do for common scenarios, is going to be our best path forward.
There's a third option here, which is we don't actually have to pick between list of integers and bitflags, and we can support both, and if we support both we could actually have the caveat serialization choose whichever one results in a smaller serialized structure (at the expense of needing to run serialization twice or needing to program in rules for when one should be preferred over another). I think that we can punt on that for now, as we currently only expose a single permission, and it's not until we get into 14+ permissions that bitflags isn't just universally better, and we can always add it in later.
So I'm going to move forward with a bitflags based implementation for now.
[^1]: Strictly speaking, this doesn't have anything to do with the choice of integer, and everything to do with the choice of having a value for permission within the caveat that is distinct from our internal value for a permission, and having the (de)serialization handle translating between them, but if we're using string based values inside the caveat, the most logical value to use is one that matches our internal representation or else we risk making things a lot more confusing.
I've got this in a state locally where it uses a bitflags (via enum.IntFlag) and I'm pretty happy with it, other than the fact that there's a small API thing that I'm not a huge fan of.
Basically as it stands we treat creating a caveat through deserialization and through instantiating the class as the same thing, and they support the same input types. What I want to do is have Permission(permissions=["upload"]) automatically create the proper IntFlags, so that the private/public permission split is kept internal to the caveat. However the way pydantic and our (de)serialization works doing so means that technically you can use ["upload"] in the caveat instead of the integer and it works.
I'm thinking of ways around this, but the answer might be to refactor caveats away from using pydantic and to use something else. That shouldn't be a huge refactor/change and comes with it's own benefit (one of the things about pydantic is they are very ideological that "validation" means "the output shape is what you declare it to be" not "the input shape is validated" and do a lot of coercing to make it so. We turn some of that off through using StrictInt and StrictStr but fundamentally they have a lot of it tht simply can't be turned off because they aren't trying to make something that validates incoming bytes on a wire in that way I thought they were when I originally started using them.
This overall approach makes sense to me, and is much cleaner! I agree 100% with the int enum/flags rationale.
An adjacent thought: if we're continuing to put pressure on token sizes, maybe we use this opportunity to also change the envelope format itself? In particular, using either msgpack, CBOR, or similar instead of JSON would also save us some bytes here.
In particular, using either msgpack, CBOR, or similar instead of JSON would also save us some bytes here.
When I made the Caveats v2 branch I tried message pack but pymacaroons had a hard check that the serialized caveat was valid utf8 which severely limited our ability to use a more compact encoding... but I was looking the other day and it looks like maybe that is gone now?
Of course we have the problem of reliably detecting a json encoded caveat vs a msgpack encoded one, but we could just do something smarter for the future and treat the first bite as a serialization version, and if it's [ or { it's JSON and if it's some other constant we dispatch to msgpack or whatever. I'm totally down to do all of that, but I think the question of swapping from JSON to something else is probably best to happen in a dedicated issue.
Nevermind, it's still here: https://github.com/ecordell/pymacaroons/blob/master/pymacaroons/caveat_delegates/first_party.py#L23-L26