garnet
garnet copied to clipboard
More Complete ACL Implementation
This sketches out a more proper ACL implementation - one with categories beyond @admin, and where individual commands could be ACL'd (current implementation does not do that, however).
Basic idea is to use RespCommandsInfo as a mechanism to discover command ACLs, and attach a bitfield to each User to track individual commands rather than categories.
Outstanding TODOs:
- [x] Test rest of commands w.r.t. ACL categories
- [x] Implement individual command ACL'ing (ie.
+get) - [ ] Implement sub-command ACL'ing (ie.
-config|get) - [x] Figure out what to do with unlisted/internal/cluster commands
- [x] Figure out what to do with custom extensions
- Introduced a @garnet ACL category that all commands not in Redis are now tagged with
- [x] (Probably) Remove
CommandCategory- it is redundant - [x] Clean up command parsing, now that subcommand bytes are a "later in the pipeline'-thing
- As part of this, probably DRY up ACL checking - it's a bit error prone to add a new command with this code so "wet"
SET(with EX, NX, etc. but notSETEX) are still special cased as they weren't realized as subCmds
- [ ] Cluster subcommands are weird, but need ACL testing too
... and a bunch of little ones in the code as comments.
This is explicitly not going to implement keyspace patterns. I feel like that should be a separate enhancement, as it's pretty substantial on it's own.
The biggest incidental change here is a substantial refactoring of RespCommand and the various switches over it. This is, IMO, the cleanest way to get a canonical value to test against ACLs with. As a consequence, RespCommand has a bunch more entries as most things that were previously "sub commands" (but not in the RESP sense) are now actual enum values.
There's still some more to do here (more testing, subcommands, some cleanup) , but I'm going on vacation for a week and this is pretty dang close to how it's going to actually look when done. So taking it off draft, and marking ready for review (if not merging).
/cc @mgravell @NickCraver - this is relevant to your interests.
I think this PR has too many changes to vet, maybe we need to break it into smaller PRs. I think the first is good but very hard to review because the surface area is huge. I think the idea of having a bitmap to match command permissions is good. There might I couple alternatives to tradeoff between CPU cycles or memory. We need to separate general fixes from clear ACL enchantments and avoid pushing parsing inside the execution methods.
Yeah, it's gigantic - ACLs touch literally everything. Excluding tests, it is trending smaller as things DRY up.
A bunch of random-ish fixes are somewhat unfortunate, as a big part of testing this is making sure that failed ACLs don't bork the command stream - and I encountered a number of cases of fairly "brittle" command implementations. I'm open to breaking those out into separate PRs once ACLs are in a good place, but I needed some confidence that the ACL design would work for all commands upfront.
Great work overall. Shaping up really nicely, thanks!
@lmaas @vazois @badrishc This is now in a complete state from my perspective, and ready for full review.
Benchmark (Release mode): Embedded.perftest --op-workload PING --op-percent 100 -t 1
Framework: net8.0
Branch: main (on Kevin's fork)
Running with 1 threads... Avg. throughput: 135719.46 Kops/sec; 135719.46 Kops/sec/thread (Max: 135719.46, Min: 135719.46)
Branch: aclImprovements (on Kevin's fork)
Running with 1 threads... Avg. throughput: 62371.18 Kops/sec; 62371.18 Kops/sec/thread (Max: 62371.18, Min: 62371.18)
Summary
Looks like we are at around 54% performance regression at the moment. Need to drill deeper into why.
Edit: looks like there is too much variance across runs so take above with a grain of salt.
Benchmark (Release mode):
Embedded.perftest --op-workload PING --op-percent 100 -t 1Framework: net8.0
Branch: main (on Kevin's fork)
Running with 1 threads... Avg. throughput: 135719.46 Kops/sec; 135719.46 Kops/sec/thread (Max: 135719.46, Min: 135719.46)
Branch: aclImprovements (on Kevin's fork)
I reproduce this regression on my latest too, I'll spend some time digging (now that all feedback is addressed) and update when I have an explanation/fix.
I reproduce this regression on my latest too, I'll spend some time digging (now that all feedback is addressed) and update when I have an explanation/fix.
- Is it possible something is inadvertently getting boxed?
- I made this into a BDN testcase here: https://github.com/microsoft/garnet/compare/main...badrishc/resp-parse-stress - here are the results comparing main to this PR:
main:
| Method | Job | EnvironmentVariables | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | 4.292 us | 0.0446 us | 0.0395 us |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 2.449 us | 0.0052 us | 0.0040 us |
PR:
| Method | Job | EnvironmentVariables | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | 4.937 us | 0.0462 us | 0.0409 us |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 3.958 us | 0.0443 us | 0.0393 us |
Is it possible something is inadvertently getting boxed? I made this into a BDN testcase here: https://github.com/microsoft/garnet/compare/main...badrishc/resp-parse-stress - here are the results comparing main to this PR:
Doesn't appear to be allocations based on local profiling. Might be useful to add [MemoryDiagnoser] to that benchmark, which I haven't had a chance to poke at yet.
Doing some quick and dirty benchmarking with the Embedded.perftest example, it looks like it's mostly the cost of doing the CheckACLPermissions on every request now.
On my box, .NET 8 with PGO disabled I'm getting about...
main~62KPINGs / s- this branch ~43K
PINGs / s - this branch, but remove the call to
if (...)with theCheckACLPermissionscall ~58KPINGs / s
Total loss ~30%, and a solid ~25% (~15K PINGs / s) is lost in checking ACLs.
I'll noodle on speeding this up in the morning.
Doesn't appear to be allocations based on local profiling. Might be useful to add
[MemoryDiagnoser]to that benchmark, which I haven't had a chance to poke at yet.
I have updated the branch with [MemoryDiagnoser] and added Get/Set in addition to Ping, for completeness. PR for this is here: https://github.com/microsoft/garnet/pull/446
Mkay, because I can't leave well enough alone despite it being after 11 PM here...
One relatively simple approach would be to slap a int[] (trading space for time) on User that's used to cache the last ACL check indexed by each command (skipping any ACL checks on a 1, only transitioning from 0 to 1 on a successful ACL check). We just allocate a new int[] and swap it out whenever the user's ACL is modified. A really terrible hacked version of this gets in the ~59K range again locally, but that's with no testing so huge grain of salt.
I need to think on if there's any point in keeping the bit-packed version around; if not, we could just initialize the int[] with the "truth" and then there's not even an update.
Another, much less simple, approach would be to lean in to RespCommand being expanded and change the various ProcessXXXCommands methods into a single lookup into an array of function pointers / delegates. The array starts initialized with some function that does the ACL check and replaces the appropriate entry with a pointer to the corresponding NetworkXXX(...) method for the command on successful ACL. Subsequent invocations would jump straight to the command's actual implementation. On any user permission change, we reset the array.
That might ultimately be a bit cleaner, but it'd require a bunch of additional changes on top of this already gigantic change. There's also an indirection in there which makes me not so confident the perf would be any better.
So I'll keep pursuing the first approach. Of course, if there are any other clever ideas, I'd be happy to hear them.
Mkay, because I can't leave well enough alone despite it being after 11 PM here...
One relatively simple approach would be to slap a
int[](trading space for time) on User that's used to cache the last ACL check indexed by each command (skipping any ACL checks on a 1, only transitioning from 0 to 1 on a successful ACL check). We just allocate a newint[]and swap it out whenever the user's ACL is modified. A really terrible hacked version of this gets in the ~59K range again locally, but that's with no testing so huge grain of salt.I need to think on if there's any point in keeping the bit-packed version around; if not, we could just initialize the int[] with the "truth" and then there's not even an update.
Another, much less simple, approach would be to lean in to
RespCommandbeing expanded and change the variousProcessXXXCommandsmethods into a single lookup into an array of function pointers / delegates. The array starts initialized with some function that does the ACL check and replaces the appropriate entry with a pointer to the correspondingNetworkXXX(...)method for the command on successful ACL. Subsequent invocations would jump straight to the command's actual implementation. On any user permission change, we reset the array.That might ultimately be a bit cleaner, but it'd require a bunch of additional changes on top of this already gigantic change. There's also an indirection in there which makes me not so confident the perf would be any better.
So I'll keep pursuing the first approach. Of course, if there are any other clever ideas, I'd be happy to hear them.
I'm not sure we need to do this. The bitmap is nice and tight. There might be a simpler way to get back the performance. Can look into it tomorrow.
@badrishc some exploratory work in my latest commit clawed some of the perf back, down to a ~10% loss (62K vs 56K PINGs / sec on my local).
This doesn't introduce any caching or anything, it's a revision on the existing approach. Most "interesting" thing is it introduces an unauthorized user and checks against that instead of first checking IsAuthenticated on an interface - we're still covered by another call to IsAuthenticated later in the pipeline.
Going off my previous math, there's still maybe 2K loss to the ACL check, and another 4K to "misc other changes".
Pushed up one more exploratory commit that remembers CanAuthenticate so we can avoid an interface method call on each command.
Switching to BDN now that it's in main.
main (as of fcf880a0f5ef27384768dca43d89144fa30ae5fb)
| Method | Job | EnvironmentVariables | Runtime | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | 2.249 us | 0.0338 us | 0.0316 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | 23.161 us | 0.2178 us | 0.1931 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | 13.875 us | 0.0973 us | 0.0910 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 1.983 us | 0.0138 us | 0.0122 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 18.985 us | 0.0977 us | 0.0914 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 12.280 us | 0.1564 us | 0.1463 us | - |
this (as of 18d7d133e08b60726972db697f91c6ae761aee49), now with a UseACLs option
| Method | Job | EnvironmentVariables | Runtime | UseACLs | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | False | 2.378 us | 0.0232 us | 0.0217 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | False | 23.721 us | 0.2094 us | 0.1959 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | False | 14.102 us | 0.0550 us | 0.0514 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 2.033 us | 0.0154 us | 0.0144 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 18.883 us | 0.1771 us | 0.1656 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 11.194 us | 0.0779 us | 0.0728 us | - |
| InlinePing | .NET 6 | Empty | .NET 6.0 | True | 2.493 us | 0.0105 us | 0.0098 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | True | 23.715 us | 0.2381 us | 0.1989 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | True | 14.485 us | 0.0909 us | 0.0806 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 2.306 us | 0.0254 us | 0.0237 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 19.279 us | 0.1247 us | 0.1106 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 12.000 us | 0.0741 us | 0.0693 us | - |
full logs here: https://gist.github.com/kevin-montrose/8145ddfffa201735b50c78e7edf8f39f
Which is awful close to a wash when ACLs are disable, and maybe a 0.5us loss when enabled.
Still need to look at Badrish's proposal, so that's up next.
Still need to look at Badrish's proposal, so that's up next.
Adding a link to this thread for reference - https://github.com/microsoft/garnet/commit/f20925a381c47050ac02ce0b75504852d70707a4
exploratory commit that remembers CanAuthenticate
I don't think this will work with authenticators that might time out, e.g., the Aad authenticator.
exploratory commit that remembers CanAuthenticate
I don't think this will work with authenticators that might time out, e.g., the Aad authenticator.
IsAuthenticated is used for timeouts, CanAuthenticate (which I cached) is a constant for all current implementations so I think it'd be safe.
That said, I think your proposal here is cleaner - getting a perf run together then I'll update.
I've pushed up a version that merges some of my tweaks and @badrishc's proposal.
The CanAuthenticate cache is gone, but other changes remain: _user will never be null (so we can elide some other checks), NoAuth commands aren't special cased anymore (they're set in the bitmap like everything else), some less likely paths are NoInlining]'d, and some hot paths are AggressiveInlining'd.
Latest BDN (as of 1ed1635f17efef7100a733c498257f31a5abbd6d)
| Method | Job | EnvironmentVariables | Runtime | UseACLs | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | False | 2.229 us | 0.0194 us | 0.0172 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | False | 23.134 us | 0.1162 us | 0.1030 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | False | 13.650 us | 0.0628 us | 0.0557 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 1.890 us | 0.0082 us | 0.0077 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 18.742 us | 0.1390 us | 0.1160 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 11.005 us | 0.0559 us | 0.0495 us | - |
| InlinePing | .NET 6 | Empty | .NET 6.0 | True | 2.502 us | 0.0201 us | 0.0178 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | True | 23.270 us | 0.0814 us | 0.0722 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | True | 14.275 us | 0.0563 us | 0.0527 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 1.913 us | 0.0154 us | 0.0144 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 18.779 us | 0.1348 us | 0.1126 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 11.582 us | 0.0520 us | 0.0486 us | - |
Full log: https://gist.github.com/kevin-montrose/a1a7f9ef472c68fbb2b3b61c1c3f190d
Suggests a w/o ACL improvement, and about the same for perf with ACLs.
I did a sanity check with Embedded.perftest again, and while I trust it less than BDN numbers it also shows a gain of ~4K PINGs / sec (for about 6% gain, upt o ~66K total PINGs / sec; .NET 8, PGO disabled) in the no-ACL case.
Idea to avoid the interface call to authenticator in common case (not necessary for this PR):
https://github.com/microsoft/garnet/commit/17402f340806980a378682b807f1e4bb142268da
Latest perf numbers
BDN
main (0acff38b7f343cf26c58a20995b2675f8f42b634)
| Method | Job | EnvironmentVariables | Runtime | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | 2.246 us | 0.0275 us | 0.0244 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | 23.058 us | 0.1166 us | 0.1090 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | 13.873 us | 0.0944 us | 0.0837 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 1.986 us | 0.0089 us | 0.0084 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 19.673 us | 0.1043 us | 0.0975 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | 11.025 us | 0.0482 us | 0.0377 us | - |
aclImprovements (f5e16848e96aa152583e9410a750b1e0f31cf4d9)
| Method | Job | EnvironmentVariables | Runtime | UseACLs | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|---|---|---|---|
| InlinePing | .NET 6 | Empty | .NET 6.0 | False | 2.323 us | 0.0440 us | 0.0646 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | False | 23.092 us | 0.1001 us | 0.0887 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | False | 13.588 us | 0.0956 us | 0.0848 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 2.184 us | 0.0163 us | 0.0153 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 18.372 us | 0.1050 us | 0.0931 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | False | 11.196 us | 0.0495 us | 0.0439 us | - |
| InlinePing | .NET 6 | Empty | .NET 6.0 | True | 2.586 us | 0.0142 us | 0.0133 us | - |
| Set | .NET 6 | Empty | .NET 6.0 | True | 23.417 us | 0.0952 us | 0.0891 us | - |
| Get | .NET 6 | Empty | .NET 6.0 | True | 14.442 us | 0.0901 us | 0.0843 us | - |
| InlinePing | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 2.102 us | 0.0268 us | 0.0251 us | - |
| Set | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 19.076 us | 0.1852 us | 0.1641 us | - |
| Get | .NET 8 | DOTNET_TieredPGO=0 | .NET 8.0 | True | 11.330 us | 0.0473 us | 0.0419 us | - |
Which looks like around 0.2us loss.
I did grab Embedded.perftest numbers again, but they are wildly variable for me latest main - swinging by up to 30K PINGs / sec so I really don't trust them.
Idea to avoid the interface call to authenticator in common case (not necessary for this PR):
This is interesting (as is punching the authenticator into RespServerSession as a generic argument, to de-virtualize everything) but I want push back a little bit on the idea that NoAuth is the common case.
While it certainly isn't unusual to deploy Redis (and thus I'd assume Garnet) with NoAuth and rely on network isolation for security, there are big cases where that's not true. Top of my mind is if Garnet is ever shipped as a service itself, or in any environment where defense in depth is a big thing.
Thinking out loud a bit, it seems to me a remaining perf killer is that we have to call into the IGarnetAuthenticator on every command to validate our User is still good - a poll model, if you will. Given that the common case is that the User is still good (because invalidation is rare, AD/Kerberos sessions get measured in hours, password rotations happen on the order of months, etc. etc.) it feels a push model is more appropriate.
One way to IGarnetAuthenticator into a push model, shooting from the hip:
public interface IGarnetAuthenticator
{
/// <summary>
/// Can authenticator authenticate
/// </summary>
bool CanAuthenticate { get; }
/// <summary>
/// Whether this authenticator can be used with the ACL
/// </summary>
bool HasACLSupport { get; }
/// <summary>
/// Authenticate the incoming username and password from AUTH command. Username is optional
/// </summary>
bool TryAuthenticate(ReadOnlySpan<byte> password, ReadOnlySpan<byte> username, out AuthenticationLifetime authLifetime);
}
public sealed class AuthenticationLifetime
{
// for Authenticators that support auth but not invalidation, or we use null as a special case
public static AuthenticationLifetime Infinite = ... ;
// we call this in RespServerSession, it get devirtualized into a simple bool check
public bool IsValid { get; }
// IGarnetAuthenticators call this when a previously auth'd user (from a call to IGarnetAuthenticator.TryAuthenticate) is de-auth'd
public void Invalidate() { ... }
}
Our hotpath logic in RespServerSession then ends up looking like:
// note _authLifetime gets initialized with some "already invalidated" value so we wouldn't need a null check
(!_authLifetime.IsValid || !_user.CanAccessCommand(cmd)) && !cmd.IsNoAuth()
which is virtual-call-less.
I'm not proposing to do this as part of this PR.
To be clear, the IGarnetAuthenticator implementation would have to remember the AuthenticationLifetimes it hands back to invalidate them - but I think that's fine, it already has to remember stuff about the user to implement IsAuthenticated.
This would keep the User and authentication separation we have, get the IGarnetAuthenticator interface out of the hotpath, and I think cover all reasonable schemes.
Just noting that in real-world workload, the .NET 8 DynamicPGO (which is disabled in our BDN microbenchmarks to reduce noise) would likely specialize the virtualized calls for common types here and we shouldn't have to do it ourselves. Whether it kicks in here is a bit harder to verify but something to keep in mind.
Thinking out loud a bit, it seems to me a remaining perf killer is that we have to call into the
IGarnetAuthenticatoron every command to validate ourUseris still good - a poll model, if you will. Given that the common case is that theUseris still good (because invalidation is rare, AD/Kerberos sessions get measured in hours, password rotations happen on the order of months, etc. etc.) it feels a push model is more appropriate.
Yeah, it makes sense to consider a push-based approach to update the auth status in the rare case of it being revoked instead of having to poll the provider every single time. cc @yangmsft whose team uses the auth feature. (definitely out of scope for this PR)