This sketches out a more proper ACL implementation - one with categories beyond @admin, and where individual commands could be ACL'd (current implementation does not do that, however).

Basic idea is to use RespCommandsInfo as a mechanism to discover command ACLs, and attach a bitfield to each User to track individual commands rather than categories.

Outstanding TODOs:

[x] Test rest of commands w.r.t. ACL categories
[x] Implement individual command ACL'ing (ie. +get)
[ ] Implement sub-command ACL'ing (ie. -config|get)
[x] Figure out what to do with unlisted/internal/cluster commands
[x] Figure out what to do with custom extensions
- Introduced a @garnet ACL category that all commands not in Redis are now tagged with
[x] (Probably) Remove CommandCategory - it is redundant
[x] Clean up command parsing, now that subcommand bytes are a "later in the pipeline'-thing
- As part of this, probably DRY up ACL checking - it's a bit error prone to add a new command with this code so "wet"
- SET (with EX, NX, etc. but not SETEX) are still special cased as they weren't realized as subCmds
[ ] Cluster subcommands are weird, but need ACL testing too

... and a bunch of little ones in the code as comments.

This is explicitly not going to implement keyspace patterns. I feel like that should be a separate enhancement, as it's pretty substantial on it's own.

The biggest incidental change here is a substantial refactoring of RespCommand and the various switches over it. This is, IMO, the cleanest way to get a canonical value to test against ACLs with. As a consequence, RespCommand has a bunch more entries as most things that were previously "sub commands" (but not in the RESP sense) are now actual enum values.

May 15 '24 22:05 kevin-montrose

There's still some more to do here (more testing, subcommands, some cleanup) , but I'm going on vacation for a week and this is pretty dang close to how it's going to actually look when done. So taking it off draft, and marking ready for review (if not merging).

/cc @mgravell @NickCraver - this is relevant to your interests.

May 21 '24 16:05 kevin-montrose

I think this PR has too many changes to vet, maybe we need to break it into smaller PRs. I think the first is good but very hard to review because the surface area is huge. I think the idea of having a bitmap to match command permissions is good. There might I couple alternatives to tradeoff between CPU cycles or memory. We need to separate general fixes from clear ACL enchantments and avoid pushing parsing inside the execution methods.

Yeah, it's gigantic - ACLs touch literally everything. Excluding tests, it is trending smaller as things DRY up.

A bunch of random-ish fixes are somewhat unfortunate, as a big part of testing this is making sure that failed ACLs don't bork the command stream - and I encountered a number of cases of fairly "brittle" command implementations. I'm open to breaking those out into separate PRs once ACLs are in a good place, but I needed some confidence that the ACL design would work for all commands upfront.

May 28 '24 17:05 kevin-montrose

Great work overall. Shaping up really nicely, thanks!

May 29 '24 19:05 badrishc

@lmaas @vazois @badrishc This is now in a complete state from my perspective, and ready for full review.

Jun 04 '24 19:06 kevin-montrose

Benchmark (Release mode): Embedded.perftest --op-workload PING --op-percent 100 -t 1

Framework: net8.0

Branch: main (on Kevin's fork)

Running with 1 threads... Avg. throughput: 135719.46 Kops/sec; 135719.46 Kops/sec/thread (Max: 135719.46, Min: 135719.46)

Branch: aclImprovements (on Kevin's fork)

Running with 1 threads... Avg. throughput: 62371.18 Kops/sec; 62371.18 Kops/sec/thread (Max: 62371.18, Min: 62371.18)

Summary

Looks like we are at around 54% performance regression at the moment. Need to drill deeper into why.

Edit: looks like there is too much variance across runs so take above with a grain of salt.

Jun 05 '24 00:06 badrishc

Benchmark (Release mode): Embedded.perftest --op-workload PING --op-percent 100 -t 1

Framework: net8.0

Branch: main (on Kevin's fork)

Running with 1 threads... Avg. throughput: 135719.46 Kops/sec; 135719.46 Kops/sec/thread (Max: 135719.46, Min: 135719.46)

Branch: aclImprovements (on Kevin's fork)

I reproduce this regression on my latest too, I'll spend some time digging (now that all feedback is addressed) and update when I have an explanation/fix.

Jun 05 '24 21:06 kevin-montrose

I reproduce this regression on my latest too, I'll spend some time digging (now that all feedback is addressed) and update when I have an explanation/fix.

Is it possible something is inadvertently getting boxed?
I made this into a BDN testcase here: https://github.com/microsoft/garnet/compare/main...badrishc/resp-parse-stress - here are the results comparing main to this PR:

main:

Method	Job	EnvironmentVariables	Runtime	Mean	Error	StdDev
InlinePing	.NET 6	Empty	.NET 6.0	4.292 us	0.0446 us	0.0395 us
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	2.449 us	0.0052 us	0.0040 us

PR:

Method	Job	EnvironmentVariables	Runtime	Mean	Error	StdDev
InlinePing	.NET 6	Empty	.NET 6.0	4.937 us	0.0462 us	0.0409 us
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	3.958 us	0.0443 us	0.0393 us

Jun 05 '24 23:06 badrishc

Is it possible something is inadvertently getting boxed? I made this into a BDN testcase here: https://github.com/microsoft/garnet/compare/main...badrishc/resp-parse-stress - here are the results comparing main to this PR:

Doesn't appear to be allocations based on local profiling. Might be useful to add [MemoryDiagnoser] to that benchmark, which I haven't had a chance to poke at yet.

Doing some quick and dirty benchmarking with the Embedded.perftest example, it looks like it's mostly the cost of doing the CheckACLPermissions on every request now.

On my box, .NET 8 with PGO disabled I'm getting about...

main ~62K PINGs / s
this branch ~43K PINGs / s
this branch, but remove the call to if (...) with the CheckACLPermissions call ~58K PINGs / s

Total loss ~30%, and a solid ~25% (~15K PINGs / s) is lost in checking ACLs.

I'll noodle on speeding this up in the morning.

Jun 06 '24 02:06 kevin-montrose

Doesn't appear to be allocations based on local profiling. Might be useful to add [MemoryDiagnoser] to that benchmark, which I haven't had a chance to poke at yet.

I have updated the branch with [MemoryDiagnoser] and added Get/Set in addition to Ping, for completeness. PR for this is here: https://github.com/microsoft/garnet/pull/446

Jun 06 '24 02:06 badrishc

Mkay, because I can't leave well enough alone despite it being after 11 PM here...

One relatively simple approach would be to slap a int[] (trading space for time) on User that's used to cache the last ACL check indexed by each command (skipping any ACL checks on a 1, only transitioning from 0 to 1 on a successful ACL check). We just allocate a new int[] and swap it out whenever the user's ACL is modified. A really terrible hacked version of this gets in the ~59K range again locally, but that's with no testing so huge grain of salt.

I need to think on if there's any point in keeping the bit-packed version around; if not, we could just initialize the int[] with the "truth" and then there's not even an update.

Another, much less simple, approach would be to lean in to RespCommand being expanded and change the various ProcessXXXCommands methods into a single lookup into an array of function pointers / delegates. The array starts initialized with some function that does the ACL check and replaces the appropriate entry with a pointer to the corresponding NetworkXXX(...) method for the command on successful ACL. Subsequent invocations would jump straight to the command's actual implementation. On any user permission change, we reset the array.

That might ultimately be a bit cleaner, but it'd require a bunch of additional changes on top of this already gigantic change. There's also an indirection in there which makes me not so confident the perf would be any better.

So I'll keep pursuing the first approach. Of course, if there are any other clever ideas, I'd be happy to hear them.

Jun 06 '24 03:06 kevin-montrose

Mkay, because I can't leave well enough alone despite it being after 11 PM here...

One relatively simple approach would be to slap a int[] (trading space for time) on User that's used to cache the last ACL check indexed by each command (skipping any ACL checks on a 1, only transitioning from 0 to 1 on a successful ACL check). We just allocate a new int[] and swap it out whenever the user's ACL is modified. A really terrible hacked version of this gets in the ~59K range again locally, but that's with no testing so huge grain of salt.

I need to think on if there's any point in keeping the bit-packed version around; if not, we could just initialize the int[] with the "truth" and then there's not even an update.

Another, much less simple, approach would be to lean in to RespCommand being expanded and change the various ProcessXXXCommands methods into a single lookup into an array of function pointers / delegates. The array starts initialized with some function that does the ACL check and replaces the appropriate entry with a pointer to the corresponding NetworkXXX(...) method for the command on successful ACL. Subsequent invocations would jump straight to the command's actual implementation. On any user permission change, we reset the array.

That might ultimately be a bit cleaner, but it'd require a bunch of additional changes on top of this already gigantic change. There's also an indirection in there which makes me not so confident the perf would be any better.

So I'll keep pursuing the first approach. Of course, if there are any other clever ideas, I'd be happy to hear them.

I'm not sure we need to do this. The bitmap is nice and tight. There might be a simpler way to get back the performance. Can look into it tomorrow.

Jun 06 '24 05:06 badrishc

@badrishc some exploratory work in my latest commit clawed some of the perf back, down to a ~10% loss (62K vs 56K PINGs / sec on my local).

This doesn't introduce any caching or anything, it's a revision on the existing approach. Most "interesting" thing is it introduces an unauthorized user and checks against that instead of first checking IsAuthenticated on an interface - we're still covered by another call to IsAuthenticated later in the pipeline.

Going off my previous math, there's still maybe 2K loss to the ACL check, and another 4K to "misc other changes".

Jun 06 '24 16:06 kevin-montrose

Pushed up one more exploratory commit that remembers CanAuthenticate so we can avoid an interface method call on each command.

Switching to BDN now that it's in main.

main (as of fcf880a0f5ef27384768dca43d89144fa30ae5fb)

Method	Job	EnvironmentVariables	Runtime	Mean	Error	StdDev	Allocated
InlinePing	.NET 6	Empty	.NET 6.0	2.249 us	0.0338 us	0.0316 us	-
Set	.NET 6	Empty	.NET 6.0	23.161 us	0.2178 us	0.1931 us	-
Get	.NET 6	Empty	.NET 6.0	13.875 us	0.0973 us	0.0910 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	1.983 us	0.0138 us	0.0122 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	18.985 us	0.0977 us	0.0914 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	12.280 us	0.1564 us	0.1463 us	-

this (as of 18d7d133e08b60726972db697f91c6ae761aee49), now with a UseACLs option

Method	Job	EnvironmentVariables	Runtime	UseACLs	Mean	Error	StdDev	Allocated
InlinePing	.NET 6	Empty	.NET 6.0	False	2.378 us	0.0232 us	0.0217 us	-
Set	.NET 6	Empty	.NET 6.0	False	23.721 us	0.2094 us	0.1959 us	-
Get	.NET 6	Empty	.NET 6.0	False	14.102 us	0.0550 us	0.0514 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	2.033 us	0.0154 us	0.0144 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	18.883 us	0.1771 us	0.1656 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	11.194 us	0.0779 us	0.0728 us	-
InlinePing	.NET 6	Empty	.NET 6.0	True	2.493 us	0.0105 us	0.0098 us	-
Set	.NET 6	Empty	.NET 6.0	True	23.715 us	0.2381 us	0.1989 us	-
Get	.NET 6	Empty	.NET 6.0	True	14.485 us	0.0909 us	0.0806 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	2.306 us	0.0254 us	0.0237 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	19.279 us	0.1247 us	0.1106 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	12.000 us	0.0741 us	0.0693 us	-

full logs here: https://gist.github.com/kevin-montrose/8145ddfffa201735b50c78e7edf8f39f

Which is awful close to a wash when ACLs are disable, and maybe a 0.5us loss when enabled.

Still need to look at Badrish's proposal, so that's up next.

Jun 06 '24 19:06 kevin-montrose

Still need to look at Badrish's proposal, so that's up next.

Adding a link to this thread for reference - https://github.com/microsoft/garnet/commit/f20925a381c47050ac02ce0b75504852d70707a4

Jun 06 '24 19:06 badrishc

exploratory commit that remembers CanAuthenticate

I don't think this will work with authenticators that might time out, e.g., the Aad authenticator.

Jun 06 '24 19:06 badrishc

exploratory commit that remembers CanAuthenticate

I don't think this will work with authenticators that might time out, e.g., the Aad authenticator.

IsAuthenticated is used for timeouts, CanAuthenticate (which I cached) is a constant for all current implementations so I think it'd be safe.

That said, I think your proposal here is cleaner - getting a perf run together then I'll update.

Jun 06 '24 19:06 kevin-montrose

I've pushed up a version that merges some of my tweaks and @badrishc's proposal.

The CanAuthenticate cache is gone, but other changes remain: _user will never be null (so we can elide some other checks), NoAuth commands aren't special cased anymore (they're set in the bitmap like everything else), some less likely paths are NoInlining]'d, and some hot paths are AggressiveInlining'd.

Latest BDN (as of 1ed1635f17efef7100a733c498257f31a5abbd6d)

Method	Job	EnvironmentVariables	Runtime	UseACLs	Mean	Error	StdDev	Allocated
InlinePing	.NET 6	Empty	.NET 6.0	False	2.229 us	0.0194 us	0.0172 us	-
Set	.NET 6	Empty	.NET 6.0	False	23.134 us	0.1162 us	0.1030 us	-
Get	.NET 6	Empty	.NET 6.0	False	13.650 us	0.0628 us	0.0557 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	1.890 us	0.0082 us	0.0077 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	18.742 us	0.1390 us	0.1160 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	11.005 us	0.0559 us	0.0495 us	-
InlinePing	.NET 6	Empty	.NET 6.0	True	2.502 us	0.0201 us	0.0178 us	-
Set	.NET 6	Empty	.NET 6.0	True	23.270 us	0.0814 us	0.0722 us	-
Get	.NET 6	Empty	.NET 6.0	True	14.275 us	0.0563 us	0.0527 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	1.913 us	0.0154 us	0.0144 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	18.779 us	0.1348 us	0.1126 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	11.582 us	0.0520 us	0.0486 us	-

Full log: https://gist.github.com/kevin-montrose/a1a7f9ef472c68fbb2b3b61c1c3f190d

Suggests a w/o ACL improvement, and about the same for perf with ACLs.

I did a sanity check with Embedded.perftest again, and while I trust it less than BDN numbers it also shows a gain of ~4K PINGs / sec (for about 6% gain, upt o ~66K total PINGs / sec; .NET 8, PGO disabled) in the no-ACL case.

Jun 06 '24 20:06 kevin-montrose

Idea to avoid the interface call to authenticator in common case (not necessary for this PR):

https://github.com/microsoft/garnet/commit/17402f340806980a378682b807f1e4bb142268da

Jun 07 '24 01:06 badrishc

Latest perf numbers

BDN

main (0acff38b7f343cf26c58a20995b2675f8f42b634)

Method	Job	EnvironmentVariables	Runtime	Mean	Error	StdDev	Allocated
InlinePing	.NET 6	Empty	.NET 6.0	2.246 us	0.0275 us	0.0244 us	-
Set	.NET 6	Empty	.NET 6.0	23.058 us	0.1166 us	0.1090 us	-
Get	.NET 6	Empty	.NET 6.0	13.873 us	0.0944 us	0.0837 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	1.986 us	0.0089 us	0.0084 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	19.673 us	0.1043 us	0.0975 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	11.025 us	0.0482 us	0.0377 us	-

aclImprovements (f5e16848e96aa152583e9410a750b1e0f31cf4d9)

Method	Job	EnvironmentVariables	Runtime	UseACLs	Mean	Error	StdDev	Allocated
InlinePing	.NET 6	Empty	.NET 6.0	False	2.323 us	0.0440 us	0.0646 us	-
Set	.NET 6	Empty	.NET 6.0	False	23.092 us	0.1001 us	0.0887 us	-
Get	.NET 6	Empty	.NET 6.0	False	13.588 us	0.0956 us	0.0848 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	2.184 us	0.0163 us	0.0153 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	18.372 us	0.1050 us	0.0931 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	False	11.196 us	0.0495 us	0.0439 us	-
InlinePing	.NET 6	Empty	.NET 6.0	True	2.586 us	0.0142 us	0.0133 us	-
Set	.NET 6	Empty	.NET 6.0	True	23.417 us	0.0952 us	0.0891 us	-
Get	.NET 6	Empty	.NET 6.0	True	14.442 us	0.0901 us	0.0843 us	-
InlinePing	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	2.102 us	0.0268 us	0.0251 us	-
Set	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	19.076 us	0.1852 us	0.1641 us	-
Get	.NET 8	DOTNET_TieredPGO=0	.NET 8.0	True	11.330 us	0.0473 us	0.0419 us	-

Which looks like around 0.2us loss.

I did grab Embedded.perftest numbers again, but they are wildly variable for me latest main - swinging by up to 30K PINGs / sec so I really don't trust them.

Jun 07 '24 19:06 kevin-montrose

Idea to avoid the interface call to authenticator in common case (not necessary for this PR):

17402f3

This is interesting (as is punching the authenticator into RespServerSession as a generic argument, to de-virtualize everything) but I want push back a little bit on the idea that NoAuth is the common case.

While it certainly isn't unusual to deploy Redis (and thus I'd assume Garnet) with NoAuth and rely on network isolation for security, there are big cases where that's not true. Top of my mind is if Garnet is ever shipped as a service itself, or in any environment where defense in depth is a big thing.

Thinking out loud a bit, it seems to me a remaining perf killer is that we have to call into the IGarnetAuthenticator on every command to validate our User is still good - a poll model, if you will. Given that the common case is that the User is still good (because invalidation is rare, AD/Kerberos sessions get measured in hours, password rotations happen on the order of months, etc. etc.) it feels a push model is more appropriate.

One way to IGarnetAuthenticator into a push model, shooting from the hip:

public interface IGarnetAuthenticator
{
    /// <summary>
    /// Can authenticator authenticate
    /// </summary>
    bool CanAuthenticate { get; }

    /// <summary>
    /// Whether this authenticator can be used with the ACL
    /// </summary>
    bool HasACLSupport { get; }

    /// <summary>
    /// Authenticate the incoming username and password from AUTH command. Username is optional
    /// </summary>
    bool TryAuthenticate(ReadOnlySpan<byte> password, ReadOnlySpan<byte> username, out AuthenticationLifetime authLifetime);
}

public sealed class AuthenticationLifetime
{
    // for Authenticators that support auth but not invalidation, or we use null as a special case
    public static AuthenticationLifetime Infinite = ... ;

    // we call this in RespServerSession, it get devirtualized into a simple bool check
    public bool IsValid { get; }

    // IGarnetAuthenticators call this when a previously auth'd user (from a call to IGarnetAuthenticator.TryAuthenticate) is de-auth'd
    public void Invalidate() { ... }
}

Our hotpath logic in RespServerSession then ends up looking like:

// note _authLifetime gets initialized with some "already invalidated" value so we wouldn't need a null check
(!_authLifetime.IsValid || !_user.CanAccessCommand(cmd)) && !cmd.IsNoAuth()

which is virtual-call-less.

I'm not proposing to do this as part of this PR.

To be clear, the IGarnetAuthenticator implementation would have to remember the AuthenticationLifetimes it hands back to invalidate them - but I think that's fine, it already has to remember stuff about the user to implement IsAuthenticated.

This would keep the User and authentication separation we have, get the IGarnetAuthenticator interface out of the hotpath, and I think cover all reasonable schemes.

Jun 07 '24 20:06 kevin-montrose

Just noting that in real-world workload, the .NET 8 DynamicPGO (which is disabled in our BDN microbenchmarks to reduce noise) would likely specialize the virtualized calls for common types here and we shouldn't have to do it ourselves. Whether it kicks in here is a bit harder to verify but something to keep in mind.

Jun 07 '24 20:06 PaulusParssinen

Thinking out loud a bit, it seems to me a remaining perf killer is that we have to call into the IGarnetAuthenticator on every command to validate our User is still good - a poll model, if you will. Given that the common case is that the User is still good (because invalidation is rare, AD/Kerberos sessions get measured in hours, password rotations happen on the order of months, etc. etc.) it feels a push model is more appropriate.

Yeah, it makes sense to consider a push-based approach to update the auth status in the rare case of it being revoked instead of having to poll the provider every single time. cc @yangmsft whose team uses the auth feature. (definitely out of scope for this PR)

Jun 07 '24 21:06 badrishc

garnet
garnet copied to clipboard

More Complete ACL Implementation

Framework: net8.0

Branch: main (on Kevin's fork)

Branch: aclImprovements (on Kevin's fork)

Summary

Framework: net8.0

Branch: main (on Kevin's fork)

Branch: aclImprovements (on Kevin's fork)

BDN

main (0acff38b7f343cf26c58a20995b2675f8f42b634)

aclImprovements (f5e16848e96aa152583e9410a750b1e0f31cf4d9)

garnet garnet copied to clipboard

More Complete ACL Implementation

Framework: net8.0

Branch: main (on Kevin's fork)

Branch: aclImprovements (on Kevin's fork)

Summary

Framework: net8.0

Branch: main (on Kevin's fork)

Branch: aclImprovements (on Kevin's fork)

BDN

main (0acff38b7f343cf26c58a20995b2675f8f42b634)

aclImprovements (f5e16848e96aa152583e9410a750b1e0f31cf4d9)

garnet
garnet copied to clipboard