vault
vault copied to clipboard
1.15.x HTTP API endpoint "/sys/capabilities-self" queries very slow
Hi,
I'm using a python script that checks capabilities on more than 8500 kv/secret (v2) paths for end users token.
Since versions 1.15.x, the script has become very slow ! ... and can exceed some token's ttl.
After some tests, I found out that the problem was with the "/sys/capabilities-self" API endpoint. This API endpoint query may take longer than 120 seconds compared to 5 seconds with version 1.14.x.
Additional information : Tests realised with version 1.15.0/1/2. Downgrade vault to version 1.14.6, restore performance. Problem also present with "/sys/capabilities" endpoint.
For example : Endpoint query "/sys/capabilities-self" on 1000 kv/secret paths takes :
- 15,4 seconds with version 1.15.2
- 1.2 seconds with version 1.14.6
Environment :
- HA Cluster with 3 members on RAFT Storage
- Debian bookworm
Best regards,
Hi, I need a bit more information in order to try to reproduce this. Do the tokens that you're querying for have token_policies or identity_policies? And are you using namespaces when you're making these queries?
Hi,
I'm not using namespaces.
Users tokens are created on the fly and linked to a role and a policy :
https://localhost:8080/v1/auth/token/roles/gu-tac
with parameters :
'{"allowed_policies": ["gu-tac"],"bound_cidrs": ["127.0.0.1/32","172.19.xxx.xxx/32"], "renewable": false, "token_explicit_max_ttl":"3m", "token_no_default_policy":"true", "orphan": true}'
https://localhost:8080/v1/auth/token/create/gu-tac
with parameters :
'{"meta": {"user": "john.doe"}, "ttl": "3m"}'
Bellow a typical example of policies we use :
path "sys/capabilities-self" {
capabilities = ["update"]
}
path "TEST/+/SUB/*" {capabilities = ["read"]}
path "TEST2/+/SUB2/secret2" {capabilities = ["read"]}
[....]
path "LDAP/+/secret_ldap" {capabilities = ["read"]}
Thanks a lot !
@Plipplopplip how many policies do you have?
@miagilepner Hello, Currently we have 38 policies.
Hi @Plipplopplip :wave:, I'm taking over from @miagilepner looking into this issue. Looking at the sample policy you provided above, is the + symbol meant to replace the operational prefixes: metadata, data, delete, undelete, destroy, and subkeys? Could you also give me a general idea of the structure of the secrets in that KV secrets engine mount? Are they all grouped together under a common parent path, or are to distributed into sub-trees. I don't think that I need specifics here, just a general idea.
I want to make sure that I'm setting up a test environment that replicates the circumstances that you're experiencing.
Hi @marcboudreau ! The aim of the + symbol is to permit to end-user to retrieved "data" and "metadata". Secrets are stored in 5 engines and the longuest path is currently 4, for example :
ENGINE1\SUB1\SUB11\secret ENGINE1\SUB2\SUB21\secret
ENGINE2\SUB1\SUB11\SUB111\secret ENGINE2\SUB1\SUB12\SUB121\secret ENGINE2\SUB2\SUB21\SUB211\secret ENGINE2\SUB2\SUB21\secret
ENGINE3\secret ENGINE3\SUB1\secret
ENGINE4\SUB1\secret ENGINE4\SUB2\secret
ENGINE5\SUB1\SUB11\SUB111\secret
Regards
The principal difference in the sys/capabilities-self handler, between 1.15.2 and 1.14.6, is the introduction of the Events beta feature. I've looked at the code closely for any kind loops that have been introduced that could augment the response time by an order of magnitude but didn't find anything. Could you kindly confirm that the policies involved (those associated with the Vault Token used in the sys/capabilities-self requests) do not include the subscribe_event_types argument as shown in this example policy.
I'm wondering if the slowdown is unrelated to this particular endpoint. Would you be able to alter the test to something else (e.g. the metadata of the secrets)?
Hello @marcboudreau .... I confirm that policies don't include "subscribe_event_types" entry.
I'll try to change the test to retrieve capabilities on "metadata" vs "data". I'll keep you informed
Also if you could capture profiling data with the vault debug command and send us the debug package for both the 1.14.6 (good) and 1.15.x (bad), it will allow us to study it to try and determine what is causing the slow down you're experiencing.
Hi @marcboudreau
I made some new tests with versions 1.15.4 vs 1.14.8 and with a list of 1000 paths to retrieve capabilities on secrets's "data" or "metadata". Results are unfortunately identical :
-
With 1.14.8 :
- capabilities-self on 1000 paths "data" >>> 1.04 seconds
- capabilities-self on 1000 paths "metadata" >>> 0.98 seconds
-
With 1.15.4 :
- capabilities-self on 1000 paths "data" >>> 15.42 seconds
- capabilities-self on 1000 paths "metadata" >>> 15.43 seconds
Hi @marcboudreau
First of all, sorry for my very late response !
I made some new tests with versions 1.14.10 vs 1.15.6 vs 1.16.1 for endpoint "/sys/capabilities-self" with a list of 1000 KV secrets path. Results are same as mentioned above :
-
With 1.14.10 : response time >>> 1.03 seconds
-
With 1.15.6 : response time >>> 14.62 seconds
-
With 1.16.1 : response time >>> 15.34 seconds
You'll find in attached debug files !
Thanks in advance.
Philippe vault-debug-1_14_10.tar.gz vault-debug-1_15_6.tar.gz vault-debug-1_16_1.tar.gz