vault icon indicating copy to clipboard operation
vault copied to clipboard

1.15.x HTTP API endpoint "/sys/capabilities-self" queries very slow

Open Plipplopplip opened this issue 1 year ago • 10 comments
trafficstars

Hi,

I'm using a python script that checks capabilities on more than 8500 kv/secret (v2) paths for end users token.

Since versions 1.15.x, the script has become very slow ! ... and can exceed some token's ttl.

After some tests, I found out that the problem was with the "/sys/capabilities-self" API endpoint. This API endpoint query may take longer than 120 seconds compared to 5 seconds with version 1.14.x.

Additional information : Tests realised with version 1.15.0/1/2. Downgrade vault to version 1.14.6, restore performance. Problem also present with "/sys/capabilities" endpoint.

For example : Endpoint query "/sys/capabilities-self" on 1000 kv/secret paths takes :

  • 15,4 seconds with version 1.15.2
  • 1.2 seconds with version 1.14.6

Environment :

  • HA Cluster with 3 members on RAFT Storage
  • Debian bookworm

Best regards,

Plipplopplip avatar Nov 29 '23 16:11 Plipplopplip

Hi, I need a bit more information in order to try to reproduce this. Do the tokens that you're querying for have token_policies or identity_policies? And are you using namespaces when you're making these queries?

miagilepner avatar Nov 30 '23 15:11 miagilepner

Hi,

I'm not using namespaces.

Users tokens are created on the fly and linked to a role and a policy :

https://localhost:8080/v1/auth/token/roles/gu-tac
with parameters :
'{"allowed_policies": ["gu-tac"],"bound_cidrs": ["127.0.0.1/32","172.19.xxx.xxx/32"], "renewable": false, "token_explicit_max_ttl":"3m", "token_no_default_policy":"true", "orphan": true}'
https://localhost:8080/v1/auth/token/create/gu-tac
with parameters :
'{"meta": {"user": "john.doe"}, "ttl": "3m"}'

Bellow a typical example of policies we use :

path "sys/capabilities-self" {
    capabilities = ["update"]
}

path "TEST/+/SUB/*" {capabilities = ["read"]}
path "TEST2/+/SUB2/secret2" {capabilities = ["read"]}
[....]
path "LDAP/+/secret_ldap" {capabilities = ["read"]}

Thanks a lot !

Plipplopplip avatar Dec 01 '23 15:12 Plipplopplip

@Plipplopplip how many policies do you have?

miagilepner avatar Dec 08 '23 10:12 miagilepner

@miagilepner Hello, Currently we have 38 policies.

Plipplopplip avatar Dec 11 '23 08:12 Plipplopplip

Hi @Plipplopplip :wave:, I'm taking over from @miagilepner looking into this issue. Looking at the sample policy you provided above, is the + symbol meant to replace the operational prefixes: metadata, data, delete, undelete, destroy, and subkeys? Could you also give me a general idea of the structure of the secrets in that KV secrets engine mount? Are they all grouped together under a common parent path, or are to distributed into sub-trees. I don't think that I need specifics here, just a general idea.

I want to make sure that I'm setting up a test environment that replicates the circumstances that you're experiencing.

marcboudreau avatar Dec 11 '23 22:12 marcboudreau

Hi @marcboudreau ! The aim of the + symbol is to permit to end-user to retrieved "data" and "metadata". Secrets are stored in 5 engines and the longuest path is currently 4, for example :

ENGINE1\SUB1\SUB11\secret ENGINE1\SUB2\SUB21\secret

ENGINE2\SUB1\SUB11\SUB111\secret ENGINE2\SUB1\SUB12\SUB121\secret ENGINE2\SUB2\SUB21\SUB211\secret ENGINE2\SUB2\SUB21\secret

ENGINE3\secret ENGINE3\SUB1\secret

ENGINE4\SUB1\secret ENGINE4\SUB2\secret

ENGINE5\SUB1\SUB11\SUB111\secret

Regards

Plipplopplip avatar Dec 12 '23 09:12 Plipplopplip

The principal difference in the sys/capabilities-self handler, between 1.15.2 and 1.14.6, is the introduction of the Events beta feature. I've looked at the code closely for any kind loops that have been introduced that could augment the response time by an order of magnitude but didn't find anything. Could you kindly confirm that the policies involved (those associated with the Vault Token used in the sys/capabilities-self requests) do not include the subscribe_event_types argument as shown in this example policy.

I'm wondering if the slowdown is unrelated to this particular endpoint. Would you be able to alter the test to something else (e.g. the metadata of the secrets)?

marcboudreau avatar Dec 12 '23 17:12 marcboudreau

Hello @marcboudreau .... I confirm that policies don't include "subscribe_event_types" entry.

I'll try to change the test to retrieve capabilities on "metadata" vs "data". I'll keep you informed

Plipplopplip avatar Dec 13 '23 13:12 Plipplopplip

Also if you could capture profiling data with the vault debug command and send us the debug package for both the 1.14.6 (good) and 1.15.x (bad), it will allow us to study it to try and determine what is causing the slow down you're experiencing.

marcboudreau avatar Dec 13 '23 18:12 marcboudreau

Hi @marcboudreau

I made some new tests with versions 1.15.4 vs 1.14.8 and with a list of 1000 paths to retrieve capabilities on secrets's "data" or "metadata". Results are unfortunately identical :

  • With 1.14.8 :

    • capabilities-self on 1000 paths "data" >>> 1.04 seconds
    • capabilities-self on 1000 paths "metadata" >>> 0.98 seconds
  • With 1.15.4 :

    • capabilities-self on 1000 paths "data" >>> 15.42 seconds
    • capabilities-self on 1000 paths "metadata" >>> 15.43 seconds

Plipplopplip avatar Dec 22 '23 10:12 Plipplopplip

Hi @marcboudreau

First of all, sorry for my very late response !

I made some new tests with versions 1.14.10 vs 1.15.6 vs 1.16.1 for endpoint "/sys/capabilities-self" with a list of 1000 KV secrets path. Results are same as mentioned above :

  • With 1.14.10 : response time >>> 1.03 seconds

  • With 1.15.6 : response time >>> 14.62 seconds

  • With 1.16.1 : response time >>> 15.34 seconds

You'll find in attached debug files !

Thanks in advance.

Philippe vault-debug-1_14_10.tar.gz vault-debug-1_15_6.tar.gz vault-debug-1_16_1.tar.gz

Plipplopplip avatar Apr 17 '24 13:04 Plipplopplip