High CPU usage when authentication fails (returns 204)
Describe the bug
When the k6 load test runs against an OpenCloud instance that has no test users seeded, every login attempt fails (the IDP answers with HTTP 204 instead of an error) and the opencloud processes consume over 500% CPU.
Steps to reproduce
- start opencloud
- run the k6 tests without created test users (do not run src-seeds-up-k6.js)

Test script (run as ./script-name.sh):
#!/usr/bin/env bash
export SEED_USERS_TOTAL=30
export ADMIN_LOGIN=admin
export ADMIN_PASSWORD=admin
export PLATFORM_BASE_URL=https://localhost:9200
export AUTH_N_PROVIDER_KOPANO_BASE_URL=https://localhost:9200
export AUTH_N_PROVIDER_KOPANO_REDIRECT_URL=https://localhost:9200/oidc-callback.html
export PLATFORM_TYPE=openCloud
export TEST_KOKO_PLATFORM_020_RAMPING_SLEEP_AFTER_ITERATION=10s
export TEST_KOKO_PLATFORM_020_RAMPING_STAGES_VUS=4
export TEST_KOKO_PLATFORM_020_RAMPING_STAGES_UP_DURATION=20s
export TEST_KOKO_PLATFORM_020_RAMPING_STAGES_PEAK_DURATION=1m
export TEST_KOKO_PLATFORM_020_RAMPING_STAGES_DOWN_DURATION=20s
k6 run packages/k6-tests/artifacts/tests-koko-platform-020-navigate-file-tree-ramping-k6.js
Expected behavior
No excessive CPU usage when login attempts fail.
Actual behavior
When the test runs with invalid credentials (no test users created), the IDP returns HTTP 204 No Content instead of an error. As a result, the proxy and backend processes consume over 500% CPU.
opencloud log:
k6-test log:
ERRO[0021] Uncaught (in promise) Error: logonResponse.status is 204, expected 200
at getContinueURI (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/auth/kopano.ts:95:12(75))
at credential (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/auth/kopano.ts:48:46(15))
at headers (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/auth/kopano.ts:36:36(3))
at file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/utils/http.ts:26:49(24)
at Oe (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/endpoints/graph-v1-me.ts:7:20(20))
at getMyDrives (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/client/me.ts:38:70(40))
at getTestRoot (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tests/src/test/root.ts:48:62(76))
at call (native)
at file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tests/tests/koko/platform/020-navigate-file-tree/simple.k6.ts:38:39(27)
at set (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/utils/store.ts:51:30(16))
at setOrGet (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tdk/src/utils/store.ts:58:23(15))
at navigate_file_tree_020 (file:///Users/v.scharf/Work/opencloud/cdperf/packages/k6-tests/tests/koko/platform/020-navigate-file-tree/simple.k6.ts:37:40(42)) executor=ramping-vus scenario=navigate_file_tree_020
█ TOTAL RESULTS
checks_total.......: 920 44.498963/s
checks_succeeded...: 0.00% 0 out of 920
checks_failed......: 100.00% 920 out of 920
✗ authn -> logonResponse - status
↳ 0% — ✓ 0 / ✗ 920
HTTP
http_req_duration..............: avg=34.02ms min=12.47ms med=30.21ms max=248.94ms p(90)=52.16ms p(95)=61.7ms
{ expected_response:true }...: avg=34.02ms min=12.47ms med=30.21ms max=248.94ms p(90)=52.16ms p(95)=61.7ms
http_req_failed................: 0.00% 0 out of 920
http_reqs......................: 920 44.498963/s
EXECUTION
iteration_duration.............: avg=35.24ms min=13.98ms med=31.28ms max=249.83ms p(90)=53.4ms p(95)=63.96ms
iterations.....................: 920 44.498963/s
vus............................: 3 min=0 max=3
vus_max........................: 4 min=4 max=4
NETWORK
data_received..................: 1.0 MB 49 kB/s
data_sent......................: 412 kB 20 kB/s
Setup
Additional context
This is a 5-second CPU profile recorded while running the test (while opencloud consumes all of the available CPU). As can be seen, about 70% of the CPU time is spent computing argon2 hashes. Another 19% is spent in the garbage collector (unfortunately cut off in this screenshot, so you have to trust me).
The argon2 hash computation happens when the IDM service needs to validate passwords (https://github.com/opencloud-eu/opencloud/blob/main/vendor/github.com/libregraph/idm/pkg/ldappassword/ldappassword.go#L44). Argon2 is deliberately CPU- and memory-intensive.
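Just to illustrate the order of magnitude, here is a minimal Go sketch of a single Argon2id computation. The parameters below are made up for the example and are not necessarily the ones idm uses:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"

	"golang.org/x/crypto/argon2"
)

func main() {
	// Illustrative parameters only; the values idm actually uses may differ.
	var (
		timeCost  uint32 = 1
		memoryKiB uint32 = 64 * 1024 // 64 MiB of memory per hash
		threads   uint8  = 4
		keyLen    uint32 = 32
	)

	salt := make([]byte, 16)
	if _, err := rand.Read(salt); err != nil {
		panic(err)
	}

	start := time.Now()
	// Every bind of the internal idp service user triggers one of these.
	_ = argon2.IDKey([]byte("service-user-password"), salt, timeCost, memoryKiB, threads, keyLen)
	fmt.Printf("one Argon2id computation took %s and used %d KiB of memory\n",
		time.Since(start), memoryKiB)
}
```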
You might wonder why we are checking passwords so often when there is not even a user in the system. This is mainly caused by the idp (lico) service not reusing its connection to the idm (LDAP) service. For every login attempt it opens a new connection and authenticates as the internal idp service user (this alone already triggers the hash computation). That connection is then used to perform an LDAP lookup of the username entered into the login form. In this particular case the lookup does not succeed, an error is returned and the user login fails.
So basically the system is busy re-authenticating the idp service user over and over again.
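Schematically, the pattern looks roughly like the sketch below (using go-ldap; the URL, bind DN, password and helper name are invented for illustration, this is not lico's actual code):

```go
package main

import (
	"fmt"
	"log"

	"github.com/go-ldap/ldap/v3"
)

// lookupUser sketches the current per-attempt pattern: every login attempt
// dials a fresh connection and re-binds as the internal idp service user,
// which forces idm to verify the service user's Argon2 hash each time.
// URL, DNs and password are invented for illustration.
func lookupUser(username string) (*ldap.Entry, error) {
	conn, err := ldap.DialURL("ldaps://localhost:9235")
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	// This bind is what burns the CPU: one Argon2 verification per login attempt.
	if err := conn.Bind("uid=idp,ou=sysusers,o=libregraph-idm", "service-password"); err != nil {
		return nil, err
	}

	req := ldap.NewSearchRequest(
		"ou=users,o=libregraph-idm",
		ldap.ScopeWholeSubtree, ldap.NeverDerefAliases, 1, 0, false,
		"(uid="+ldap.EscapeFilter(username)+")",
		[]string{"dn"},
		nil,
	)
	res, err := conn.Search(req)
	if err != nil {
		return nil, err
	}
	if len(res.Entries) == 0 {
		// User not found: the login fails, but the hash was already computed.
		return nil, fmt.Errorf("user %q not found", username)
	}
	return res.Entries[0], nil
}

func main() {
	if _, err := lookupUser("does-not-exist"); err != nil {
		log.Println("lookup failed:", err)
	}
}
```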
A possible solution would be to make lico use a persistent LDAP connection for the user lookup, like we do in all the other services. We actually submitted a PR for that quite a while ago: https://github.com/libregraph/lico/pull/77; it is basically "just" missing some reconnect logic.
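For illustration, the direction of that fix could look roughly like the sketch below: keep one bound connection and only re-dial and re-bind when it breaks. Again just a sketch under my own assumptions, not the code from the PR:

```go
package ldapconn

import (
	"sync"

	"github.com/go-ldap/ldap/v3"
)

// persistentLDAP keeps a single bound connection around, so the Argon2 check
// for the service user happens once per (re)connect instead of once per login
// attempt. Names and structure are illustrative, not the code from the PR.
type persistentLDAP struct {
	mu       sync.Mutex
	conn     *ldap.Conn
	url      string
	bindDN   string
	bindPass string
}

func (p *persistentLDAP) connect() error {
	conn, err := ldap.DialURL(p.url)
	if err != nil {
		return err
	}
	if err := conn.Bind(p.bindDN, p.bindPass); err != nil {
		conn.Close()
		return err
	}
	p.conn = conn
	return nil
}

// search reuses the bound connection and retries once after reconnecting when
// the connection has gone away -- the "reconnect logic" that is still missing.
func (p *persistentLDAP) search(req *ldap.SearchRequest) (*ldap.SearchResult, error) {
	p.mu.Lock()
	defer p.mu.Unlock()

	if p.conn == nil || p.conn.IsClosing() {
		if err := p.connect(); err != nil {
			return nil, err
		}
	}
	res, err := p.conn.Search(req)
	if err != nil && ldap.IsErrorWithCode(err, ldap.ErrorNetwork) {
		if err := p.connect(); err != nil {
			return nil, err
		}
		return p.conn.Search(req)
	}
	return res, err
}
```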
I am not sure this really deserves Prio 2, as it only happens with the lico / libregraph-idm setup. I guess that is for somebody else to decide. It does highlight, however, that production setups need some kind of rate limiting; otherwise it is pretty easy to burn CPU cycles and degrade service quality just by attempting to log in over and over again.
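To make the rate-limiting remark concrete, a per-IP throttle in front of the logon endpoint could look roughly like this. Purely illustrative: the endpoint path and the limits are assumptions, and OpenCloud's proxy middleware may be structured differently:

```go
package main

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// loginLimiter throttles login attempts per remote IP so that repeated failed
// logins cannot keep the backend busy computing Argon2 hashes. The limits and
// the endpoint path below are illustrative assumptions.
type loginLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newLoginLimiter() *loginLimiter {
	return &loginLimiter{limiters: make(map[string]*rate.Limiter)}
}

func (l *loginLimiter) allow(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.limiters[ip]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(5), 10) // 5 attempts/s, burst of 10
		l.limiters[ip] = lim
	}
	return lim.Allow()
}

func (l *loginLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if !l.allow(ip) {
			http.Error(w, "too many login attempts", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limiter := newLoginLimiter()
	// Hypothetical wiring: throttle only the IDP logon endpoint.
	http.Handle("/signin/v1/identifier/_/logon", limiter.middleware(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// ... forward the request to the IDP here ...
		})))
	_ = http.ListenAndServe(":9200", nil)
}
```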