chef-server
chef-server copied to clipboard
oc_erchef: oc_chef_authz timeout will not cause 500 error
Description
oc_erchef has a timeout for oc_chef_authz connections that defaults to 2000ms. For organizations with thousands of clients, this timeout can be exceeded when querying the clients
group. The resulting timeout aborts the query for hydrating said group, however the timeout is masked from the end user. Erchef does not generate an error but actually returns an empty clients group. All this is happens with the user running the query left in the dark.
For organizations that exceed this timeout, things can become problematic very quickly, especially when restoring ec backups. Without knowing any better, the user restores a backup with an empty clients group wreaking havoc on all subsequent client authorization requests.
Expected Behavior
knife group show clients
should return a fully hydrated and valid clients
group, or return a 500 error, if the timeout was exceeded.
Current Behavior
knife group show clients
will return an empty clients
group, and no errors, if the timeout was exceeded.
Steps to Reproduce (for bugs)
- Add several thousand clients to an org
- Artificially introduce a bottleneck by decreasing the following timeout:
{oc_chef_authz, [
{authz_root_url, "http://127.0.0.1:9463" },
{authz_service, [
{root_url, "http://127.0.0.1:9463" },
%% This was the default {timeout, 2000},
{timeout, 20},
{init_count, 25},
{max_count, 100},
{queue_max, 50},
{cull_interval, {1, min}},
{max_age, {70, sec}},
{max_connection_duration, {70, sec}},
{ibrowse_options, [{connect_timeout, 5000}]}
]},
{cleanup_batch_size, 0}
]},
- Restart erchef and run
knife group show clients
You will see an empty clients
group, something like:
~/ (master *%=)$ knife group show clients -Fjson
{
"actors": [
],
"users": [
],
"clients": [
],
"groups": [
],
"orgname": "brewinc",
"name": "clients",
"groupname": "clients"
}
~ (master *%=)$
Even though the org has thousands of clients:
root@chef-server:~# echo "SELECT COUNT(*) FROM clients WHERE org_id='db33b7b65137ebf5479510e46dd02fbf';" | su -l opscode-pgsql -c 'psql opscode_chef'
count
-------
12537
(1 row)
root@chef-server:~#
- Increase the timeout and repeat step 3 with expected results (clients group fully hydrated)
Server Logs
These are the logs during a timeout. All 200's and a postgresql error.
==> /var/log/opscode/postgresql/9.6/current <==
2018-05-17_00:31:00.56390 ERROR: malformed array literal: "error"
2018-05-17_00:31:00.56392 DETAIL: Array value must start with "{" or dimension information.
2018-05-17_00:31:00.56392 STATEMENT: SELECT name, authz_id FROM clients WHERE authz_id = ANY($1)
2018-05-17_00:31:00.56480 ERROR: malformed array literal: "req_timedout"
2018-05-17_00:31:00.56481 DETAIL: Array value must start with "{" or dimension information.
2018-05-17_00:31:00.56481 STATEMENT: SELECT org_id, name, authz_id FROM groups WHERE authz_id =
ANY($1)
==> /var/log/opscode/nginx/access.log <==
::ffff:192.168.200.1 - - [17/May/2018:00:31:00 +0000] "GET /organizations/brewinc/groups/clients
HTTP/1.1" 200 "0.045" 95 "-" "Chef Knife/14.1.12 (ruby-2.5.1-p57; ohai-14.1.3; x86_64-darwin15;
+https://chef.io)" "127.0.0.1:8000" "200" "0.044" "14.1.12" "algorithm=sha1;version=1.1;" "pivotal"
"2018-05-17T00:31:00Z" "2jmj7l5rSw0yVb/vlWAYkK/YBwk=" 1042
==> /var/log/opscode/opscode-erchef/requests.log.2 <==
2018-05-17T00:31:00Z [email protected] method=GET; path=/organizations/brewinc/groups/clients;
status=200; req_id=g3IAA2QAEGVyY2hlZkAxMjcuMC4wLjECAAAXGgAAAAEAAAAA; org_name=brewinc;
couchdb_groups=false; couchdb_organizations=false; couchdb_containers=false; couchdb_acls=false;
503_mode=false; couchdb_associations=false; couchdb_association_requests=false; req_time=40;
rdbms_time=25; rdbms_count=3; authz_time=12; authz_count=1; user=pivotal; req_api_version=1;
==> /var/log/opscode/oc_bifrost/requests.log.2 <==
2018-05-17T00:31:00Z [email protected] method=GET;
path=/groups/a33c7d9528d8787e85c102351f7c591a/acl/read/actors/25c3cb0f37873779a10fa341e11eac19;
status=200; requestor_id=25c3cb0f37873779a10fa341e11eac19; req_time=9;
rdbms.bifrost_db.has_permission_time=4; rdbms.bifrost_db.has_permission_count=1;
rdbms.bifrost_db.exists_time=3; rdbms.bifrost_db.exists_count=1;
2018-05-17T00:31:00Z [email protected] method=GET;
path=/groups/a33c7d9528d8787e85c102351f7c591a; status=200;
requestor_id=25c3cb0f37873779a10fa341e11eac19; req_time=60; rdbms.bifrost_db.has_permission_time=2;
rdbms.bifrost_db.has_permission_count=1; rdbms.bifrost_db.exists_time=0;
rdbms.bifrost_db.exists_count=1; rdbms.bifrost_db.group_membership_time=50;
rdbms.bifrost_db.group_membership_count=2;
SQL
When I turn the timeout back up and set log_min_duration_statement
down to 15 I see this
==> /var/log/opscode/postgresql/9.6/current <==
2018-05-17_19:53:04.38559 LOG: duration: 48.068 ms execute find_client_name_in_authz_ids: SELECT name, authz_id FROM clients WHERE authz_id = ANY($1)
2018-05-17_19:53:04.38561 DETAIL: parameters: $1 = '{f64159201ded77b6c791625ebaee2dda, ..., many many more uuids}'
awesome information @jeremymv2 - couple of things I'd love to see in addition to the above info.
if you crank the timeout back up above the success threshold, and turn the postgres log_min_duration
setting down to 100, can you grab the query that postgres logs? In addition to that, it would be useful to see an EXPLAIN on that query, plus the request.logs from both erchef and bifrost for that query
Thanks @irvingpop I'm adding the info into the main issue comment.
Interesting. It looks like we're not trapping 'req_timedout' before blindly passing it off to sql. We certainly should catch that value in erchef and return a 500 instead, so at least we know we had a problem instead of silently returning bad data.
It would be nice to be able to get the actual group membership though, but that could be a bit thorny. Increasing that timeout would stave off the issue, but not really prevent it. It wouldn't be useful to increase the timeout much past 5 seconds because all the gen_server calls in the path have a default of 5 seconds, and we'd move the failure there. The bifrost API is pretty much all or nothing, and changing that would be a bit of work.
thanks for thorough description, was wondering why client group members were suddenly gone