chef-server oc_erchef: oc_chef_authz timeout will not cause 500 error

Description

oc_erchef has a timeout for oc_chef_authz connections that defaults to 2000ms. For organizations with thousands of clients, this timeout can be exceeded when querying the clients group. The resulting timeout aborts the query for hydrating said group, however the timeout is masked from the end user. Erchef does not generate an error but actually returns an empty clients group. All this is happens with the user running the query left in the dark.

For organizations that exceed this timeout, things can become problematic very quickly, especially when restoring ec backups. Without knowing any better, the user restores a backup with an empty clients group wreaking havoc on all subsequent client authorization requests.

Expected Behavior

knife group show clients should return a fully hydrated and valid clients group, or return a 500 error, if the timeout was exceeded.

Current Behavior

knife group show clients will return an empty clients group, and no errors, if the timeout was exceeded.

Steps to Reproduce (for bugs)

Add several thousand clients to an org
Artificially introduce a bottleneck by decreasing the following timeout:

    {oc_chef_authz, [
        {authz_root_url, "http://127.0.0.1:9463" },
        {authz_service, [
            {root_url, "http://127.0.0.1:9463" },
            %% This was the default {timeout, 2000},
            {timeout, 20},
            {init_count, 25},
            {max_count, 100},
            {queue_max, 50},
            {cull_interval, {1, min}},
            {max_age, {70, sec}},
            {max_connection_duration, {70, sec}},
            {ibrowse_options, [{connect_timeout, 5000}]}
        ]},
        {cleanup_batch_size, 0}
    ]},

Restart erchef and run knife group show clients

You will see an empty clients group, something like:

~/ (master *%=)$ knife group show clients -Fjson
{
  "actors": [

  ],
  "users": [

  ],
  "clients": [

  ],
  "groups": [

  ],
  "orgname": "brewinc",
  "name": "clients",
  "groupname": "clients"
}
~ (master *%=)$

Even though the org has thousands of clients:

root@chef-server:~# echo "SELECT COUNT(*) FROM clients WHERE org_id='db33b7b65137ebf5479510e46dd02fbf';" | su -l opscode-pgsql -c 'psql opscode_chef'
 count
-------
 12537
(1 row)

root@chef-server:~#

Increase the timeout and repeat step 3 with expected results (clients group fully hydrated)

Server Logs

These are the logs during a timeout. All 200's and a postgresql error.

==> /var/log/opscode/postgresql/9.6/current <==
2018-05-17_00:31:00.56390 ERROR:  malformed array literal: "error"
2018-05-17_00:31:00.56392 DETAIL:  Array value must start with "{" or dimension information.
2018-05-17_00:31:00.56392 STATEMENT:  SELECT name, authz_id FROM clients WHERE authz_id = ANY($1)
2018-05-17_00:31:00.56480 ERROR:  malformed array literal: "req_timedout"
2018-05-17_00:31:00.56481 DETAIL:  Array value must start with "{" or dimension information.
2018-05-17_00:31:00.56481 STATEMENT:  SELECT org_id, name, authz_id FROM groups WHERE authz_id =
ANY($1)

==> /var/log/opscode/nginx/access.log <==
::ffff:192.168.200.1 - - [17/May/2018:00:31:00 +0000]  "GET /organizations/brewinc/groups/clients
HTTP/1.1" 200 "0.045" 95 "-" "Chef Knife/14.1.12 (ruby-2.5.1-p57; ohai-14.1.3; x86_64-darwin15;
+https://chef.io)" "127.0.0.1:8000" "200" "0.044" "14.1.12" "algorithm=sha1;version=1.1;" "pivotal"
"2018-05-17T00:31:00Z" "2jmj7l5rSw0yVb/vlWAYkK/YBwk=" 1042

==> /var/log/opscode/opscode-erchef/requests.log.2 <==
2018-05-17T00:31:00Z [email protected] method=GET; path=/organizations/brewinc/groups/clients;
status=200; req_id=g3IAA2QAEGVyY2hlZkAxMjcuMC4wLjECAAAXGgAAAAEAAAAA; org_name=brewinc;
couchdb_groups=false; couchdb_organizations=false; couchdb_containers=false; couchdb_acls=false;
503_mode=false; couchdb_associations=false; couchdb_association_requests=false; req_time=40;
rdbms_time=25; rdbms_count=3; authz_time=12; authz_count=1; user=pivotal; req_api_version=1;

==> /var/log/opscode/oc_bifrost/requests.log.2 <==
2018-05-17T00:31:00Z [email protected] method=GET;
path=/groups/a33c7d9528d8787e85c102351f7c591a/acl/read/actors/25c3cb0f37873779a10fa341e11eac19;
status=200; requestor_id=25c3cb0f37873779a10fa341e11eac19; req_time=9;
rdbms.bifrost_db.has_permission_time=4; rdbms.bifrost_db.has_permission_count=1;
rdbms.bifrost_db.exists_time=3; rdbms.bifrost_db.exists_count=1;
2018-05-17T00:31:00Z [email protected] method=GET;
path=/groups/a33c7d9528d8787e85c102351f7c591a; status=200;
requestor_id=25c3cb0f37873779a10fa341e11eac19; req_time=60; rdbms.bifrost_db.has_permission_time=2;
rdbms.bifrost_db.has_permission_count=1; rdbms.bifrost_db.exists_time=0;
rdbms.bifrost_db.exists_count=1; rdbms.bifrost_db.group_membership_time=50;
rdbms.bifrost_db.group_membership_count=2;

SQL

When I turn the timeout back up and set log_min_duration_statement down to 15 I see this

==> /var/log/opscode/postgresql/9.6/current <==
2018-05-17_19:53:04.38559 LOG:  duration: 48.068 ms  execute find_client_name_in_authz_ids: SELECT name, authz_id FROM clients WHERE authz_id = ANY($1)
2018-05-17_19:53:04.38561 DETAIL:  parameters: $1 = '{f64159201ded77b6c791625ebaee2dda, ..., many many more uuids}'

May 16 '18 18:05 jeremymv2

awesome information @jeremymv2 - couple of things I'd love to see in addition to the above info.

if you crank the timeout back up above the success threshold, and turn the postgres log_min_duration setting down to 100, can you grab the query that postgres logs? In addition to that, it would be useful to see an EXPLAIN on that query, plus the request.logs from both erchef and bifrost for that query

May 16 '18 18:05 irvingpop

Thanks @irvingpop I'm adding the info into the main issue comment.

May 17 '18 00:05 jeremymv2

Interesting. It looks like we're not trapping 'req_timedout' before blindly passing it off to sql. We certainly should catch that value in erchef and return a 500 instead, so at least we know we had a problem instead of silently returning bad data.

It would be nice to be able to get the actual group membership though, but that could be a bit thorny. Increasing that timeout would stave off the issue, but not really prevent it. It wouldn't be useful to increase the timeout much past 5 seconds because all the gen_server calls in the path have a default of 5 seconds, and we'd move the failure there. The bifrost API is pretty much all or nothing, and changing that would be a bit of work.

May 18 '18 02:05 markan

thanks for thorough description, was wondering why client group members were suddenly gone

Jun 14 '22 11:06 vytstank

chef-server chef-server copied to clipboard

oc_erchef: oc_chef_authz timeout will not cause 500 error

Description

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

Server Logs

SQL

chef-server
chef-server copied to clipboard