ATS 9.1.2 - parent_select/nexthop - failure threshold
re branch - https://github.com/apache/trafficserver/tree/cb7eda60dc8c2068afd458fd20a8dd6232887f0d
I notice that the nexthop failure count upon a network timeout/event never increments beyond 2. With a failure threshold of 10, requests that land on this 'nexthop' will always have to incur a 2s timeout before moving onto the nexthop in the list. as the threshold is never reached per nexthop's failure tracking.
Failure counts using parent_select works as expected, Thus the failing parent is taken out of rotation per the parent retry timer/window(300)
Test case and other testing parameters found in the below comments.
See the below for a debug output comparison between next_hop and parent_select. One increments failure count(parent_select) and one does not(next_hop).
next_hop
[Apr 23 01:49:27.883] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [75] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:30.308] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [76] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:32.366] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [76] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:34.481] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [77] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:37.474] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [77] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:39.596] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [78] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:41.599] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [78] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:44.439] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [79] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:47.434] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [79] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:49.633] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [80] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:52.639] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [80] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:55.474] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [81] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:58.468] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [81] Parent fail count increased to 2 for 192.168.72.208
parent_select
[Apr 23 02:11:12.507] [ET_NET 15] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 2 for 192.168.72.208:80 [Apr 23 02:11:14.595] [ET_NET 16] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 3 for 192.168.72.208:80 [Apr 23 02:11:17.589] [ET_NET 17] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 4 for 192.168.72.208:80 [Apr 23 02:11:20.435] [ET_NET 18] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 5 for 192.168.72.208:80 [Apr 23 02:11:22.602] [ET_NET 19] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 6 for 192.168.72.208:80 [Apr 23 02:11:25.587] [ET_NET 0] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 7 for 192.168.72.208:80 [Apr 23 02:11:28.353] [ET_NET 1] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 8 for 192.168.72.208:80 [Apr 23 02:11:30.795] [ET_NET 2] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 9 for 192.168.72.208:80 [Apr 23 02:11:33.758] [ET_NET 3] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 10 for 192.168.72.208:80
strategies.yaml
groups:
-
&nhg1
-
host: origin-active.com
protocol:
-
scheme: http
port: 80
health_check_url: http://origin-active.com:80
-
-
-
&nhg2
-
host: origin-backup.com
protocol:
-
scheme: http
port: 80
health_check_url: http://origin-backup.com:80
-
-
strategies:
-
strategy: 'nh-1'
policy: consistent_hash
go_direct: false
parent_is_proxy: false
groups:
-
*nhg1
-
*nhg2
scheme: http
failover:
ring_mode: alternate_ring
response_codes:
- 404 - 503health_check:
-
remap line:
map http://edge.com/ http://origin.com/ @strategy=nh-1
parent related config params:
proxy.config.http.parent_proxy.total_connect_attempts: 4
proxy.config.http.uncacheable_requests_bypass_parent: 0
proxy.config.http.parent_proxy.connect_attempts_timeout: 2
proxy.config.http.parent_proxy.retry_time: 300
proxy.config.http.parent_proxy.fail_threshold: 10
proxy.config.http.parent_proxy.per_parent_connect_attempts: 1
proxy.config.http.no_dns_just_forward_to_parent: 0
proxy.config.http.parent_proxies: NULL
proxy.config.http.parent_proxy.mark_down_hostdb: 0
proxy.config.http.parent_proxy.self_detect: 2
proxy.config.http.forward.proxy_auth_to_parent: 0
proxy.local.http.parent_proxy.disable_connect_tunneling: 0
Test case.
Send requests that force a request back to origin. Amount of requests should exceed failure threshold(10) and should be sent in the amount of time not exceeding the retry window(300s)
The below sends 20 requests. Each request appended with a random query string to bust cache and force a request upstream.
for x in {1..20}; do curl -H HOST:edge.com -o /dev/null -v http://edge.com/?$RANDOM; done
Expect to see threshold count not increment 2.
[Apr 25 21:52:15.155] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [3] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:01.119] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [4] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:15.548] [ET_NET 19] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [5] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:47.333] [ET_NET 0] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [26] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:49.482] [ET_NET 1] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [27] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:52.077] [ET_NET 2] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [28] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:54.674] [ET_NET 3] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [29] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:57.308] [ET_NET 4] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [30] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:53:59.874] [ET_NET 5] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [31] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:02.755] [ET_NET 6] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [32] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:05.391] [ET_NET 7] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [33] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:08.072] [ET_NET 8] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [34] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:10.563] [ET_NET 9] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [35] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:12.762] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [36] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:15.119] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [37] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:17.258] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [38] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:19.917] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [39] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:22.164] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [40] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:24.641] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [41] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:27.168] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [42] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:29.743] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [43] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:32.692] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [44] Parent fail count increased to 2 for origin-active.com
[Apr 25 21:54:35.252] [ET_NET 19] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [45] Parent fail count increased to 2 for origin-active.com
Compiled 'proxy/http/remap/NextHopHealthStatus.cc' against our internal 9.1.2 branch. Failure theshold is incrmeneted as expected with 'down' server being marked as unavailable. In the below access log sample, note the 3s ttms while the down server is being tried and timedout. After try 9, next_hop marks the server as 'unavailable' and immediately sends the request to the available server. Threshold is set to 10.
#cqts|cqtn|chi|chp|phi|caun|cqhl|cqcl|pssc|crc|ttms|cfsc|pshl|pscl|psql|cqhm|cqus|cquuh|cquup|psct|phr|nhi|shn|"Age"|"UserAgent"|X-Forwarded-For|Cache-Control|pqhl|pqcl|pfsc|sshl|sscl|sssc|cqpv|Referer|c-Range|s-Range>|stms|xx|ts_ms_ua_begin|xxx|ats_int
1651526642|02/May/2022:21:24:02 -0000|172.16.21.222|32608|0.0.0.0|-|92|0|200|TCP_MISS|3933|FIN|235|0|235|HEAD|http|edge.com|/?16786|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-2"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|3|-|1651526642266|1|- 1651526646|02/May/2022:21:24:06 -0000|172.16.21.222|32612|0.0.0.0|-|91|0|200|TCP_MISS|3023|FIN|235|0|235|HEAD|http|edge.com|/?8174|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-3"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526646218|1|- 1651526649|02/May/2022:21:24:09 -0000|172.16.21.222|32616|0.0.0.0|-|91|0|200|TCP_MISS|3534|FIN|235|0|235|HEAD|http|edge.com|/?5077|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-4"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526649260|1|- 1651526652|02/May/2022:21:24:12 -0000|172.16.21.222|32620|0.0.0.0|-|92|0|200|TCP_MISS|3470|FIN|235|0|235|HEAD|http|edge.com|/?24473|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-5"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|4|-|1651526652809|2|- 1651526656|02/May/2022:21:24:16 -0000|172.16.21.222|32624|0.0.0.0|-|92|0|200|TCP_MISS|3542|FIN|235|0|235|HEAD|http|edge.com|/?31668|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-6"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526656297|1|- 1651526659|02/May/2022:21:24:19 -0000|172.16.21.222|32628|0.0.0.0|-|92|0|200|TCP_MISS|3725|FIN|235|0|235|HEAD|http|edge.com|/?10137|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-7"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526659858|0|- 1651526663|02/May/2022:21:24:23 -0000|172.16.21.222|32632|0.0.0.0|-|92|0|200|TCP_MISS|3121|FIN|235|0|235|HEAD|http|edge.com|/?31030|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-8"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526663601|1|- 1651526666|02/May/2022:21:24:26 -0000|172.16.21.222|32636|0.0.0.0|-|91|0|200|TCP_MISS|3248|FIN|235|0|235|HEAD|http|edge.com|/?4608|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-9"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526666742|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32640|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?24660|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-10"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670007|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32642|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?22349|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-11"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526670027|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32644|0.0.0.0|-|93|0|200|TCP_MISS|2|FIN|235|0|235|HEAD|http|edge.com|/?10863|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-12"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670049|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32646|0.0.0.0|-|92|0|200|TCP_MISS|7|FIN|235|0|235|HEAD|http|edge.com|/?4679|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-13"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|0|-|1651526670067|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32648|0.0.0.0|-|93|0|200|TCP_MISS|3|FIN|235|0|235|HEAD|http|edge.com|/?16688|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-14"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670087|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32650|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?26145|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-15"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526670105|1|-
debug shows the failure count incrementing as expected.
[May 2 21:24:02.246] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [97] Parent fail count increased to 2 for origin-active.com [May 2 21:24:06.197] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [98] Parent fail count increased to 3 for origin-active.com [May 2 21:24:09.239] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [99] Parent fail count increased to 4 for origin-active.com [May 2 21:24:12.792] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [100] Parent fail count increased to 5 for origin-active.com [May 2 21:24:16.275] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [101] Parent fail count increased to 6 for origin-active.com [May 2 21:24:19.836] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [102] Parent fail count increased to 7 for origin-active.com [May 2 21:24:23.581] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [103] Parent fail count increased to 8 for origin-active.com [May 2 21:24:26.720] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [104] Parent fail count increased to 9 for origin-active.com [May 2 21:24:29.988] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [105] Parent fail count increased to 10 for origin-active.com
I can take this if John decides his retirement doesn't mean he's our no-cost slave.
I can take this if John decides his retirement doesn't mean he's our no-cost slave.
no need. issue has been resolved with john's patch.
This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.
@jp557198 Was this issue accidentally reopened after being closed?