trafficserver icon indicating copy to clipboard operation
trafficserver copied to clipboard

ATS 9.1.2 - parent_select/nexthop - failure threshold

Open jp557198 opened this issue 3 years ago • 10 comments

re branch - https://github.com/apache/trafficserver/tree/cb7eda60dc8c2068afd458fd20a8dd6232887f0d

I notice that the nexthop failure count upon a network timeout/event never increments beyond 2. With a failure threshold of 10, requests that land on this 'nexthop' will always have to incur a 2s timeout before moving onto the nexthop in the list. as the threshold is never reached per nexthop's failure tracking.

Failure counts using parent_select works as expected, Thus the failing parent is taken out of rotation per the parent retry timer/window(300)

Test case and other testing parameters found in the below comments.

See the below for a debug output comparison between next_hop and parent_select. One increments failure count(parent_select) and one does not(next_hop).

next_hop

[Apr 23 01:49:27.883] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [75] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:30.308] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [76] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:32.366] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [76] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:34.481] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [77] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:37.474] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [77] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:39.596] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [78] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:41.599] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [78] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:44.439] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [79] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:47.434] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [79] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:49.633] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [80] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:52.639] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [80] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:55.474] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [81] Parent fail count increased to 2 for 192.168.72.208 [Apr 23 01:49:58.468] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [81] Parent fail count increased to 2 for 192.168.72.208

parent_select

[Apr 23 02:11:12.507] [ET_NET 15] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 2 for 192.168.72.208:80 [Apr 23 02:11:14.595] [ET_NET 16] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 3 for 192.168.72.208:80 [Apr 23 02:11:17.589] [ET_NET 17] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 4 for 192.168.72.208:80 [Apr 23 02:11:20.435] [ET_NET 18] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 5 for 192.168.72.208:80 [Apr 23 02:11:22.602] [ET_NET 19] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 6 for 192.168.72.208:80 [Apr 23 02:11:25.587] [ET_NET 0] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 7 for 192.168.72.208:80 [Apr 23 02:11:28.353] [ET_NET 1] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 8 for 192.168.72.208:80 [Apr 23 02:11:30.795] [ET_NET 2] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 9 for 192.168.72.208:80 [Apr 23 02:11:33.758] [ET_NET 3] DEBUG: <ParentSelectionStrategy.cc:86 (markParentDown)> (parent_select) Parent fail count increased to 10 for 192.168.72.208:80

jp557198 avatar Apr 25 '22 21:04 jp557198

strategies.yaml

groups:

  • &nhg1

    • host: origin-active.com

      protocol:

      • scheme: http

        port: 80

        health_check_url: http://origin-active.com:80

  • &nhg2

    • host: origin-backup.com

      protocol:

      • scheme: http

        port: 80

        health_check_url: http://origin-backup.com:80

strategies:

  • strategy: 'nh-1'

    policy: consistent_hash

    go_direct: false

    parent_is_proxy: false

    groups:

    • *nhg1

    • *nhg2

    scheme: http

    failover:

    ring_mode: alternate_ring

    response_codes:

    - 404
    
    - 503
    

    health_check:

jp557198 avatar Apr 25 '22 22:04 jp557198

remap line:

map http://edge.com/ http://origin.com/ @strategy=nh-1

parent related config params:

proxy.config.http.parent_proxy.total_connect_attempts: 4

proxy.config.http.uncacheable_requests_bypass_parent: 0

proxy.config.http.parent_proxy.connect_attempts_timeout: 2

proxy.config.http.parent_proxy.retry_time: 300

proxy.config.http.parent_proxy.fail_threshold: 10

proxy.config.http.parent_proxy.per_parent_connect_attempts: 1

proxy.config.http.no_dns_just_forward_to_parent: 0

proxy.config.http.parent_proxies: NULL

proxy.config.http.parent_proxy.mark_down_hostdb: 0

proxy.config.http.parent_proxy.self_detect: 2

proxy.config.http.forward.proxy_auth_to_parent: 0

proxy.local.http.parent_proxy.disable_connect_tunneling: 0

jp557198 avatar Apr 25 '22 22:04 jp557198

Test case.

Send requests that force a request back to origin. Amount of requests should exceed failure threshold(10) and should be sent in the amount of time not exceeding the retry window(300s)

The below sends 20 requests. Each request appended with a random query string to bust cache and force a request upstream.

for x in {1..20}; do curl -H HOST:edge.com -o /dev/null -v http://edge.com/?$RANDOM; done

jp557198 avatar Apr 25 '22 22:04 jp557198

Expect to see threshold count not increment 2.

[Apr 25 21:52:15.155] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [3] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:01.119] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [4] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:15.548] [ET_NET 19] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [5] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:47.333] [ET_NET 0] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [26] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:49.482] [ET_NET 1] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [27] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:52.077] [ET_NET 2] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [28] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:54.674] [ET_NET 3] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [29] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:57.308] [ET_NET 4] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [30] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:53:59.874] [ET_NET 5] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [31] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:02.755] [ET_NET 6] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [32] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:05.391] [ET_NET 7] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [33] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:08.072] [ET_NET 8] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [34] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:10.563] [ET_NET 9] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [35] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:12.762] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [36] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:15.119] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [37] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:17.258] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [38] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:19.917] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [39] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:22.164] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [40] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:24.641] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [41] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:27.168] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [42] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:29.743] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [43] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:32.692] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [44] Parent fail count increased to 2 for origin-active.com

[Apr 25 21:54:35.252] [ET_NET 19] DEBUG: <NextHopHealthStatus.cc:139 (markNextHop)> (next_hop) [45] Parent fail count increased to 2 for origin-active.com

jp557198 avatar Apr 25 '22 22:04 jp557198

Compiled 'proxy/http/remap/NextHopHealthStatus.cc' against our internal 9.1.2 branch. Failure theshold is incrmeneted as expected with 'down' server being marked as unavailable. In the below access log sample, note the 3s ttms while the down server is being tried and timedout. After try 9, next_hop marks the server as 'unavailable' and immediately sends the request to the available server. Threshold is set to 10.

#cqts|cqtn|chi|chp|phi|caun|cqhl|cqcl|pssc|crc|ttms|cfsc|pshl|pscl|psql|cqhm|cqus|cquuh|cquup|psct|phr|nhi|shn|"Age"|"UserAgent"|X-Forwarded-For|Cache-Control|pqhl|pqcl|pfsc|sshl|sscl|sssc|cqpv|Referer|c-Range|s-Range>|stms|xx|ts_ms_ua_begin|xxx|ats_int

1651526642|02/May/2022:21:24:02 -0000|172.16.21.222|32608|0.0.0.0|-|92|0|200|TCP_MISS|3933|FIN|235|0|235|HEAD|http|edge.com|/?16786|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-2"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|3|-|1651526642266|1|- 1651526646|02/May/2022:21:24:06 -0000|172.16.21.222|32612|0.0.0.0|-|91|0|200|TCP_MISS|3023|FIN|235|0|235|HEAD|http|edge.com|/?8174|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-3"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526646218|1|- 1651526649|02/May/2022:21:24:09 -0000|172.16.21.222|32616|0.0.0.0|-|91|0|200|TCP_MISS|3534|FIN|235|0|235|HEAD|http|edge.com|/?5077|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-4"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526649260|1|- 1651526652|02/May/2022:21:24:12 -0000|172.16.21.222|32620|0.0.0.0|-|92|0|200|TCP_MISS|3470|FIN|235|0|235|HEAD|http|edge.com|/?24473|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-5"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|4|-|1651526652809|2|- 1651526656|02/May/2022:21:24:16 -0000|172.16.21.222|32624|0.0.0.0|-|92|0|200|TCP_MISS|3542|FIN|235|0|235|HEAD|http|edge.com|/?31668|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-6"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526656297|1|- 1651526659|02/May/2022:21:24:19 -0000|172.16.21.222|32628|0.0.0.0|-|92|0|200|TCP_MISS|3725|FIN|235|0|235|HEAD|http|edge.com|/?10137|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"4"|"try-7"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526659858|0|- 1651526663|02/May/2022:21:24:23 -0000|172.16.21.222|32632|0.0.0.0|-|92|0|200|TCP_MISS|3121|FIN|235|0|235|HEAD|http|edge.com|/?31030|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-8"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526663601|1|- 1651526666|02/May/2022:21:24:26 -0000|172.16.21.222|32636|0.0.0.0|-|91|0|200|TCP_MISS|3248|FIN|235|0|235|HEAD|http|edge.com|/?4608|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"3"|"try-9"|-|-|228|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526666742|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32640|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?24660|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-10"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670007|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32642|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?22349|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-11"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526670027|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32644|0.0.0.0|-|93|0|200|TCP_MISS|2|FIN|235|0|235|HEAD|http|edge.com|/?10863|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-12"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670049|1|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32646|0.0.0.0|-|92|0|200|TCP_MISS|7|FIN|235|0|235|HEAD|http|edge.com|/?4679|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-13"|-|-|229|0|FIN|238|0|200|http/1.1|-|-|-|0|-|1651526670067|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32648|0.0.0.0|-|93|0|200|TCP_MISS|3|FIN|235|0|235|HEAD|http|edge.com|/?16688|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-14"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|1|-|1651526670087|0|- 1651526670|02/May/2022:21:24:30 -0000|172.16.21.222|32650|0.0.0.0|-|93|0|200|TCP_MISS|6|FIN|235|0|235|HEAD|http|edge.com|/?26145|text/html|PARENT_HIT|172.16.21.223|origin-backup.com|"0"|"try-15"|-|-|230|0|FIN|238|0|200|http/1.1|-|-|-|2|-|1651526670105|1|-

jp557198 avatar May 02 '22 22:05 jp557198

debug shows the failure count incrementing as expected.

[May 2 21:24:02.246] [ET_NET 10] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [97] Parent fail count increased to 2 for origin-active.com [May 2 21:24:06.197] [ET_NET 11] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [98] Parent fail count increased to 3 for origin-active.com [May 2 21:24:09.239] [ET_NET 12] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [99] Parent fail count increased to 4 for origin-active.com [May 2 21:24:12.792] [ET_NET 13] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [100] Parent fail count increased to 5 for origin-active.com [May 2 21:24:16.275] [ET_NET 14] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [101] Parent fail count increased to 6 for origin-active.com [May 2 21:24:19.836] [ET_NET 15] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [102] Parent fail count increased to 7 for origin-active.com [May 2 21:24:23.581] [ET_NET 16] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [103] Parent fail count increased to 8 for origin-active.com [May 2 21:24:26.720] [ET_NET 17] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [104] Parent fail count increased to 9 for origin-active.com [May 2 21:24:29.988] [ET_NET 18] DEBUG: <NextHopHealthStatus.cc:138 (markNextHop)> (next_hop) [105] Parent fail count increased to 10 for origin-active.com

jp557198 avatar May 02 '22 22:05 jp557198

I can take this if John decides his retirement doesn't mean he's our no-cost slave.

ywkaras avatar May 02 '22 23:05 ywkaras

I can take this if John decides his retirement doesn't mean he's our no-cost slave.

no need. issue has been resolved with john's patch.

jp557198 avatar May 03 '22 12:05 jp557198

This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.

github-actions[bot] avatar May 04 '23 01:05 github-actions[bot]

@jp557198 Was this issue accidentally reopened after being closed?

JosiahWI avatar Aug 08 '24 17:08 JosiahWI