trafficserver icon indicating copy to clipboard operation
trafficserver copied to clipboard

ATS SM hang when trying to get chunked content from cache

Open wjnforever opened this issue 7 years ago • 4 comments

Recently in our CDN project, we found an issue when client do a continuous access from Nginx to ATS(both 5.3.2 and 6.x), for example like 'for i in {1..10}; do curl -v -H"host:test.nginx.com" "http://test.nginx.com" -x127.1:80 -o /dev/null; done' "127.1:80" is Nginx interface which will be proxied to ATS, and then be proxied to final origin server. While origin server reply with different response and one of them is empty(content-length=0), then after this empty response replace the previous non-empty cache in ATS, client get stuck until download timeout. We try to simulate the problem with the following simple website

from BaseHTTPServer import BaseHTTPRequestHandler import random a = 1 class TodoHandler(BaseHTTPRequestHandler): protocol_version = 'HTTP/1.1' def do_GET(self): global a #r = random.randint(0,1) if a % 2 == 0: r = 0 else: r = 1 self.send_response(200) ##r = 0 if r == 0: uu = "Not Found" self.send_header("Connection", "keep-alive") self.send_header("Content-type", "text/html") self.send_header("Content-Length", str(len(uu))) self.send_header("Cache-Control", "max-age=10") self.send_header("Expires", "Thu, 01 Jan 1970 00:00:00 GMT") self.send_header("Accept-Ranges", "bytes") self.end_headers() print(a, ": length", str(len(uu), r) self.wfile.write(uu) else: #self.send_header("Content-type", "video/mp4") self.send_header("Connection", "keep-alive") self.send_header("Cache-Control", "max-age=10") self.send_header("Content-Length", 0) self.send_header("Accept-Ranges", "bytes") self.send_header("Connection", "keep-alive") self.end_headers() print(a, ": length 0", r) a += 1

if name == 'main': # Start a simple server, and loop forever from BaseHTTPServer import HTTPServer server = HTTPServer(('0.0.0.0', 8081), TodoHandler) print("Starting server, use <Ctrl-C> to stop") server.serve_forever()

After checking code and log, we found this zero length response will lead to a resident alternate rewritten int CacheVC::updateVector(int /* event ATS_UNUSED /, Event * / e ATS_UNUSED */) if (small_doc && have_res_alt && (fragment || (f.update && !total_len))) { ... f.rewrite_resident_alt = 1; write_len = doc->data_len(); Debug("cache_update_alt", "rewriting resident alt size: %d key: %X, first_key: %X", write_len, doc->key.slice32(0), first_key.slice32(0)); }

but only replace response header in CacheHTTPInfoVector and still keep the previous content which is not as desired. static int agg_copy(char *p, CacheVC *vc) #ifdef HTTP_CACHE if (vc->f.rewrite_resident_alt) { ink_assert(vc->f.use_first_key); Doc *res_doc = (Doc *)vc->first_buf->data(); res_alt_blk = new_IOBufferBlock(vc->first_buf, res_doc->data_len(), sizeofDoc + res_doc->hlen); doc->key = res_doc->key; doc->total_len = res_doc->data_len(); } #endif

#ifdef HTTP_CACHE if (vc->f.rewrite_resident_alt) iobufferblock_memcpy(doc->data(), vc->write_len, res_alt_blk, 0); else #endif

As a result when found content-length(0) in header doesn't match the object length in cache , ATS make it a chunked reponse with previous content. void HttpTransact::handle_content_length_header(State *s, HTTPHdr *header, HTTPHdr *base) if ((int64_t)s->cache_info.object_read->object_size_get() == cl) { s->hdr_info.trust_response_cl = true; } else { DebugTxn("http_trans", "Content Length header and cache object size mismatch." "Disabling keep-alive"); s->hdr_info.trust_response_cl = false; } Then this transform stuck at the user agent [VC_EVENT_WRITE_READY] until timeout finally, and couldn't serve next request. Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_chunk) creating a chunk of size 9 bytes [Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_redirect) [HttpTunnel::producer_handler] enable_redirection: [1 0 0] event: 102 [Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http) [3] [&HttpSM::tunnel_handler_cache_read, VC_EVENT_READ_COMPLETE] [Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (cache_close) do_io_close 0x7f887f9c0000 -1 1 [Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (cache_free) free 0x7f887f9c0000 [Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_tunnel) [3] consumer_handler [user agent VC_EVENT_WRITE_READY]

Now we simply modify agg_copy from if (vc->f.update && vc->total_len) { CacheHTTPInfo *http_info = vc->write_vector->get(vc->alternate_index); http_info->object_size_set(vc->total_len); } to if (vc->f.update) { CacheHTTPInfo *http_info = vc->write_vector->get(vc->alternate_index); http_info->object_size_set(vc->total_len); } which keep object size as received one to make it work but don't know if already patched and why this stuck happen. pls refer to attached log for more info.

wjnforever avatar Jun 24 '18 14:06 wjnforever

hang.txt

wjnforever avatar Jun 24 '18 14:06 wjnforever

We should verify this is not an issue with the upcoming release (9.0.0).

bryancall avatar May 30 '19 20:05 bryancall

This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.

github-actions[bot] avatar Nov 09 '21 01:11 github-actions[bot]

FYI, this seems very similar, if not identical, to the issue described in #8779.

elsloo avatar Apr 22 '22 21:04 elsloo