HTTP server occasionally experiences high latency
1. Environment Information
Tornado version: TornadoServer/6.1
Python version: 3.7
Docker version: 20.10.7
My Tornado web server runs in Kubernetes, so traffic is forwarded to it by Envoy. Clients use HTTP/1.1 by default, so HTTP keep-alive is used between Envoy and the Tornado pod (Envoy ---> Tornado pod).
The Tornado startup code is roughly as follows (the Tornado pod has a CPU limit of 4):
from tornado import httpserver, ioloop
from tornado.web import Application

# conf, settings, TestHandlerBase, _initialization and skywalking_agent
# are application-specific and omitted here.

if __name__ == "__main__":
    if conf.IS_BACKUP:
        application = Application([
            (r"/(\w+)", TestHandlerBase)], **settings)
    else:
        application = Application([
            (r"/(\w+)", TestHandlerBase)])
    num_processes = conf.TORNADO_PROCESS_NUM
    _initialization(num_processes)
    server = httpserver.HTTPServer(application)
    server.bind(9880, '0.0.0.0', backlog=2048)
    server.start(num_processes)
    skywalking_agent()
    print('start service...')
    ioloop.IOLoop.instance().start()
2. Problems encountered
When I stress tested the Tornado endpoint, response latency is normally within 100ms, but after about 1 minute of load it rose to more than 300ms, which is intolerable for our business.
We used tcpdump to capture packets and inspect the traffic, and found the following (172.29.222.1 is the Envoy IP, 172.29.86.34 is the Tornado pod IP):
Packet 20498: Envoy forwards the request to the Tornado pod (time: 11:17:57.3055)
Packet 20499: the Tornado pod responds with an ACK (time: 11:17:57.3056)
Packet 22235: the Tornado pod responds with the HTTP data (time: 11:17:57.799)
The entire HTTP response took about 494ms.
The request reached the Tornado pod at 11:17:57.3056, but the Tornado access log shows that Tornado only started processing the request at 11:17:57.751 and finished at 11:17:57.799, when the HTTP response was sent.
We don't know where that time is being spent. Is there a tool we can use to observe where it goes?
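The only per-request timing we currently see comes from the access log. A rough sketch of what we could add to a handler to separate queueing delay from handler time (TimedHandler and this log format are hypothetical, not our production code):

import logging
import time

from tornado import web

access_log = logging.getLogger("tornado.access")

class TimedHandler(web.RequestHandler):
    # Hypothetical base class: records how long a request spends in the
    # handler vs. the total time Tornado has known about the request.
    def prepare(self):
        self._handler_start = time.monotonic()

    def on_finish(self):
        handler_ms = (time.monotonic() - self._handler_start) * 1000
        # request_time() counts from when Tornado created the request object,
        # so total_ms - handler_ms approximates time spent queued on the IOLoop.
        total_ms = self.request.request_time() * 1000
        access_log.info("%s %s handler=%.1fms total=%.1fms",
                        self.request.method, self.request.uri,
                        handler_ms, total_ms)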
I hope to get your reply. This problem has troubled us for a long time. Thank you~
There are a few likely possibilities here:
- Some long-running operation is blocking the event loop thread. Use asyncio debug mode to investigate this possibility (see the sketch after this list).
- Poor load balancing. I've seen significant imbalances between the processes when running in multi-process mode, so one process can get overloaded while the others are fine. Since you're already running envoy as a proxy here, I wouldn't recommend using tornado's multi-process mode. Instead, run multiple independent copies of the tornado process, each listening on a separate port, and configure envoy to balance the load between them (see https://www.tornadoweb.org/en/stable/guide/running.html, although it covers nginx instead of envoy). This is generally more reliable than just letting all the processes share a port. (I've heard that SO_REUSEPORT can also help load balancing here but I haven't used it myself)
- The system is simply overloaded. You didn't share how you're running the load test but if you give the system more load than it can handle, latency will go up. And in asyncio-based systems, this often manifests as a few unlucky requests experiencing significant delays while others are fine (in contrast to a more gradual decline in performance often seen in multithreaded solutions).
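For the first point, a minimal sketch of turning on asyncio debug mode in a single-process setup (running with PYTHONASYNCIODEBUG=1 in the environment has the same effect as set_debug):

import asyncio
import logging

from tornado import ioloop

logging.basicConfig(level=logging.DEBUG)  # make asyncio's debug/warning messages visible

loop = asyncio.get_event_loop()
loop.set_debug(True)  # equivalent to running with PYTHONASYNCIODEBUG=1

# ... build the Application / HTTPServer as usual, then:
ioloop.IOLoop.current().start()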
@bdarnell First of all, thank you very much for your reply~
I debugged according to the three points above.
1. Enabled asyncio debug mode with logging.basicConfig(level=logging.DEBUG) and by setting the PYTHONASYNCIODEBUG environment variable to 1.
I used Apache JMeter to stress test the Tornado interface, with 500 concurrent requests and a duration of 180 seconds.
During the stress test, I observed occasional callbacks in the DEBUG log that took more than 100 milliseconds.
2. Poor load balancing
I tried Tornado's xheaders=True, and under the same JMeter load as in the first point, performance was slower. I looked for the definition of the xheaders parameter in the source code and only found "These headers are useful when running Tornado behind a reverse proxy or load balancer".
I also noticed the DefaultHostMatches logic in tornado/routing.py, which seems to be used for routing decisions?
I don't know what Tornado actually does when xheaders=True. In my JMeter tests this parameter seemed to increase response times; maybe that is related to how I tested?
3. Tornado in single-process mode
Because traffic in the production environment is relatively high, we planned to go live with the multi-process, non-blocking startup mode.
I tried setting the number of CPU cores to 1 and scaling out the pod replicas accordingly. Under the same JMeter load, the response times were still not good...
4. Resource saturation
We monitor with Prometheus + Grafana, and during the stress test we observed that the CPU and memory usage of the pod were stable and did not reach their respective limits.
During the stress test, I observed occasional callbacks in the DEBUG log that took more than 100 milliseconds.
These messages that say "poll %.3f ms took %.3f ms" are harmless - they're reporting on internal state of the asyncio event loop (basically, how long the loop sat idle waiting for network activity). This message was removed in Python 3.8.
The message you're looking for is "Executing %s took %.3f seconds". If you see any of those, you have a problem with a callback that's running too long.
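If nothing shows up at the default threshold, you can lower it; a quick sketch (0.05 is just an example value, the default is 0.1 seconds):

import asyncio

loop = asyncio.get_event_loop()
loop.set_debug(True)
# Any callback or task step running on the loop longer than this many seconds
# is logged as "Executing <Handle ...> took N.NNN seconds".
loop.slow_callback_duration = 0.05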
I don't know what Tornado actually does when xheaders=True.
This mainly affects logging: With xheaders=True, logs will show the IP address from the X-Real-IP header instead of the IP from the TCP connection (which is the load balancer). There's a tiny amount of extra work involved when xheaders=True but I would be surprised if it were a measurable difference.
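For reference, xheaders is just a flag on HTTPServer; a small illustrative sketch of the difference it makes (the handler and route here are made up):

from tornado import httpserver, ioloop, web

class EchoIPHandler(web.RequestHandler):
    def get(self):
        # With xheaders=True, remote_ip is taken from X-Real-Ip /
        # X-Forwarded-For (set by envoy); without it, remote_ip is the
        # proxy's own address.
        self.write(self.request.remote_ip)

app = web.Application([(r"/ip", EchoIPHandler)])
server = httpserver.HTTPServer(app, xheaders=True)
server.listen(9880)
ioloop.IOLoop.current().start()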
I tried setting the number of CPU cores to 1 and scaling out the pod replicas accordingly. Under the same JMeter load, the response times were still not good...
Hmm, this sounds like the configuration I would recommend so it's worth investigating it further to see what's going on.
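For concreteness, the setup I'd suggest looks roughly like this: one single-process Tornado per pod (or per copy), each on its own port, with envoy balancing across them. Reading the port from the environment is just an illustration:

import os

from tornado import httpserver, ioloop
from tornado.web import Application

# TestHandlerBase is the handler from the snippet at the top of this issue.
def make_app():
    return Application([(r"/(\w+)", TestHandlerBase)])

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "9880"))  # one port per copy
    server = httpserver.HTTPServer(make_app())
    server.listen(port, "0.0.0.0")  # single process; no server.start(N)
    ioloop.IOLoop.current().start()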
during the stress test we observed that the CPU and memory usage of the pod were stable and did not reach their respective limits.
OK. The question you need to answer is whether the CPU usage is balanced across all the cores or if some cores are at 100% while others are less. Or, when you run one core per pod with more pods, whether the CPU usage is balanced across the pods.
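A quick way to eyeball this from inside the pod, assuming psutil is installed (mpstat -P ALL 1 from the sysstat package shows the same thing):

import psutil  # pip install psutil

# Print per-core CPU usage once per second during the load test; one core
# pinned near 100% while the others stay low points to poor load balancing.
while True:
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    print(" ".join("cpu%d=%5.1f%%" % (i, p) for i, p in enumerate(per_core)))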
Can you give a complete code example?
A complete code example for what?
During the stress test, I observed occasional callbacks in the DEBUG log that took more than 100 milliseconds. 
