cloudflared
🐛 QUIC has slower response times than the http2 protocol (issue confirmed with Cloudflare support)
Describe the bug
quic is the default protocol for cloudflared, but it has performance issues, as observed in our perf tests and confirmed by Cloudflare support. The purpose of this ticket is to raise the issue in the open and give the community a way to keep track of it. It cost us a week of communication with support before the cause was identified as a quic vs http2 issue. We had to switch to http2.
For anyone internal triaging this, the ticket is at https://support.cloudflare.com/hc/en-us/requests/2690173
To Reproduce
Steps to reproduce the behavior:
- We are using cloudflare/cloudflared:2023.2.1-arm64
- Create a k8s cluster -- we are testing on GKE.
- Follow the steps at https://developers.cloudflare.com/cloudflare-one/tutorials/many-cfd-one-tunnel/ except that we are using the dashboard to create the mapping. Therefore, we use a TUNNEL_TOKEN.
- For our perf tests, we set up a deployment using the httpbin docker image.
- We ran a series of tests with k6.io against the stream/100 endpoint of the httpbin deployment, using both http2 and quic.
- We noticed that overall http2 was much faster in terms of p95 response times. I've attached the graphs.
[Image: Screenshot 2023-01-30 at 9 44 46 AM]
[Image: Screenshot 2023-01-30 at 9 44 50 AM]
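The p95 comparison described above can be reproduced from raw per-request timings. The following is a minimal sketch, not the script used in these tests: it assumes the request durations for each protocol have already been collected as lists of milliseconds, and the function names `p95` and `p95_ratio` are made up for illustration. It uses the nearest-rank definition of a percentile, which may differ slightly from the figure a load-testing tool reports.

```python
def p95(samples):
    """Return the 95th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest 1-based rank that covers at least 95% of
    # the samples. Integer arithmetic avoids float rounding in the ceiling.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_ratio(quic_ms, http2_ms):
    """Return how many times larger the quic p95 is than the http2 p95."""
    return p95(quic_ms) / p95(http2_ms)
```

A ratio above 1 means quic had the slower p95 for that run.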
Additional context
Our Tunnels engineering manager looked at the escalation ticket and confirmed that using QUIC will be slower than using http/2, as you also noted.
In the case of QUIC vs HTTP2, it is important to note that QUIC transport operates at user space, so it is normal to be more demanding in terms of CPU.
So, if you compare both with the same restrictions, then it is expected for performance to be slower.
Our EM is going to do some benchmark testing next quarter with an eye towards seeing if there are QUIC improvements that can be made.
For now, we are going to set TUNNEL_TRANSPORT_PROTOCOL to http2.
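For anyone who wants to apply the same workaround from a wrapper script, here is a minimal sketch. It assumes cloudflared is on the PATH and that the tunnel token is already available in the TUNNEL_TOKEN environment variable. The helper names `http2_tunnel_env` and `run_tunnel` are made up for illustration, and the exact command line may need adjusting for your setup.

```python
import os
import subprocess


def http2_tunnel_env(base_env=None):
    """Return a copy of the environment with the tunnel transport set to http2."""
    env = dict(os.environ if base_env is None else base_env)
    env["TUNNEL_TRANSPORT_PROTOCOL"] = "http2"
    return env


def run_tunnel():
    """Start cloudflared with http2 transport (assumes TUNNEL_TOKEN is set)."""
    # Assumption: cloudflared picks up the token from TUNNEL_TOKEN.
    return subprocess.run(
        ["cloudflared", "tunnel", "run"],
        env=http2_tunnel_env(),
        check=True,
    )
```

In a Kubernetes deployment the same effect would come from setting the variable in the container's environment rather than from a wrapper like this.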
I just ran into this as well.
However, in my case, switching to http2 is not ideal, because I'm using unix domain sockets (service: unix:/run/foo.sock) for some upstream origins, and websockets apparently don't work with the combination of http2 + unix domain sockets (https://github.com/cloudflare/cloudflared/issues/560#issuecomment-1030079345). Based on the linked issue, they only work through cloudflared over the quic protocol.
I can confirm that switching the protocol from quic to http2 makes the connection a lot faster. However, after running for a few hours, requests stop being served:
2023-03-10T16:43:40Z ERR error="stream 309 canceled with error code 0" cfRay=7a5db807bfa7250e-LHR originService=https://nginx-proxy-manager
2023-03-10T16:43:40Z ERR Request failed error="stream 309 canceled with error code 0" connIndex=0 dest=https://example.com ip=123.123.123.123 type=http
2023-03-10T16:43:40Z ERR error="stream 293 canceled with error code 0" cfRay=7a5db7fe28ea250e-LHR originService=https://nginx-proxy-manager
2023-03-10T16:43:40Z ERR Request failed error="stream 293 canceled with error code 0" connIndex=0 dest=https://example.com ip=123.123.123.123 type=http
I ran some throughput tests of my own, since there hasn't been any activity on this issue.
Download of a 1GB test file through cloudflared. I tested from two separate hosts.
Host 1:
- no tunnel: 149 MB/s
- protocol: http2: 152 MB/s
- protocol: quic: 17 MB/s
Host 2:
- no tunnel: 89 MB/s
- protocol: http2: 94 MB/s
- protocol: quic: 17 MB/s
CPU usage didn't seem to be an issue, so I'm hoping someone can look into this further.
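To put the throughput figures above in proportion, the sketch below expresses each protocol's rate as a share of the no-tunnel baseline on the same host. The numbers are copied from the results above. The `RESULTS` layout and the `relative_throughput` helper are just one illustrative way to arrange them. By this measure quic reaches roughly 11% of the baseline on Host 1 and 19% on Host 2, while http2 is on par with the baseline.

```python
# Throughput figures in MB/s, copied from the two test hosts above.
RESULTS = {
    "Host 1": {"no tunnel": 149, "http2": 152, "quic": 17},
    "Host 2": {"no tunnel": 89, "http2": 94, "quic": 17},
}


def relative_throughput(results):
    """Express each protocol's throughput as a fraction of the no-tunnel baseline."""
    out = {}
    for host, rates in results.items():
        baseline = rates["no tunnel"]
        out[host] = {
            name: rate / baseline
            for name, rate in rates.items()
            if name != "no tunnel"
        }
    return out


for host, ratios in relative_throughput(RESULTS).items():
    for protocol, ratio in ratios.items():
        print(f"{host} {protocol}: {ratio:.0%} of no-tunnel throughput")
```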
For anyone internal triaging this, the ticket is at https://support.cloudflare.com/hc/en-us/requests/2690173
This is a 404 for me. Is this a private support ticket? Are there any updates on it? I have also seen worse stress test results with QUIC compared to http2.
I can confirm this. I had this issue some years ago (and have used http2 since). On a new system I forgot about it and had slowness issues for months. After I changed to http2 yesterday, all is fine.