netbird
Netbird behind Cloudflare: stream terminated by RST_STREAM
Describe the problem I use Cloudflare to proxy my NetBird management and signal services, but the NetBird client disconnects and reconnects every 100 seconds.
This is caused by the Cloudflare 524 timeout: https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-5xx-errors/#error-524-a-timeout-occurred
Error 524 indicates that Cloudflare successfully connected to the origin web server, but the origin did not provide an HTTP response before the default 100 second connection timed out.
Expected behavior Add gRPC keepalive for the management and signal services, so that NetBird can adapt to more complex network environments.
NetBird status -d output:
Daemon version: 0.24.3
CLI version: 0.24.3
Management: Connected to https://nb.xxx.com:8443/
Signal: Connected to https://nb.xxx.com:8443/
FQDN: m1.nb.iot
NetBird IP: 100.64.0.3/16
Interface type: Kernel
Peers count: 3/4 Connected
Additional context
WARN management/client/grpc.go:158: disconnected from the Management service but will retry silently. Reason: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
WARN signal/client/grpc.go:151: disconnected from the Signal Exchange due to an error: rpc error: code = Unknown desc = unexpected HTTP status code received from server: 524 (); transport: received unexpected content-type "text/plain; charset=UTF-8"
Related issue #771 #651
Any update?
I've got this issue too, but it's not really NetBird's problem. The cause is Cloudflare's proxy timeout, which is 100s for non-enterprise customers and can be raised up to 6000s for enterprise.
On the other hand, I'm guessing that implementing a keepalive on management and signal services could improve Netbird experience for normal operation too (non-Cloudflare). That's because corporate proxies and routers also have their own performance-oriented inactive connection timeouts just like Cloudflare, so this could be an overall performance win.
Any workaround for using this with Cloudflare Tunnel? I tried increasing the keepalive time in the origin configuration, but I am still getting the same error!
+1 on this one. It's not only Cloudflare that has this kind of inactivity timeout. For example, ingress-nginx also terminates inactive gRPC connections, and as of today it requires messing with the configuration-snippet to get NetBird to work properly. Implementing app-level keepalive would definitely improve the user experience for more sophisticated deployments. Looking forward to seeing this implemented!
Traefik also implements a default readTimeout setting of 60 seconds on its entrypoints. As I understand it, this is a security control to limit DDoS attacks. Disabling readTimeouts is a security risk that each organization would have to accept in order to run NetBird behind a proxy. Furthermore, Cloudflare customers cannot disable read timeouts; the maximum available setting is 6000s, and only for enterprise customers.
Edit: For anyone interested, you can use Traefik's "reusePort" option on the entry point declarations to duplicate the entry point and then set a read timeout that applies only to that entry point.