Netbird behind Cloudflare: stream terminated by RST_STREAM

Open xiaolei0125 opened this issue 2 years ago • 4 comments

Describe the problem
I proxy my NetBird management and signal services through Cloudflare, but the NetBird client disconnects and reconnects every 100 seconds.

This is caused by the Cloudflare 524 timeout: https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-5xx-errors/#error-524-a-timeout-occurred

Error 524 indicates that Cloudflare successfully connected to the origin web server, but the origin did not provide an HTTP response before the default 100 second connection timed out.

Expected behavior
Add gRPC keepalive to the management and signal services so that they cope with more complex network environments.

NetBird status -d output:
Daemon version: 0.24.3
CLI version: 0.24.3
Management: Connected to https://nb.xxx.com:8443/
Signal: Connected to https://nb.xxx.com:8443/
FQDN: m1.nb.iot
NetBird IP: 100.64.0.3/16
Interface type: Kernel
Peers count: 3/4 Connected

Additional context
WARN management/client/grpc.go:158: disconnected from the Management service but will retry silently. Reason: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
WARN signal/client/grpc.go:151: disconnected from the Signal Exchange due to an error: rpc error: code = Unknown desc = unexpected HTTP status code received from server: 524 (); transport: received unexpected content-type "text/plain; charset=UTF-8"

Related issues: #771, #651

xiaolei0125 avatar Dec 06 '23 09:12 xiaolei0125

Any update?

xiaolei0125 avatar Mar 05 '24 10:03 xiaolei0125

I've got this issue too, but it's not really NetBird's problem. The cause is Cloudflare's fixed proxy timeout, which is 100s for non-enterprise customers and can be raised to at most 6000s for enterprise customers.

On the other hand, I'm guessing that implementing a keepalive on the management and signal services could improve the NetBird experience in normal (non-Cloudflare) operation too. Corporate proxies and routers have their own performance-oriented idle-connection timeouts, just like Cloudflare, so this could be an overall win.

horzadome avatar Apr 03 '24 07:04 horzadome

Any workaround for using this with Cloudflare Tunnel? I tried increasing the keep-alive time in the origin configuration, but I'm still getting the same error!

toasterlolz avatar May 13 '24 05:05 toasterlolz

+1 on this one. Cloudflare isn't the only proxy with this kind of inactivity timeout. For example, ingress-nginx also terminates inactive gRPC connections, and as of today it requires messing with the configuration-snippet to get NetBird to work properly. Implementing an app-level keepalive would definitely improve the user experience for more sophisticated deployments. Looking forward to seeing this implemented!

interna1error avatar May 17 '24 05:05 interna1error

> +1 on this one. Cloudflare isn't the only proxy with this kind of inactivity timeout. For example, ingress-nginx also terminates inactive gRPC connections, and as of today it requires messing with the configuration-snippet to get NetBird to work properly. Implementing an app-level keepalive would definitely improve the user experience for more sophisticated deployments. Looking forward to seeing this implemented!

Traefik also applies a default readTimeout of 60 seconds on its entrypoints. As I understand it, this is a security control to limit DDoS attacks. Disabling read timeouts is a security risk that each organization would have to accept in order to run NetBird behind a proxy. Furthermore, Cloudflare customers cannot disable read timeouts at all; the maximum available setting is 6000s, and only for enterprise customers.

Edit: For anyone interested, you can use Traefik's "reusePort" option on the entrypoint declarations to duplicate an entrypoint and then give each copy its own read timeout setting.

JerboaGobi avatar Jun 01 '24 15:06 JerboaGobi