
Memory leaks

Open KGOH opened this issue 1 year ago • 17 comments

I have a minimal config, and I constantly observe leaking memory. I used version 2.5.0 and later upgraded to 2.7, but this didn't help.

{$GRAFANA_DOMAIN} {
  reverse_proxy grafana:3000
}

{$COLLECTOR_DOMAIN} {
  reverse_proxy myapp:8080
}

[screenshot]

Caddy is deployed via docker compose, here's a snippet:

  caddy:
    restart: unless-stopped
    image: caddy:2.7
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./ssl:/myapp/ssl
      - caddy_data:/data
    ports:
      - 80:80
      - 443:443
    networks:
      default:
        aliases:
          - ${GRAFANA_DOMAIN}
          - ${COLLECTOR_DOMAIN}
    environment:
      GRAFANA_DOMAIN:   ${GRAFANA_DOMAIN}
      COLLECTOR_DOMAIN: ${COLLECTOR_DOMAIN}

KGOH avatar May 17 '24 14:05 KGOH

Is it possible that the backend isn't closing connections? What is myapp:8080 serving? How can we reproduce it?

If you could please collect a memory and goroutine profile that would help answer some questions: https://caddyserver.com/docs/profiling
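For reference, a rough sketch of how the profiles can be pulled from Caddy's admin API, assuming it is reachable on localhost:2019 (with Docker you may need to exec into the container or publish the admin port first):

# fetch a goroutine dump (text) and a heap profile from the admin endpoint
curl -o goroutines.txt 'http://localhost:2019/debug/pprof/goroutine?debug=1'
curl -o heap.pprof 'http://localhost:2019/debug/pprof/heap'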

Thank you

mholt avatar May 17 '24 15:05 mholt

@mholt myapp is a REST API server. Currently I'm not sure how exactly you can reproduce it without me sharing the whole app (which I can't do)

Here are the memory and goroutine profiles: goroutines.txt heap_debug=1.gz

Here's an example memory graph from the machine I took these profiles on: [screenshot]

KGOH avatar May 20 '24 13:05 KGOH

Thanks -- interesting... what kind of traffic is your server getting?

The top heaps and goroutines are simply growing buffers to read from TLS connections, and serving HTTP/2, respectively.

Goroutines:

694 @ 0x43e44e 0x44e905 0xb234cf 0xb217cc 0xb20bbc 0x769b93 0x471901
#	0xb234ce	golang.org/x/net/http2.(*serverConn).serve+0x88e	golang.org/x/[email protected]/http2/server.go:940
#	0xb217cb	golang.org/x/net/http2.(*Server).ServeConn+0xbcb	golang.org/x/[email protected]/http2/server.go:531
#	0xb20bbb	golang.org/x/net/http2.ConfigureServer.func1+0xfb	golang.org/x/[email protected]/http2/server.go:321
#	0x769b92	net/http.(*conn).serve+0x1212				net/http/server.go:1917

690 @ 0x43e44e 0x436c97 0x46bce5 0x4e11c7 0x4e24ba 0x4e24a8 0x545825 0x558ea5 0x69841b 0x50dc38 0x6985fe 0x695bd0 0x69be98 0x69bea1 0x4b8350 0xb16fe5 0xb16fb4 0xb17725 0xb22667 0x471901
#	0x46bce4	internal/poll.runtime_pollWait+0x84			runtime/netpoll.go:343
#	0x4e11c6	internal/poll.(*pollDesc).wait+0x26			internal/poll/fd_poll_runtime.go:84
#	0x4e24b9	internal/poll.(*pollDesc).waitRead+0x279		internal/poll/fd_poll_runtime.go:89
#	0x4e24a7	internal/poll.(*FD).Read+0x267				internal/poll/fd_unix.go:164
#	0x545824	net.(*netFD).Read+0x24					net/fd_posix.go:55
#	0x558ea4	net.(*conn).Read+0x44					net/net.go:179
#	0x69841a	crypto/tls.(*atLeastReader).Read+0x3a			crypto/tls/conn.go:805
#	0x50dc37	bytes.(*Buffer).ReadFrom+0x97				bytes/buffer.go:211
#	0x6985fd	crypto/tls.(*Conn).readFromUntil+0xdd			crypto/tls/conn.go:827
#	0x695bcf	crypto/tls.(*Conn).readRecordOrCCS+0x24f		crypto/tls/conn.go:625
#	0x69be97	crypto/tls.(*Conn).readRecord+0x157			crypto/tls/conn.go:587
#	0x69bea0	crypto/tls.(*Conn).Read+0x160				crypto/tls/conn.go:1369
#	0x4b834f	io.ReadAtLeast+0x8f					io/io.go:335
#	0xb16fe4	io.ReadFull+0x64					io/io.go:354
#	0xb16fb3	golang.org/x/net/http2.readFrameHeader+0x33		golang.org/x/[email protected]/http2/frame.go:237
#	0xb17724	golang.org/x/net/http2.(*Framer).ReadFrame+0x84		golang.org/x/[email protected]/http2/frame.go:498
#	0xb22666	golang.org/x/net/http2.(*serverConn).readFrames+0x86	golang.org/x/[email protected]/http2/server.go:820

Heaps:

261: 10690560 [3380: 138444800] @ 0x50ddce 0x50d7fd 0x698594 0x698557 0x6964ff 0x69be98 0x69bea1 0x76322b 0x56d763 0x4b89a3 0x7773bb 0x777298 0x4b89a3 0x4b8806 0x55a386 0x55a343 0x5611ab 0x55fcf0 0x4b87a7 0x781d55 0x781d0f 0x56f74b 0x4b87a7 0x7759a8 0x77598f 0x775525 0x75ccad 0x7849ef 0x471901
#	0x50ddcd	bytes.growSlice+0x8d				bytes/buffer.go:249
#	0x50d7fc	bytes.(*Buffer).grow+0x13c			bytes/buffer.go:151
#	0x698593	bytes.(*Buffer).Grow+0x73			bytes/buffer.go:168
#	0x698556	crypto/tls.(*Conn).readFromUntil+0x36		crypto/tls/conn.go:826
#	0x6964fe	crypto/tls.(*Conn).readRecordOrCCS+0xb7e	crypto/tls/conn.go:676
#	0x69be97	crypto/tls.(*Conn).readRecord+0x157		crypto/tls/conn.go:587
#	0x69bea0	crypto/tls.(*Conn).Read+0x160			crypto/tls/conn.go:1369
#	0x76322a	net/http.(*connReader).Read+0x14a		net/http/server.go:791
#	0x56d762	bufio.(*Reader).Read+0xe2			bufio/bufio.go:230
#	0x4b89a2	io.(*LimitedReader).Read+0x42			io/io.go:480
#	0x7773ba	net/http.(*body).readLocked+0x3a		net/http/transfer.go:839
#	0x777297	net/http.(*body).Read+0x117			net/http/transfer.go:831
#	0x4b89a2	io.(*LimitedReader).Read+0x42			io/io.go:480
#	0x4b8805	io.copyBuffer+0x1a5				io/io.go:430
#	0x55a385	io.Copy+0x65					io/io.go:389
#	0x55a342	net.genericReadFrom+0x22			net/net.go:671
#	0x5611aa	net.(*TCPConn).readFrom+0x6a			net/tcpsock_posix.go:54
#	0x55fcef	net.(*TCPConn).ReadFrom+0x2f			net/tcpsock.go:130
#	0x4b87a6	io.copyBuffer+0x146				io/io.go:416
#	0x781d54	io.Copy+0x54					io/io.go:389
#	0x781d0e	net/http.persistConnWriter.ReadFrom+0xe		net/http/transport.go:1801
#	0x56f74a	bufio.(*Writer).ReadFrom+0x18a			bufio/bufio.go:797
#	0x4b87a6	io.copyBuffer+0x146				io/io.go:416
#	0x7759a7	io.Copy+0x47					io/io.go:389
#	0x77598e	net/http.(*transferWriter).doBodyCopy+0x2e	net/http/transfer.go:412
#	0x775524	net/http.(*transferWriter).writeBody+0x3c4	net/http/transfer.go:370
#	0x75ccac	net/http.(*Request).write+0xbac			net/http/request.go:738
#	0x7849ee	net/http.(*persistConn).writeLoop+0x18e		net/http/transport.go:2424

14: 573440 [7879: 322723840] @ 0x50ddce 0x50d7fd 0x698594 0x698557 0x6964ff 0x69be98 0x69bea1 0x4b8350 0xb16fe5 0xb16fb4 0xb17725 0xb22667 0x471901
#	0x50ddcd	bytes.growSlice+0x8d					bytes/buffer.go:249
#	0x50d7fc	bytes.(*Buffer).grow+0x13c				bytes/buffer.go:151
#	0x698593	bytes.(*Buffer).Grow+0x73				bytes/buffer.go:168
#	0x698556	crypto/tls.(*Conn).readFromUntil+0x36			crypto/tls/conn.go:826
#	0x6964fe	crypto/tls.(*Conn).readRecordOrCCS+0xb7e		crypto/tls/conn.go:676
#	0x69be97	crypto/tls.(*Conn).readRecord+0x157			crypto/tls/conn.go:587
#	0x69bea0	crypto/tls.(*Conn).Read+0x160				crypto/tls/conn.go:1369
#	0x4b834f	io.ReadAtLeast+0x8f					io/io.go:335
#	0xb16fe4	io.ReadFull+0x64					io/io.go:354
#	0xb16fb3	golang.org/x/net/http2.readFrameHeader+0x33		golang.org/x/[email protected]/http2/frame.go:237
#	0xb17724	golang.org/x/net/http2.(*Framer).ReadFrame+0x84		golang.org/x/[email protected]/http2/frame.go:498
#	0xb22666	golang.org/x/net/http2.(*serverConn).readFrames+0x86	golang.org/x/[email protected]/http2/server.go:820
...

There's no memory leak here... just normal traffic. Not even a lot of traffic.

What does top or htop on the machine show?

I'd also be interested if you can reproduce this without the API backend.

mholt avatar May 20 '24 15:05 mholt

@mholt can this imply that my app doesn't close connections?

I'll get back to you with htop later

KGOH avatar May 20 '24 16:05 KGOH

Quite possibly. It's a bit unfortunate that the graphs only show a single process. That's why I'm curious what the output of top is.

mholt avatar May 20 '24 16:05 mholt

@mholt, hello. The app is a Java Jetty server with an idle timeout of 200000 ms (about 3.3 minutes) (doc).

Here's a graph of the Java app, Grafana, and Caddy processes for the last month. As you can see, there was a restart after a certain memory threshold was reached. The Java app doesn't seem to grow in memory constantly. It receives mobile traffic, which can probably drop off due to unstable internet connections. [screenshot]

As I can see in the Caddy docs, Caddy's default request idle timeout is 5 minutes (https://caddyserver.com/docs/caddyfile/options#timeouts). This doesn't explain why we see the growing memory consumption.

KGOH avatar May 22 '24 10:05 KGOH

What are your clients / traffic patterns like?

Try setting some timeouts: https://caddyserver.com/docs/caddyfile/options#timeouts

The majority of the (few) goroutines that show up in the profiles are waiting for reads.

mholt avatar May 22 '24 15:05 mholt

Timeouts didn't help (the blue line that ends abruptly is Caddy before I added timeouts, the red line is Caddy after I restarted it with the timeouts config added): [screenshot]

I added the following lines to Caddyfile

{
  servers {
    timeouts {
      read_body   5m
      read_header 5m
      write       5m
      idle        5m
    }
  }
}

I'm not sure how to describe the traffic pattern. In short, there are a lot of mobile connections, and I guess they can drop off without properly closing connections. At the same time I observe a large number of these exceptions, ~15000 in 14 days: org.eclipse.jetty.io.EofException: Early EOF [screenshot]

KGOH avatar Jun 07 '24 10:06 KGOH

I'm not sure whether this can be related. On my memory usage graphs you can also notice that the Java memory sometimes grows somewhat similarly (not here, though!).

Right now I'm also observing a memory leak within Java: [screenshot] For context, here's another issue I created in the Jetty repo: https://github.com/jetty/jetty.project/issues/11858

KGOH avatar Jun 07 '24 10:06 KGOH

Interesting... one other thought I have is, what if you disable HTTP/3?

https://caddyserver.com/docs/caddyfile/options#protocols
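As a minimal sketch, a global options block that only serves HTTP/1.1 and HTTP/2 (leaving HTTP/3 out) would look something like this:

{
  servers {
    # omit h3 from the list to disable HTTP/3
    protocols h1 h2
  }
}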

mholt avatar Jun 07 '24 12:06 mholt

Has anyone been able to replicate this problem? Does it happen with 2.8.4?

hackerfactor avatar Jun 11 '24 14:06 hackerfactor

I also just had server memory swapping detected due to Caddy after upgrading to 2.8.4. It seems the issue is still there (and on my side, I only observe it after upgrading from 2.7.6 to 2.8.4). I don't know if it's related, but I also use the Coraza WAF module (v2.0.0-rc.3).

zguig52 avatar Jul 31 '24 07:07 zguig52

I don't know if it's related, but I also use the Coraza WAF module (v2.0.0-rc.3)

Very much could be related. We can't really take builds with plugins as any kind of evidence because the problem could be in the plugin.

francislavoie avatar Jul 31 '24 08:07 francislavoie

@zguig52 A profile would help if you could get one. https://caddyserver.com/docs/profiling

mholt avatar Aug 02 '24 13:08 mholt

Since my Java app's memory was also growing, I replaced Caddy with nginx to test whether the issue is in Caddy or in Java. I've been observing the memory consumption for a week now, and I see that the Java app's memory growth rate has decreased. Interesting.

[screenshots]

KGOH avatar Aug 07 '24 11:08 KGOH

Caddy has a default keepalive config for upstream connections, while nginx doesn't. Can you try setting the keepalive* options in reverse_proxy (docs) to something suitable for your app's workload and monitor the memory consumption? You did mention that your upstream app doesn't close connections.
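As a sketch (the values here are only placeholders to illustrate the knobs, not recommendations):

{$COLLECTOR_DOMAIN} {
  reverse_proxy myapp:8080 {
    transport http {
      # placeholder values; tune for your workload, or use "keepalive off"
      # to disable connection reuse to the upstream entirely
      keepalive 90s
      keepalive_idle_conns 32
      keepalive_idle_conns_per_host 16
    }
  }
}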

mohammed90 avatar Aug 07 '24 11:08 mohammed90

Jetty follows the HTTP/1.0 (RFC1945) and HTTP/1.1 (RFC9112) specs.

If the client says HTTP/1.0 in its request header ...

  • then the Connection default is close after the exchange is complete.
  • the client can override the default by using the Connection: keep-alive request header.
  • using Connection: close with an HTTP/1.0 request is a no-op, as the value close is not recognized by the HTTP/1.0 spec.

If the client says HTTP/1.1 in its request header ...

  • then the Connection default is persistent and the connection will not close after the exchange is complete.
  • the client can override the default by using the Connection: close request header.
  • using Connection: keep-alive with an HTTP/1.1 request is a no-op, as the value keep-alive is not recognized by the HTTP/1.1 spec.

If you want to see a more comprehensive set of tests related to the Connection header in Jetty (not limited to just close and keep-alive values), see the testcases.

https://github.com/jetty/jetty.project/blob/661546ecffbd2f2d8f1cf7b5ebebec8e4f06549e/jetty-core/jetty-server/src/test/java/org/eclipse/jetty/server/HttpConnectionTest.java#L1593-L1689

That shows the 4 pieces of information Jetty uses to determine connection behavior: the user-agent's request HTTP spec version, the user-agent's request Connection header value, the application-specified response Connection header value, and the response Connection header value as seen by the user-agent.

joakime avatar Aug 07 '24 13:08 joakime