
High memory usage inside transport.newBufWriter (memory leak?)

Open wh75er opened this issue 2 years ago • 11 comments

Hello! We've run into an interesting case of memory growth inside our gRPC service (server side). The growth is tiny (about 10-13MB per 24h), but it's noticeable, and it continued for up to 6 days (definitely a problem).

How did it all start?

We updated gRPC to a new version (from 1.27.0 to 1.47.0) because of a new dependency in our repository. There were no changes at all in our service's code.

After a few days we noticed the memory growth. We tried to wait and watch, but the growth didn't stop.

We then tried updating gRPC to the latest version, but it didn't solve the problem. We've also collected some profiles to trace the cause.
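For context, the profiles were collected with Go's standard pprof tooling. A minimal sketch of how the heap endpoint can be exposed on a side port (assuming net/http/pprof; our actual wiring may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Heap profiles can then be fetched with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... start the gRPC server as usual ...
	select {}
}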

Interesting fact

Our repository also contains one more gRPC server (let's call it service B). It uses the same gRPC code as service A (the service we're having problems with). Service B has lower RPS than service A and shows no problems at all.

Technical info

  1. golang version: 1.20.1
  2. gRPC version: 1.47.0 / 1.54.0 (migrated from 1.27.0)
  3. gRPC service has only one entry (one service with one method)
  4. RPS: ~5K

Pprof data

We've collected multiple heap samples (25.04, 26.04, 27.04, 28.04, 02.05). The diff shows that the growth occurs inside the gRPC internals:

go tool pprof -drop_negative -inuse_space -diff_base=25.04/pprof-heap-25.04-2.txt 02.05/pprof-heap-02.05-1.txt

(pprof) top
Showing nodes accounting for 17.64MB, 8.21% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
    1.50MB   0.7%  5.65%     1.50MB   0.7%  runtime.malg
       1MB  0.47%  6.12%        1MB  0.47%  github.com/GoWebProd/psqr.NewQuantile
       1MB  0.47%  6.58%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
       1MB  0.47%  7.05%        1MB  0.47%  bytes.(*Buffer).grow
    0.50MB  0.23%  7.28%     0.50MB  0.23%  bufio.NewWriterSize
    0.50MB  0.23%  7.51%     0.50MB  0.23%  google.golang.org/grpc/metadata.Join
    0.50MB  0.23%  7.75%     0.50MB  0.23%  sync.(*Pool).pinSlow
(pprof) top -cum 25
Showing nodes accounting for 14.64MB, 6.81% of 214.84MB total
Dropped 2 nodes (cum <= 1.07MB)
Showing top 25 nodes out of 63
      flat  flat%   sum%        cum   cum%
   10.64MB  4.95%  4.95%    10.64MB  4.95%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).Serve.func3
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).handleRawConn
         0     0%  4.95%     8.56MB  3.98%  google.golang.org/grpc.(*Server).newHTTP2Transport
       1MB  0.47%  5.42%     8.56MB  3.98%  google.golang.org/grpc/internal/transport.NewServerTransport
         0     0%  5.42%     7.56MB  3.52%  google.golang.org/grpc/internal/transport.newFramer
         0     0%  5.42%     2.50MB  1.16%  src/service/api._ProcessRequest_Handler
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).handleStream
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0%  5.42%     2.50MB  1.16%  google.golang.org/grpc.(*Server).serveStreams.func1.2
         0     0%  5.42%     2.50MB  1.16%  main.(*Service).AuthUnaryInterceptor
    1.50MB   0.7%  6.12%     1.50MB   0.7%  runtime.malg
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc.func1
         0     0%  6.12%     1.50MB   0.7%  runtime.newproc1
         0     0%  6.12%     1.50MB   0.7%  runtime.systemstack
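As far as I understand, the transport.newBufWriter entry at the top of the diff is the per-connection write buffer: each new server transport allocates one of grpc's write buffer size (32KB by default), so in-use memory there would scale with the number of open connections. A rough sketch of how that size is controlled (illustrative value, not a recommendation):

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	// Each accepted connection allocates a write buffer of this size in
	// transport.newBufWriter; the default is 32KB, and 0 disables buffering
	// so every write goes straight to the wire.
	s := grpc.NewServer(grpc.WriteBufferSize(32 * 1024))
	// ... register services ...
	log.Fatal(s.Serve(lis))
}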

How to reproduce?

I've tried to reproduce it by creating a small load (about ~100 RPS) on the service, but memory didn't change. I think the problem only reproduces under high load.
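The load was generated with a simple client loop, roughly like the sketch below (the import path, client constructor, and request type are hypothetical placeholders for our generated code):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/service/api" // hypothetical path to the generated stubs
)

func main() {
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	client := pb.NewAPIClient(conn) // hypothetical constructor

	// ~100 RPS: one request every 10ms.
	for range time.Tick(10 * time.Millisecond) {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := client.ProcessRequest(ctx, &pb.ProcessRequestArgs{}) // hypothetical request type
		cancel()
		if err != nil {
			log.Printf("rpc error: %v", err)
		}
	}
}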

Related issues

I've checked various issues on this topic and found that there are indeed some known problems with the write buffer in the gRPC internals. Maybe this issue is somehow related to these:

https://github.com/grpc/grpc-go/issues/5751

https://github.com/grpc/grpc-go/issues/5757 https://github.com/grpc/grpc-go/issues/5758 https://github.com/grpc/grpc-go/issues/5759

I haven't found any issue reporting memory growth though.

Conclusion

Thank you for your time. I would really appreciate it if you could:

  1. suggest what could be checked in our code, in case we haven't closed some resources and the problem is on our side
  2. (or) confirm that there is a problem with the gRPC buffer writer and that the memory growth is related to it

I'll try to find out how to reproduce it under higher load and check why transport.newBufWriter grows in memory. I'll share my results here if I find anything interesting.

wh75er avatar May 02 '23 15:05 wh75er

It is not clear to me whether you are seeing an increase in memory utilization of about 10-15MB between your server application running gRPC version 1.27 and the one running gRPC version 1.47, or whether the memory utilization of your server application running gRPC version 1.47 is continuously growing.

If the answer is the former, I'm afraid we can't help much. 1.27.0 was released more than 3 years ago and so much has changed since then. The bufWriter code hasn't changed much, though.

If your answer is the latter, i.e. memory utilization is constantly growing, then we would like to hear more from you.

easwars avatar May 02 '23 19:05 easwars

Hello, @easwars! Sorry for the inconvenience. The memory utilization is continuously growing on our server application running gRPC version 1.47 (and later we tried 1.54).

Our memory problem began after migrating to the new gRPC version 1.47.0. We then tried updating to the latest version, 1.54.0 (in case there were some fixes), but it didn't help. So the timeline looks something like this:

1.27.0 (memory utilization is fine) -(update)-> 1.47.0 (memory utilization is constantly growing) -(update)-> 1.54.0 (memory utilization is constantly growing)

My point was that we have 2 different gRPC server applications in the same repository. Each of them uses the same vendored gRPC package. On one server application we can observe the growth in memory utilization, while on the other gRPC server application memory usage is fine (the memory utilization is not growing).

This detail makes me think that either there is an error on our side in the first gRPC server application (probably not the case, because we only updated the gRPC package and didn't modify any code), or the RPS difference between the two servers causes the issue to show up in only one place (I'm not sure about this either, because the RPS difference is not that big, less than ~1K). Maybe one application simply uses the bufWriter more often than the other.

wh75er avatar May 03 '23 09:05 wh75er

Can you get a pprof dump that we can analyze?

Also, does the memory utilization come down in the service with high QPS when the QPS comes down? If so, then it could indicate that we are not leaking memory.

easwars avatar May 03 '23 22:05 easwars

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar May 09 '23 22:05 github-actions[bot]

Sorry for long reply!

does the memory utilization come down in the service with high QPS when the QPS comes down

No, it doesn't. It looks like memory grows with QPS, but when QPS goes down, memory utilization stays at the level it reached during the last QPS peak. When QPS starts to grow again, memory utilization continues to grow and then settles at the new level.
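For what it's worth, one way to tell whether this is just the Go runtime retaining pages from a peak or live objects actually accumulating is to log a few runtime.MemStats fields over time; a rough sketch:

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapInuse climbing over time points at live objects accumulating;
		// HeapIdle staying large while HeapInuse is flat points at the runtime
		// simply holding on to pages from an earlier peak.
		log.Printf("HeapInuse=%dMB HeapIdle=%dMB HeapReleased=%dMB Sys=%dMB",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
		time.Sleep(time.Minute)
	}
}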

Can you get a pprof dump that we can analyze?

Sure, I've sent 2 dumps to your email so you can check the diff. You can find them under the subject high memory usage #6242.

wh75er avatar May 10 '23 10:05 wh75er

Also forgot to tag you, @easwars

Doing it here

wh75er avatar May 10 '23 10:05 wh75er

Thanks @wh75er. We will take a look soon.

easwars avatar May 10 '23 16:05 easwars

Hello! Are there any updates on the issue?

wh75er avatar May 18 '23 11:05 wh75er

Really sorry. I haven't had cycles to look into this.

@zasweq @arvindbr8 @dfawley Would one of you be able to look into this? Thanks.

easwars avatar May 18 '23 18:05 easwars

There are a lot of moving pieces in a system like this, so this will be hard to help with from here. Some things to consider:

  • Do you track the number of open connections? Is the memory scaling consistently with open connections?
  • Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters
  • Does your service also have outgoing gRPC connections (is it a gRPC client as well)?
  • What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?
  • If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

dfawley avatar May 18 '23 22:05 dfawley

Do you track the number of open connections? Is the memory scaling consistently with open connections?

No, I don't. There is no data on how open connections affect memory scaling. I'll try to figure it out.
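One option for getting that data (a sketch only, assuming the channelz service fits our setup) would be to register channelz on the server and count live server sockets with an external tool such as grpcdebug:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	channelzsvc "google.golang.org/grpc/channelz/service"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	s := grpc.NewServer()
	// Expose channelz so the number of live server sockets (open connections)
	// can be inspected from outside, e.g. with grpcdebug.
	channelzsvc.RegisterChannelzServiceToServer(s)
	// ... register application services ...
	log.Fatal(s.Serve(lis))
}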

Do you have max connection age or max connection idle enabled? https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters

Yes, I have. But there were no problems with that, and this part has not been modified:

MaxConnectionIdle:     time.Minute,
MaxConnectionAgeGrace: 5 * time.Second,
Time:                  5 * time.Second,
Timeout:               1 * time.Second,
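For completeness, these parameters are wired into the server roughly like this (a sketch with the values above; other server options omitted):

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     time.Minute,
		MaxConnectionAgeGrace: 5 * time.Second,
		Time:                  5 * time.Second,
		Timeout:               1 * time.Second,
	}))
	// ... register services ...
	log.Fatal(s.Serve(lis))
}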

Does your service also have outgoing gRPC connections (is it a gRPC client as well)?

No, it doesn't

What is the traffic like? Are all calls unary or are there streams too? Do clients connect and disconnect frequently?

The traffic is not uniform. There is a prime time when RPS is high; after that it decreases.

The application uses only unary calls. There is one interceptor with a relevant feature: headers are attached to responses. I had an idea that this somehow affects the gRPC writer, but I haven't managed to reproduce it myself.
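A minimal sketch of the kind of interceptor described (the header key and value are hypothetical placeholders):

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// headerUnaryInterceptor attaches response headers to every unary call,
// similar to what the interceptor in our service does.
func headerUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	// grpc.SetHeader buffers headers that are sent with the first response.
	if err := grpc.SetHeader(ctx, metadata.Pairs("x-example-header", "value")); err != nil {
		return nil, err
	}
	return handler(ctx, req)
}

func main() {
	// Usage sketch: install the interceptor on the server.
	_ = grpc.NewServer(grpc.ChainUnaryInterceptor(headerUnaryInterceptor))
}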

If you can reliably reproduce the behavior, but can't debug the code to determine the cause, it would be interesting to determine when the problem appeared (what release, or even better, what commit).

No, I cannot reproduce it at all. The problem seems to have appeared with the commit that upgraded gRPC.

wh75er avatar May 19 '23 19:05 wh75er

At this point, unfortunately without a good way to repro this problem, there is very little that we can help with. We will leave this open for a little bit longer before closing. If you find a good way for us to repro this problem we will gladly take a deeper look. Thank you for your patience.

easwars avatar Jul 26 '23 23:07 easwars

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions[bot] avatar Aug 02 '23 01:08 github-actions[bot]

Any update on this issue?

shantanubansal avatar Feb 29 '24 09:02 shantanubansal

This issue was closed 4 months ago. If you need to clarify something, please consider opening a new issue instead of leaving a comment here, and reference this issue in the new one.

arvindbr8 avatar Feb 29 '24 23:02 arvindbr8