
Bizarre memory issue

Open AntonioGMuriana opened this issue 5 years ago • 2 comments

Hello,

We are having an issue with OpenPGM (called via ZeroMQ from a .NET app). The app normally runs at around 70 MB, but at some point the memory of the process starts to grow linearly until it fills up all the memory (<3GB).

We have profiled the application and the memory is filled with millions of instances of pgm_sk_buff_t structs of 1684 bytes each, allocated with this stack trace:

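[image: StackTrace] https://user-images.githubusercontent.com/16045093/54740441-b50ce700-4bbb-11e9-8c78-2c7c609522e0.png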

Any guidance for troubleshooting this?

AntonioGMuriana avatar Mar 21 '19 08:03 AntonioGMuriana

This would be one of two cases:

  1. Messages are being received but never returned to the memory pool.
  2. A packet is lost and not recovered; the receive window is configured larger than the host memory, so it fills up.

ZeroMQ uses the pgm_recvmsgv API (https://code.google.com/archive/p/openpgm/wikis/OpenPgm5CReferencePgmRecvMsgv.wiki), which returns SKB references from the receive window. That rules out #1 and implies #2.

So the question is: what is the receive window set to? Try reducing that value significantly. ZeroMQ is implemented for best effort, so when data is lost the window is discarded and the PGM connection resets:

https://github.com/zeromq/libzmq/blob/0d660674112a5c7a0602a8f9a4ece6488631d9cc/src/pgm_socket.cpp#L598


steve-o avatar Mar 21 '19 13:03 steve-o

Hello,

After further debugging, these are my current findings:

ZMQ_RATE = 1Gbps (ok, it's not realistic, I know)

ZMQ_RECOVERY_IVL = 10s (default value)

On the receiving side, there is a socket with a window with the following characteristics:

  window->alloc = 833333
  window->lead = 4824387
  window->commit_lead = 3991059
  window->trail = 3991055
  window->committed_count = 4
  window->rxw_trail = 4539359

window[commit_lead]->state = PGM_PKT_STATE_LOST_DATA

The lead cannot advance because the window is full, so all newly received data is dropped. The trail cannot advance because there is committed data that has not been delivered, and that data is never delivered: everything received is discarded, and commit_lead does not move because it is outside rxw_trail.
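The deadlock described above can be checked against the reported numbers. This is only a sketch: it assumes the window length is counted as lead - trail + 1 and ignores 32-bit sequence number wraparound; the variable names mirror the struct fields but the helper itself is hypothetical.

```python
# Values copied from the debugger output in this comment.
alloc = 833_333
lead = 4_824_387
commit_lead = 3_991_059
trail = 3_991_055
committed_count = 4
rxw_trail = 4_539_359


def window_length(lead: int, trail: int) -> int:
    """Number of sequence slots in use (assumption: inclusive range,
    no 32-bit wraparound handling)."""
    return lead - trail + 1


length = window_length(lead, trail)

# The window is full when every allocated slot is occupied.
is_full = length >= alloc

# Packets between trail and commit_lead are committed but not yet released.
committed = commit_lead - trail

# How far commit_lead lags behind the trail advertised by the sender.
lag_behind_sender_trail = rxw_trail - commit_lead

print(f"length={length} alloc={alloc} full={is_full}")
print(f"committed={committed} (reported committed_count={committed_count})")
print(f"commit_lead is {lag_behind_sender_trail} sequence numbers behind rxw_trail")
```

With these values the occupied length equals window->alloc, and the packet at commit_lead sits well behind rxw_trail, which is consistent with a lost packet that the sender can no longer repair.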

The trace log shows a lot of NCF retries and then a pair of "Locking trail at commit window" messages.

How is the window supposed to move in this situation?

Note: Memory is not exhausted (it just grows to the maximum receive buffer size of about 1.5 GB).
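For reference, the window size and its memory footprint can be estimated from the two socket options. This is a back-of-the-envelope sketch, assuming the window is sized as rate × recovery interval divided by a 1500-byte TPDU (the TPDU value is an assumption, not something reported in this thread), with ZMQ_RATE in kilobits per second and ZMQ_RECOVERY_IVL in milliseconds:

```python
# Settings from this comment, in the units ZeroMQ documents for them.
rate_kbps = 1_000_000        # ZMQ_RATE: 1 Gbps expressed in kilobits/second
recovery_ivl_ms = 10_000     # ZMQ_RECOVERY_IVL: 10 s expressed in milliseconds

# Assumptions for illustration only.
tpdu_bytes = 1500            # assumed maximum TPDU size
skb_bytes = 1684             # per-instance size seen in the memory profile

# kilobits/second divided by 8 gives bytes/millisecond.
bytes_per_ms = rate_kbps // 8

# Bytes the receiver must be able to hold for one recovery interval.
window_bytes = bytes_per_ms * recovery_ivl_ms

# Number of sequence numbers (packets) in the receive window.
window_sqns = window_bytes // tpdu_bytes

# Approximate memory used if every slot holds one pgm_sk_buff_t.
footprint_bytes = window_sqns * skb_bytes

print(f"window_sqns     = {window_sqns}")
print(f"footprint_bytes = {footprint_bytes} (~{footprint_bytes / 1e9:.2f} GB)")
```

Under these assumptions the packet count comes out equal to the reported window->alloc, and the footprint is roughly 1.4 GB, close to the observed ceiling. Lowering either ZMQ_RATE or ZMQ_RECOVERY_IVL shrinks the window proportionally.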

AntonioGMuriana avatar Mar 26 '19 23:03 AntonioGMuriana