
OVT Streamworker can get stuck, eating all system memory

SceneCityDev opened this issue 10 months ago • 3 comments

For unknown reasons, an Edge reconnected its OVT connection to the Origin:

[2024-04-01 00:03:11.996] W [SPOvtPub-t9000:957880] Socket.Pool.Worker | socket_pool_worker.cpp:218  | [#16] [0x7fc9020c9b10] Failed to send data for 10000ms - This socket is going to be garbage collected (<ClientSocket: 0x7fc8f7550010, #27, Connected, TCP, Nonblocking, 136.243.153.227:43788>)
[2024-04-01 00:03:11.996] W [SPOvtPub-t9000:957880] Socket | socket.cpp:2109 | [#27] [0x7fc8f7550010] Socket was closed even though there was 4244 commands in the socket
[2024-04-01 00:03:17.009] I [SPOvtPub-t9000:957880] OVT | ovt_publisher.cpp:181  | OvtProvider is connected : <ClientSocket: 0x7fc8f7401c10, #34, Connected, TCP, Nonblocking, 142.132.205.123:55476>
[2024-04-01 00:03:17.024] I [SPOvtPub-t9000:957880] Monitor | stream_metrics.cpp:114  | A new session has started playing #default#origin1/revision-main on the OVT publisher. OVT(16)/Stream total(16)/App total(16)

As a result, the OVT stream worker queue got stuck and started overflowing:

[2024-04-01 00:03:17.674] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 2281, threshold: 500, peak: 2281
[2024-04-01 00:03:22.681] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 4306, threshold: 500, peak: 4306
[2024-04-01 00:03:27.696] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 6341, threshold: 500, peak: 6341
[2024-04-01 00:03:32.708] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 8399, threshold: 500, peak: 8399
[2024-04-01 00:03:37.719] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 10445, threshold: 500, peak: 10445
[2024-04-01 00:03:42.738] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 12491, threshold: 500, peak: 12491
[2024-04-01 00:03:47.769] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 14544, threshold: 500, peak: 14544
[2024-04-01 00:03:52.781] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 16591, threshold: 500, peak: 16591
[2024-04-01 00:03:57.797] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 18629, threshold: 500, peak: 18629
[2024-04-01 00:04:02.807] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 20673, threshold: 500, peak: 20673
[2024-04-01 00:04:07.828] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 22685, threshold: 500, peak: 22685
[2024-04-01 00:04:12.842] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 24691, threshold: 500, peak: 24691
[2024-04-01 00:04:17.852] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 26671, threshold: 500, peak: 26671
[2024-04-01 00:04:22.863] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 28650, threshold: 500, peak: 28650

...it continued to do so for a full hour, until all 16 GB of system memory had been used up and the Linux kernel OOM killer terminated the OME process:

[2024-04-01 01:01:03.903] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 1422024, threshold: 500, peak: 1422024
[2024-04-01 01:01:08.921] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 1424007, threshold: 500, peak: 1424007
[2024-04-01 01:17:46.146] W [AW-OVT0:957902] ManagedQueue | managed_queue.h:444  | [122] mngq:v=#default#origin1:s=revision-main:p=pub:n=streamworker_ovt size has exceeded the threshold: queue: 1425251, threshold: 500, peak: 1425251

There may be a bug in the socket error handling here.

For the OVT stream worker, SetPreventExceedThreshold is not set, which I find dangerous. In general, though, I don't think silently dropping messages would be the right way to respond to a bug in another part of the code. Instead, the managed queue class should have some error handling for the case where a queue gets stuck, in this case by killing the stuck stream worker thread; a rough sketch of the idea follows below.
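
To illustrate the kind of error handling I have in mind, here is a minimal sketch. It is not OvenMediaEngine code: the class, method, and callback names (StuckQueueWatchdog, Enqueue, Dequeue, on_stuck) are made up for illustration, and the real ManagedQueue API differs. The idea is simply that once the queue is over its threshold and the consumer has not dequeued anything for a configurable time, an error callback fires instead of the queue growing without bound:

// Hypothetical sketch only, not OME's ManagedQueue.
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

template <typename T>
class StuckQueueWatchdog
{
public:
    using Clock = std::chrono::steady_clock;

    StuckQueueWatchdog(size_t threshold,
                       std::chrono::milliseconds stall_timeout,
                       std::function<void()> on_stuck)
        : _threshold(threshold),
          _stall_timeout(stall_timeout),
          _on_stuck(std::move(on_stuck))
    {
    }

    void Enqueue(T item)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _queue.push_back(std::move(item));

        // Queue is over the threshold AND nothing has been dequeued for
        // `stall_timeout`: treat the consumer as stuck and report it,
        // rather than growing until the OOM killer steps in.
        // (The callback runs under the lock here and must not re-enter
        // the queue; a real implementation would defer it.)
        if (_queue.size() > _threshold &&
            Clock::now() - _last_dequeue > _stall_timeout)
        {
            _on_stuck();
        }
    }

    bool Dequeue(T &item)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_queue.empty())
        {
            return false;
        }
        item = std::move(_queue.front());
        _queue.pop_front();
        _last_dequeue = Clock::now();
        return true;
    }

private:
    std::deque<T> _queue;
    std::mutex _mutex;
    size_t _threshold;
    std::chrono::milliseconds _stall_timeout;
    std::function<void()> _on_stuck;
    Clock::time_point _last_dequeue = Clock::now();
};

In OME, the callback could for example tear down the stuck OVT stream worker so the session is closed cleanly, instead of the whole process being OOM-killed an hour later.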

SceneCityDev · Apr 03 '24 20:04

This appears to be a bug where the OVT Publisher gets stuck in a deadlock. Could you please post the full log containing the problem? If you do not want your logs to be made public, please send them to [email protected]. Thank you.

getroot · Apr 04 '24 06:04

Sent.

SceneCityDev · Apr 05 '24 16:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Jun 05 '24 00:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Aug 09 '24 02:08