
Inconsistent use of Tesla T4 GPU causes fluctuating broadcasting

Open alfred-stokespace opened this issue 1 year ago • 7 comments

Enterprise Edition 2.7.0 20231031_0626. Continuation of https://github.com/ant-media/Ant-Media-Server/issues/5590

Short description

I have 8 inbound RTMP streams (1080p) and 3 transcoding renditions (default-480, default-720, default-1080). None of them can sustain broadcasting: they flip wildly from Broadcasting 0.01x up to Broadcasting 100x, and none of them can support playback for longer than a few seconds.

Environment

  • Ubuntu 20.04.6 LTS
  • Java version: build 11.0.20.1+1-post-Ubuntu-0ubuntu120.04
  • Ant Media Server version: Enterprise Edition 2.7.0 20231031_0626
  • Browser name and version: N/A

Steps to reproduce

  1. Install 2.7.0 on a g4dn.12xlarge
  2. Have 8 RTMP 1080p 30 fps streams at about 4-8 Mbps each being transcoded to 3 renditions (default-480, default-720, default-1080); see the ffmpeg sketch below
  3. Check the nvidia-smi output
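
For anyone reproducing this without physical cameras, here is a minimal sketch of pushing comparable RTMP test feeds with ffmpeg. The input file, server host, and stream IDs are placeholders; the bitrate and frame-rate flags mirror the numbers above.

# Push 8 looped 1080p30 test feeds at ~6 Mbps each (placeholder host/IDs).
for i in $(seq 1 8); do
  ffmpeg -re -stream_loop -1 -i sample-1080p.mp4 \
    -c:v libx264 -b:v 6M -maxrate 8M -bufsize 12M -r 30 -s 1920x1080 \
    -c:a aac -b:a 128k \
    -f flv "rtmp://AMS_HOST/LiveApp/test-stream-$i" &
done
wait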

Expected behavior

Same performance as 2.4.3, which is able to handle the exact same camera sources (and more) with the same rendition count/type. On 2.4.3 all 4 GPUs are utilized and it keeps all 25 streams at 99 to 101 percent broadcast status.

Actual behavior

Only a fraction of the streams can keep up and nvidia-smi shows the following...

Mon Nov 20 17:54:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    47W /  70W |   6843MiB / 15360MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |    891MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |    577MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |    969MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3978      C   ...11-openjdk-amd64/bin/java     6821MiB |
|    1   N/A  N/A      3978      C   ...11-openjdk-amd64/bin/java      882MiB |
|    2   N/A  N/A      3978      C   ...11-openjdk-amd64/bin/java      568MiB |
|    3   N/A  N/A      3978      C   ...11-openjdk-amd64/bin/java      960MiB |
+-----------------------------------------------------------------------------+

If you keep issuing nvidia-smi you'll eventually see that other GPUs get activated and then dropped (notice how GPU 2 is at 14% in the sample below). This is different from https://github.com/ant-media/Ant-Media-Server/issues/5590:

Mon Nov 20 17:40:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   36C    P0    48W /  70W |   7728MiB / 15360MiB |     64%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0    25W /  70W |   1185MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   27C    P0    26W /  70W |    693MiB / 15360MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P0    26W /  70W |    901MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

So this is different from issue 5590.
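
Rather than re-running nvidia-smi by hand, per-GPU utilization can be sampled continuously with its built-in query loop (a sketch; the 2-second interval and log filename are arbitrary choices):

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used --format=csv -l 2 | tee gpu-util.log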

Logs

will send to support upon request.

Lots of these in the logs...

2023-11-20 18:02:06,145 [vertx-blocked-thread-checker] WARN  i.v.core.impl.BlockedThreadChecker - Thread Thread[vert.x-worker-thread-34,5,main] has been blocked for 57892 ms, time limit is 10000 ms
io.vertx.core.VertxException: Thread blocked
	at org.bytedeco.ffmpeg.global.avfilter.avfilter_graph_free(Native Method)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.freeFilterResources(H264Encoder.java:870)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.freeEncoderRelatedResources(H264Encoder.java:861)
	at io.antmedia.enterprise.adaptive.base.VideoEncoder.writeTrailer(VideoEncoder.java:397)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.writeTrailer(H264Encoder.java:695)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.writeEncodeTrailers(StreamAdaptor.java:428)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.execute(StreamAdaptor.java:276)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.lambda$start$0(StreamAdaptor.java:182)
	at io.antmedia.enterprise.adaptive.StreamAdaptor$$Lambda$510/0x0000000800802840.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
	at io.vertx.core.impl.ContextImpl$$Lambda$404/0x00000008005c6440.handle(Unknown Source)
	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
	at io.vertx.core.impl.ContextImpl$$Lambda$401/0x00000008005c7440.run(Unknown Source)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at [email protected]/java.lang.Thread.run(Thread.java:829)
2023-11-20 18:02:06,146 [vertx-blocked-thread-checker] WARN  i.v.core.impl.BlockedThreadChecker - Thread Thread[vert.x-worker-thread-100,5,main] has been blocked for 571085 ms, time limit is 10000 ms
io.vertx.core.VertxException: Thread blocked
	at org.bytedeco.ffmpeg.global.avcodec.avcodec_send_frame(Native Method)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.avCodecSendFrame(H264Encoder.java:729)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.sendPacket2Encoder(H264Encoder.java:713)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.writeFrameInternal(H264Encoder.java:208)
	at io.antmedia.enterprise.adaptive.base.VideoEncoder.writeFrame(VideoEncoder.java:275)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.write2VideoEncoders(StreamAdaptor.java:347)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.execute(StreamAdaptor.java:228)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.lambda$start$0(StreamAdaptor.java:182)
	at io.antmedia.enterprise.adaptive.StreamAdaptor$$Lambda$510/0x0000000800802840.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
	at io.vertx.core.impl.ContextImpl$$Lambda$404/0x00000008005c6440.handle(Unknown Source)
	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
	at io.vertx.core.impl.ContextImpl$$Lambda$401/0x00000008005c7440.run(Unknown Source)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at [email protected]/java.lang.Thread.run(Thread.java:829)
2023-11-20 18:02:06,146 [vertx-blocked-thread-checker] WARN  i.v.core.impl.BlockedThreadChecker - Thread Thread[vert.x-worker-thread-78,5,main] has been blocked for 591002 ms, time limit is 10000 ms
io.vertx.core.VertxException: Thread blocked
	at org.bytedeco.ffmpeg.global.avcodec.avcodec_send_frame(Native Method)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.avCodecSendFrame(H264Encoder.java:729)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.sendPacket2Encoder(H264Encoder.java:713)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.writeFrameInternal(H264Encoder.java:208)
	at io.antmedia.enterprise.adaptive.base.VideoEncoder.writeFrame(VideoEncoder.java:275)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.write2VideoEncoders(StreamAdaptor.java:347)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.execute(StreamAdaptor.java:228)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.lambda$start$0(StreamAdaptor.java:182)
	at io.antmedia.enterprise.adaptive.StreamAdaptor$$Lambda$510/0x0000000800802840.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
	at io.vertx.core.impl.ContextImpl$$Lambda$404/0x00000008005c6440.handle(Unknown Source)
	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
	at io.vertx.core.impl.ContextImpl$$Lambda$401/0x00000008005c7440.run(Unknown Source)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at [email protected]/java.lang.Thread.run(Thread.java:829)
2023-11-20 18:02:06,223 [Thread-347] INFO  i.a.e.adaptive.StreamAdaptor - Queue size(2001) is exceeding 2000 so dropping frame for stream: REDACTEDCAMERANAME
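
To see which native FFmpeg calls the blocked worker threads are stuck in, one quick tally is to grab the top stack frame after each "Thread blocked" exception (a sketch; the stock install log path is assumed):

grep -A 2 "VertxException: Thread blocked" /usr/local/antmedia/log/ant-media-server.log \
  | grep "Native Method" | sort | uniq -c | sort -rn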

alfred-stokespace avatar Nov 20 '23 18:11 alfred-stokespace

As an experiment, I'm deactivating all the streams, then starting them up one at a time with a couple of minutes between each start-up (a scripted sketch of this ramp-up follows the list below). Results...

  1. one stream, stable (6% of 1 GPU of 4 available)
  2. two streams, stable (13% of 1 GPU of 4 available)
  3. three streams, stable (20% of 1 GPU of 4 available)
  4. four streams, stable (27% of 1 GPU of 4 available)
  5. five streams, stable (34% of 1 GPU of 4 available)
  6. six streams, stable (40% of 1 GPU of 4 available)
  7. seven streams, stable (47% of 1 GPU of 4 available)
  8. eight streams, stable (55% of 1 GPU of 4 available)
  9. nine streams, stable (56% of 1 GPU of 4 available) no change
  10. ten streams, stable (56% of 1 GPU of 4 available) no change
  11. eleven streams, stable (68% of 1 GPU of 4 available)
  12. twelve streams, stable (68% of 1 GPU of 4 available) no change
  13. thirteen streams, stable (70% of 1 GPU of 4 available)
  14. fourteen streams, stable (64-90% of 1 GPU of 4 available) oscillates
  15. fifteen streams, stable (63-70% of 1 GPU of 4 available) oscillates
  16. sixteen streams, stable (63-70% of 1 GPU of 4 available) oscillates
  17. seventeen streams, stable (63-100% of 1 GPU of 4 available) oscillates, and briefly GPU 2 lit up at 10% then dropped back down
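
A sketch of scripting that ramp-up, assuming these are stream sources that can be started through the v2 REST API (host, stream IDs, and the two-minute gap are placeholders):

# Start each stream source via REST, waiting two minutes between starts.
for id in cam1 cam2 cam3; do
  curl -X POST "http://AMS_HOST:5080/LiveApp/rest/v2/broadcasts/$id/start"
  sleep 120
done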

At the moment these are all stable. When I run nvidia-smi I see this

Mon Nov 20 18:41:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    50W /  70W |  11915MiB / 15360MiB |     67%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   28C    P0    25W /  70W |   1283MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |   1185MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |   1087MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java    11887MiB |
|    1   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     1274MiB |
|    2   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     1176MiB |
|    3   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     1078MiB |
+-----------------------------------------------------------------------------+

alfred-stokespace avatar Nov 20 '23 18:11 alfred-stokespace

After that experiment, I tried resetting all the streams (stop/start), and now we are back in the original failure state shown at the top of the issue: all the channels are flopping around at 0.01x and the logs are now full of

2023-11-20 18:51:44,521 [vertx-blocked-thread-checker] WARN  i.v.core.impl.BlockedThreadChecker - Thread Thread[vert.x-worker-thread-109,5,main] has been blocked for 28292 ms, time limit is 10000 ms
io.vertx.core.VertxException: Thread blocked
	at org.bytedeco.ffmpeg.global.avcodec.avcodec_send_frame(Native Method)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.avCodecSendFrame(H264Encoder.java:729)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.sendPacket2Encoder(H264Encoder.java:713)
	at io.antmedia.enterprise.adaptive.video.H264Encoder.writeFrameInternal(H264Encoder.java:208)
	at io.antmedia.enterprise.adaptive.base.VideoEncoder.writeFrame(VideoEncoder.java:275)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.write2VideoEncoders(StreamAdaptor.java:347)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.execute(StreamAdaptor.java:228)
	at io.antmedia.enterprise.adaptive.StreamAdaptor.lambda$start$0(StreamAdaptor.java:182)
	at io.antmedia.enterprise.adaptive.StreamAdaptor$$Lambda$509/0x0000000800808840.handle(Unknown Source)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
	at io.vertx.core.impl.ContextImpl$$Lambda$404/0x00000008005c6440.handle(Unknown Source)
	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
	at io.vertx.core.impl.ContextImpl$$Lambda$401/0x00000008005c7440.run(Unknown Source)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at [email protected]/java.lang.Thread.run(Thread.java:829)

and the nvidia-smi command output looks different now (notice the GPU Memory)

Mon Nov 20 18:55:09 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    47W /  70W |   8972MiB / 15360MiB |     68%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   29C    P0    25W /  70W |   1087MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |    499MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0    26W /  70W |   1383MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     8947MiB |
|    1   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     1078MiB |
|    2   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java      490MiB |
|    3   N/A  N/A     26084      C   ...11-openjdk-amd64/bin/java     1372MiB |
+-----------------------------------------------------------------------------+


alfred-stokespace avatar Nov 20 '23 18:11 alfred-stokespace

I was about to create another ticket about problems with GPU usage, but it looks similar to this one.

We have recently upgraded our dev server from 2.6.3 to 2.7.0 in order to run some tests, hoping that GPU issues were fixed. Our initial configuration:

  • OS: Ubuntu 22.04 LTS
  • AntMedia: 2.7.0 Enterprise Edition
  • CPU: 4 cores (later increased to 8)
  • GPU: Quadro RTX 4000
  • RAM: 16 GB (later increased to 64GB)

Test setup:

  • ABR with 4 resolutions (2160, 1080, 720, 240)
  • test stream source: IP camera 3840x2160, delivered over rtmp
  • after the initial test, RAM was increased to 64 GB, then we assigned 4 additional CPU cores

Results:

+---------+-----------+-----------+-----------+------------------------------------------------------------+
| Streams | CPU usage | GPU usage | RAM usage | Notes                                                      |
+---------+-----------+-----------+-----------+------------------------------------------------------------+
|    4    |  ~200%    |  

Notice that neither CPU, GPU, nor RAM usage is high enough to cause problems after the increase in resources.

Previously we had Ant Media 2.4.3 and problems only started to arise with the 8th or 9th stream. We started to see the problem after upgrading to 2.6.3 (forced because of upgrading Ubuntu to 22.04), after which we had to turn off ABR completely.

kputyra avatar Nov 21 '23 11:11 kputyra

Hi Guys,

I've put it in the backlog with high priority. It's likely we'll schedule it soon.

FYI

mekya avatar Nov 27 '23 10:11 mekya

Hello, I've looked into this matter and I've found a solution. Once it's merged, I'll provide an update. If you need it urgently, please let me know.

lastpeony avatar Dec 05 '23 15:12 lastpeony

@lastpeony @burak-58 any update? Do you think this gets fixed in 2.9.x? (I see 2.9.0 was released recently).

alfred-stokespace avatar Apr 10 '24 22:04 alfred-stokespace

Hi @alfred-stokespace,

I remember that we've fixed some issues related to this one.

@burak-58 , could you please update us?

Regards,
Oguz

mekya avatar May 06 '24 15:05 mekya

Hello @alfred-stokespace I performed some tests. Please find the details here: https://github.com/ant-media/Ant-Media-Server/issues/6389#issuecomment-2186294546 I think you can upgrade to 2.9.0 and try with hwScalingEnabled=false

lastpeony avatar Jun 24 '24 12:06 lastpeony

@lastpeony finally getting into a realistic test. Had some trouble with NVIDIA drivers all of a sudden and had to build new servers off 22.04 with the AMS-suggested NVIDIA drivers. But after that quagmire I'm seeing the following...

Config change ... grep Scaling /usr/local/antmedia/webapps/LiveApp/WEB-INF/red5-web.properties shows...

settings.encoding.hwScalingEnabled=false
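
For anyone applying the same change, a sketch of setting the flag and restarting, assuming the stock install path and the antmedia systemd service name (the key is appended if absent):

# Set hwScalingEnabled=false in LiveApp's properties, then restart the server.
PROPS=/usr/local/antmedia/webapps/LiveApp/WEB-INF/red5-web.properties
if grep -q '^settings.encoding.hwScalingEnabled=' "$PROPS"; then
  sudo sed -i 's/^settings.encoding.hwScalingEnabled=.*/settings.encoding.hwScalingEnabled=false/' "$PROPS"
else
  echo 'settings.encoding.hwScalingEnabled=false' | sudo tee -a "$PROPS"
fi
sudo systemctl restart antmedia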

Again, this is Ubuntu 22.04

 nvidia-smi
Tue Jun 25 20:17:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0             28W /   70W |    1923MiB /  15360MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:1C.0 Off |                    0 |
| N/A   40C    P0             27W /   70W |     672MiB /  15360MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla T4                       Off |   00000000:00:1D.0 Off |                    0 |
| N/A   39C    P0             26W /   70W |     555MiB /  15360MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   39C    P0             27W /   70W |     530MiB /  15360MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java       1916MiB |
|    1   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java        668MiB |
|    2   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java        550MiB |
|    3   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java        526MiB |
+-----------------------------------------------------------------------------------------+

that's with six streams. So far so good. This looks like what I would expect from the 2.4.3 instance (still running like a champ, btw!).

alfred-stokespace avatar Jun 25 '24 20:06 alfred-stokespace

and now ramping up to 16 streams (3 renditions per stream; same streams, rendition details, and cameras as noted earlier in the ticket)

nvidia-smi
Tue Jun 25 20:23:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1B.0 Off |                    0 |
| N/A   44C    P0             40W /   70W |    5214MiB /  15360MiB |     41%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:1C.0 Off |                    0 |
| N/A   41C    P0             31W /   70W |    1602MiB /  15360MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla T4                       Off |   00000000:00:1D.0 Off |                    0 |
| N/A   40C    P0             30W /   70W |    1602MiB /  15360MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   39C    P0             31W /   70W |    1602MiB /  15360MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java       5201MiB |
|    1   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java       1592MiB |
|    2   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java       1592MiB |
|    3   N/A  N/A      9803      C   .../jvm/java-17-openjdk-amd64/bin/java       1592MiB |
+-----------------------------------------------------------------------------------------+

I also see that the stream status "Broadcasting 1.00x" is pretty stable on all the streams (that was another sign of problems before: it would fluctuate wildly); now it's fluctuating between 0.99x and 1.01x, which was common for 2.4.3 as well.

I dropped in a few streams with WebRTC player and didn't see any problems.
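
For reference, a quick way to spot-check playback like this is the sample player page that ships with the app (host and stream ID are placeholders, and port 5443 assumes the default SSL setup):

https://AMS_HOST:5443/LiveApp/play.html?id=test-stream-1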

So,... this is looking really good at the moment.

alfred-stokespace avatar Jun 25 '24 20:06 alfred-stokespace

Hi @alfred-stokespace,

Thank you for your thorough analysis and detailed bug report. It was instrumental in helping us identify and fix the issue.

I'm glad to hear that everything is working as expected on your end now. We're continuously working to improve GPU performance in future releases, while ensuring nothing else is broken :)

I'm closing this issue for now, but please feel free to reopen it if needed.

lastpeony avatar Jun 25 '24 21:06 lastpeony