Really high CPU load over time

Open dotarmin opened this issue 4 years ago • 28 comments

Expected behaviour

Be able to play clips, both long and short without having to worry about the CPU load.

Current behaviour

When playing shorter clips using v2.3.0 LTS (and also v2.2.0), the CPU load climbs to 90-92% over time and stays there. I have attached some screenshots to show what it looks like. For longer clips we do not see this behaviour.

Shorter clips = around 20 seconds
Longer clips = hours

I think it has to do with the number of commands sent and that it's not related to the actual file length, but it's just a theory.

  • v2.3.0 LTS does not crash when this happens
  • v2.2.0 does crash when this happens
  • v2.0.7 - Works

Used commands (from automation system)

LOAD
PLAY

LOAD
PLAY

Environment

  • Server version: v2.3.0 LTS
  • Operating system: Windows 7 x64
  • 8 decklink channels (fill only) configured but only 2 actively used

Screenshots

image01

image02

image03

image04

dotarmin avatar Dec 18 '20 07:12 dotarmin

We are experiencing that too. After some time the CasparCG 2.3 LTS process gets stuck at 99% and then fails. This happens even after STOPping all layers and then playing only one.

TondaKrist avatar Jan 08 '21 14:01 TondaKrist

I have seen this too.

ronag avatar Jan 08 '21 15:01 ronag

Does anyone have reliable repro steps?

ronag avatar Jan 08 '21 15:01 ronag

@scriptorian is able to reproduce this and is having a look into the cause

Julusian avatar Jan 08 '21 15:01 Julusian

Seems like it can be reproduced by issuing multiple LOAD and PLAY commands over time.

hummelstrand avatar Jan 08 '21 15:01 hummelstrand

Reproducible after multiple PLAY and LOADBG commands over time, as @hummelstrand mentioned, even on a single layer. I will prepare a command log to reproduce it.

TondaKrist avatar Jan 08 '21 15:01 TondaKrist

As mentioned I have managed to reproduce this with a test script that repeatedly LOADs a clip onto a channel/layer (using the ffmpeg producer). No PLAY is required to provoke the fault. For testing I have made the script loop every 200ms and this makes the problem apparent in a reasonable amount of time. The first symptom is the process working set increasing linearly, then after a few minutes the CPU load starts increasing too.
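A loop of this kind could be sketched roughly as follows. This is not the actual test script; it is a minimal sketch that assumes the server accepts AMCP commands over TCP on port 5250 on the local machine, and the channel-layer `1-10` and clip name `AMB` are placeholders chosen for illustration.

```python
import socket
import time

HOST = "127.0.0.1"      # assumption: server runs on the local machine
PORT = 5250             # assumption: AMCP is listening on its usual port
CHANNEL_LAYER = "1-10"  # hypothetical channel-layer
CLIP = "AMB"            # hypothetical clip name
INTERVAL_S = 0.2        # 200 ms between commands, as described above


def build_load_command(channel_layer: str, clip: str) -> bytes:
    """Format one AMCP LOAD command, terminated by CRLF."""
    return f'LOAD {channel_layer} "{clip}"\r\n'.encode("utf-8")


def run(iterations: int, host: str = HOST, port: int = PORT) -> None:
    """Send LOAD repeatedly over one TCP connection, pausing between sends."""
    with socket.create_connection((host, port), timeout=5) as conn:
        for _ in range(iterations):
            conn.sendall(build_load_command(CHANNEL_LAYER, CLIP))
            conn.recv(4096)  # read and discard the server's reply
            time.sleep(INTERVAL_S)


# run(iterations=10_000)  # uncomment to send commands to a live server
```

No PLAY is sent, matching the observation that LOAD alone is enough to provoke the fault. Watching the process working set while the loop runs should show whether the growth appears.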

I have analysed the application using various tools and confirmed that it is working well and not leaking any threads or objects on the heap (with the exception of one rare bug that I have addressed - not relevant to this problem) which is great news but frustrating in terms of finding the problem. I recently tried running Windows Performance Analyzer and finally found a clue. By comparing CPU usage early and late in a run it was apparent that an increasing amount of time was spent in the TBB library and with cleaning up thread local storage. With some very simple (and not production ready!) hacking I removed the TBB thread parallel optimisations in the ffmpeg producer and the memory and CPU growth problem disappeared.

I don't believe there is anything wrong with the CasparCG code that uses this library so my next step will be to get an updated version of the TBB library and try again with that. The release notes mention some bugfixes that may be relevant. Intel have now wrapped it into their new oneAPI product and installing that failed for me just now. If anyone here has experience of this library (@ronag?) I'd be grateful for any pointers for how you cooked it / downloaded it last time.

scriptorian avatar Jan 20 '21 12:01 scriptorian

Try skipping the custom tbb stuff and use the regular ffmpeg thread pool?

ronag avatar Jan 20 '21 13:01 ronag

Thanks @ronag. If you are referring to the override of AVFilterGraph::execute that is currently using TBB as the custom multithreading implementation then yes, I have turned this off. The real difference with this problem though is in the tbb::parallel_invoke and tbb::parallel_for_each calls in av_producer and av_util. Removing these stops the problem; removing just one of them halves the rate of growth!

scriptorian avatar Jan 20 '21 14:01 scriptorian

For now just remove the tbb stuff. We can follow up with another PR with an updated tbb version later.

ronag avatar Jan 20 '21 16:01 ronag

I don't know how to update tbb at the moment since intel wrapped it into oneAPI.

ronag avatar Jan 20 '21 16:01 ronag

on windows you can also try https://docs.microsoft.com/en-us/cpp/parallel/concrt/how-to-write-a-parallel-for-loop?view=msvc-160

ronag avatar Jan 20 '21 16:01 ronag

Do we know if this problem occurs on Linux?

ronag avatar Jan 20 '21 16:01 ronag

Thanks for the suggestions. I've got hold of the latest tbb now and I think the best approach is to push through with trying that. If the problem has gone away then there are no code changes (any tbb interface changes notwithstanding) and linux should continue to work - hopefully without any problems. Any other approach would require a fair amount of code changes with potentially surprising impacts on performance and that seems like something to avoid if possible.

scriptorian avatar Jan 20 '21 16:01 scriptorian

Sorry, is it something we can fix via some TBB tweaking in Windows, or not?

TondaKrist avatar Jan 25 '21 12:01 TondaKrist

I have now downloaded and built with the latest TBB library from the Intel oneAPI product. There were some API changes but dealing with these was straightforward and should be safe. The good news is that this completely fixed the growing CPU and memory problems. I have left my test script running for a good long time and everything stayed very steady.
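For anyone repeating this kind of soak test, one way to put a number on "steady" is to fit a straight line to logged samples and look at the slope. The sketch below assumes the working set (or CPU percentage) has been logged externally as (seconds, value) pairs; the function name and the units are hypothetical and not part of any CasparCG tooling.

```python
from typing import Sequence, Tuple


def growth_per_minute(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of value against time, scaled to per-minute.

    samples: (seconds_since_start, value) pairs, e.g. working set in MB.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t * 60.0
```

A slope close to zero over a long run suggests the process has stopped growing, while a clearly positive slope matches the linear working-set increase described earlier in the thread.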

scriptorian avatar Jan 25 '21 12:01 scriptorian

Awesome, will it be included in a future build of CasparCG? Or can you please provide your build for long-term testing?

TondaKrist avatar Jan 25 '21 12:01 TondaKrist

We are just discussing how to progress with testing this change and whether to make a beta version. Does anyone here have any thoughts? I'll update this thread when we have a plan!

scriptorian avatar Jan 25 '21 12:01 scriptorian

Please beta test and report any issues here! https://github.com/CasparCG/server/releases/tag/v2.3.2-lts-beta

hummelstrand avatar Jan 25 '21 19:01 hummelstrand

Is this something to worry about on Linux? (Running NRK version).

dimitry-ishenko avatar Jan 25 '21 21:01 dimitry-ishenko

It's not clear whether the TBB bug also exists in the Linux version. The TBB release notes include some mentions of fixing relevant bugs in the Windows version so there is reasonable hope that this problem won't affect Linux. The updated TBB library is available for Linux so it should be straightforward to make an updated build if problems appear.

scriptorian avatar Jan 26 '21 08:01 scriptorian

Is this something to worry about on Linux? (Running NRK version).

The latest NRK version of CasparCG Server is v2.1, so it is not affected by this bug which seems to have been introduced in v2.2.

hummelstrand avatar Jan 26 '21 09:01 hummelstrand

OK I get it. Thank you @scriptorian and @hummelstrand

dimitry-ishenko avatar Jan 26 '21 15:01 dimitry-ishenko

Just FYI: It seems there is no problem with increasing CPU load on 2.3.2 beta on Windows 10 (yellow lines). There is just a slight memory usage increase over time but from my experience, it will eventually drop.

Green lines belong to a custom 2.3.0 build running on Debian. Both servers use LOADBG/AUTO to play mixed (Linux) and XDCAM HD (Windows) playlists.

shot-justreadtheinstructions-20210129-111714

martastain avatar Jan 29 '21 10:01 martastain

I can confirm that this build fixes the CPU usage leak on Windows (both Intel and AMD machines, currently running 24/7 for 5 days). Thanks guys, awesome job on the investigation and fix.

Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled. I will start a new thread for that.

TondaKrist avatar Jan 29 '21 10:01 TondaKrist

Unfortunately I have experienced a memory leak on the GPU when HTML template GPU acceleration is enabled.

I have also encountered this.

ronag avatar Jan 29 '21 11:01 ronag

@TondaKrist or @ronag, can you please create an issue for this if not already done? Thanks

Never mind, already done, thanks!

dotarmin avatar Jan 29 '21 12:01 dotarmin

This is off-topic, but the beta version v2.3.2-lts-beta also has audio issues on systems that use 1001-based standards. #1326 already has a solution to the audio issue, and I hope NTSC users can take part in this test. Thanks~~

sendust avatar Feb 01 '21 03:02 sendust