
Global slow down of powerapi with smartwatt formula

Open PierreRustOrange opened this issue 3 years ago • 4 comments

When running powerapi with smartwatt on a system with more than 30 containers, it runs smoothly for a few minutes and then slows down to the point where it only processes one message every 20 seconds (or more!), even though the sensor produces many more reports than that (I've tested the sensor alone with a simple socket server and didn't find any issue there).

I'm not sure what is causing this behavior, but I have found a few leads that might explain it:

  • PowerAPI uses the multiprocQueueBase implementation of the ActorSystem, which, according to the documentation, has an "unexplained, core-level concern about dropped messages/deadlocks for the queue messaging in overload conditions."
  • The SocketDB used by the power_puller (I'm using the socket interface between the sensor and the formula) starts its server using asyncio, which is discouraged by Thespian's author (see https://groups.google.com/g/thespianpy/c/8Y1Ggnd_8rM/m/oum7F5kzBwAJ)

Unfortunately I'm not sure the issue I encounter is actually related to these problems.

For full disclosure, I'm running into this with a custom, modified PowerAPI, although I doubt my modifications have anything to do with the slowdown: I simply add labels to reports (with a modifier actor) and use a custom Prometheus exporter.

PierreRustOrange avatar Aug 25 '22 10:08 PierreRustOrange

Hello, which version of smartwatts are you using? With the latest version you can change the ActorSystem implementation that is used (cf. https://github.com/powerapi-ng/powerapi/releases/tag/v1.1.0). The default is now simpleSystemBase.
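
For reference, at the Thespian level the implementation is chosen by the name passed to `ActorSystem`. The following is only a minimal sketch, not PowerAPI's own startup code: the helper `start_actor_system` and the `KNOWN_BASES` tuple are hypothetical names, and the `thespian` package is assumed to be installed.

```python
# System base names listed in the Thespian documentation.
KNOWN_BASES = (
    "simpleSystemBase",
    "multiprocQueueBase",
    "multiprocTCPBase",
    "multiprocUDPBase",
)


def start_actor_system(base_name="simpleSystemBase"):
    """Start a Thespian actor system with the requested system base.

    Hypothetical helper: rejects unknown names before touching Thespian.
    """
    if base_name not in KNOWN_BASES:
        raise ValueError(f"unknown system base: {base_name!r}")
    # Imported lazily so the name check works even without Thespian installed.
    from thespian.actors import ActorSystem
    return ActorSystem(systemBase=base_name)
```

With `start_actor_system("simpleSystemBase")`, every actor runs in the calling process; call `.shutdown()` on the returned system when finished.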

roda82 avatar Aug 25 '22 11:08 roda82

I'll try that, thanks! BTW, do you have any feedback on each ActorSystem implementation when running PowerAPI (besides what's in Thespian's documentation), e.g. which is the fastest, the most reliable, etc.?

I also think we have an issue with logging when using multiprocQueueBase (plain Python logging does not work well across multiple processes, and I've seen cases where log records were lost), and it will be worse with multiprocQueueBase with the current logging implementation. In the base actor we say that we send messages to a logging actor (https://github.com/powerapi-ng/powerapi/blob/v1.1.0/powerapi/actor.py#L101), but we actually just use the standard Python logging framework.
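
One standard-library way to avoid losing records across processes is to have every process push its records onto a shared queue and let a single listener write them out. The sketch below shows that pattern with `logging.handlers.QueueHandler` and `QueueListener`; it is not what PowerAPI does today, and `worker`, `configure_worker_logging`, `start_listener` and `main` are hypothetical names used only for illustration.

```python
import logging
import logging.handlers
import multiprocessing


def configure_worker_logging(log_queue):
    """Route every record emitted in this process to the shared queue."""
    root = logging.getLogger()
    root.handlers.clear()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.DEBUG)


def worker(log_queue, worker_id, count):
    """Hypothetical stand-in for an actor process that emits warnings."""
    configure_worker_logging(log_queue)
    logger = logging.getLogger(f"worker-{worker_id}")
    for i in range(count):
        logger.warning("report %d handled", i)


def start_listener(log_queue, *handlers):
    """Start the single thread that drains the queue into real handlers."""
    listener = logging.handlers.QueueListener(log_queue, *handlers)
    listener.start()
    return listener


def main():
    """Run four worker processes that all log through one listener."""
    log_queue = multiprocessing.Queue()
    listener = start_listener(log_queue, logging.StreamHandler())
    procs = [
        multiprocessing.Process(target=worker, args=(log_queue, n, 3))
        for n in range(4)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    listener.stop()  # flushes what is still queued, then stops the thread
```

To try it, call `main()` under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn rather than fork.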

PierreRustOrange avatar Aug 25 '22 12:08 PierreRustOrange

I was using a version of powerapi just before this change ! I've switched to simpleSystemBase and it improves things a lot, the slowdown seems to be gone :) thanks !

PierreRustOrange avatar Aug 26 '22 07:08 PierreRustOrange

Great! Concerning your question about the different ActorSystem implementations, we tested them with 100,000 reports. Here is a summary of the results:

  1. multiprocQueueBase
     • Lost messages: a sleep (0.25 s) is required to avoid them. The same problem was detected in the Inria demo. Known performance issue (cf. Thespian documentation).
     • Execution time: unknown, as messages are lost
  2. simpleSystemBase
     • Execution time: 90 s
     • Single-process and synchronous
     • No lost messages
  3. multiprocTCPBase
     • Execution time: 50 minutes
     • No lost messages
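
For anyone who wants to repeat this kind of comparison, the two quantities above (elapsed time and lost messages) can be collected with a small harness like the one below. This is only a sketch and not the script we used: `measure`, `RunResult` and the `deliver` callable are hypothetical, with `deliver` standing in for whatever pushes one report through the pipeline under test and reports whether it arrived.

```python
import time
from dataclasses import dataclass


@dataclass
class RunResult:
    sent: int
    received: int
    seconds: float

    @property
    def lost(self):
        """Number of reports that never came out of the pipeline."""
        return self.sent - self.received


def measure(deliver, report_count):
    """Push numbered reports through `deliver` and count the arrivals.

    `deliver` is a hypothetical callable: it takes one report (a dict)
    and returns True if the pipeline confirmed it, False if it was dropped.
    """
    start = time.perf_counter()
    received = sum(1 for n in range(report_count) if deliver({"id": n}))
    return RunResult(report_count, received, time.perf_counter() - start)
```

A run with a non-zero `lost` value means the timing is not comparable with the others, which is why no execution time is given for the first entry.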

As you stated, there is a logging issue with the multiprocQueueBase implementation that is not present in simpleSystemBase. We observed that warnings were missing with the former but not with the latter. We therefore also need to improve log management in PowerAPI when using multiprocessing.

roda82 avatar Aug 26 '22 09:08 roda82

Hello, this problem has been fixed as of version 2.0.0 of PowerAPI. Please open a new issue if you encounter other performance problems.

gfieni avatar May 05 '23 14:05 gfieni