
AsioConditionVariable#Wait(): don't unnecessarily waste CPU time

Open · Al2Klimov opened this issue 3 years ago · 15 comments

by letting the timer in #Clear() expire at a time that could theoretically be reached, so that a #Wait()ing coroutine actually sleeps.
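For illustration, here is a minimal sketch of the idea (not the actual diff of this PR; the concrete timestamp is only a placeholder): Clear() parks the timer at a far-future but finite point in time instead of at positive infinity, so the underlying timer always has an exact expiry to sleep until.

void AsioConditionVariable::Clear()
{
	// Before: m_Timer.expires_at(boost::posix_time::pos_infin);
	// After (sketch): a finite expiry roughly 100 years past the epoch,
	// i.e. practically never reached, but an exact point in time.
	m_Timer.expires_at(boost::posix_time::from_time_t(0) + boost::posix_time::hours(24 * 365 * 100));
}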

Al2Klimov · Sep 06 '22 14:09

Please have a look at whether this helps in addition to #9513.

Al2Klimov · Sep 06 '22 14:09

Note: if it (or another large timestamp) does help, the still-open Redis connections aren't a problem anymore.

Al2Klimov · Sep 06 '22 14:09

Isn't the basic claim of this PR that AsioConditionVariable is basically broken and returns immediately even when it should wait? If that were the case, I'd expect way higher CPU utilization all the time, especially as it's also used by JsonRpcConnection. So I think there must be more to it. Is this specific to some Boost version or something like that?

julianbrost · Sep 07 '22 14:09

I don't see an alternative to this PR in either case (specific or not).

Al2Klimov · Sep 07 '22 15:09

But this needs more information. I can't follow that claim. What system were you observing this issue on? I tried running icinga2 with the patch below, which also stores a flag when the condition variable is notified and checks it before returning from Wait() (if you want to try it yourself, leave out the log messages; they aren't really helpful, there are just a lot of them), and in my tests, it never returned without the flag being set first.

Patch
diff --git a/lib/base/io-engine.cpp b/lib/base/io-engine.cpp
index c00bd7819..712b8a9ef 100644
--- a/lib/base/io-engine.cpp
+++ b/lib/base/io-engine.cpp
@@ -132,17 +132,23 @@ AsioConditionVariable::AsioConditionVariable(boost::asio::io_context& io, bool i
 void AsioConditionVariable::Set()
 {
        m_Timer.expires_at(boost::posix_time::neg_infin);
+       m_Notified = true;
+       Log(LogInformation, "AsioConditionVariable") << "set " << this;
 }
 
 void AsioConditionVariable::Clear()
 {
        m_Timer.expires_at(boost::posix_time::pos_infin);
+       m_Notified = false;
+       Log(LogInformation, "AsioConditionVariable") << "clear " << this;
 }
 
 void AsioConditionVariable::Wait(boost::asio::yield_context yc)
 {
        boost::system::error_code ec;
        m_Timer.async_wait(yc[ec]);
+       Log(LogInformation, "AsioConditionVariable") << "wait " << this << " returned";
+       VERIFY(m_Notified);
 }
 
 void Timeout::Cancel()
diff --git a/lib/base/io-engine.hpp b/lib/base/io-engine.hpp
index 684d3ac39..7aca9ba06 100644
--- a/lib/base/io-engine.hpp
+++ b/lib/base/io-engine.hpp
@@ -159,6 +159,7 @@ public:
 
 private:
        boost::asio::deadline_timer m_Timer;
+       bool m_Notified {false};
 };
 
 /**

julianbrost · Sep 07 '22 15:09

To answer your very first question: no, that's not my claim. My actual claim:

To actually sleep until a timer expires, the timer needs to know exactly when it expires. (Positive) infinity is not an exact point in time. Boost/libc/kernel internals could have problems with this. Any early returns (hopefully) happen only internally and (as you said about CPU utilisation) are not instant.
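For illustration (this is a sketch, not this PR's code and not the PoC repo linked further below): the two expiry styles can be compared with a standalone program that parks a deadline_timer either at pos_infin or at a far-future finite time, runs the io_context, and whose user/sys CPU time is then measured from outside, e.g. with time(1).

#include <boost/asio.hpp>
#include <boost/asio/deadline_timer.hpp>

int main(int argc, char** argv)
{
	boost::asio::io_context io;
	boost::asio::deadline_timer timer (io);

	if (argc > 1) {
		// Finite, far-future expiry: an exact point in time to sleep until.
		timer.expires_at(boost::posix_time::from_time_t(0) + boost::posix_time::hours(24 * 365 * 100));
	} else {
		// Infinite expiry, as in the current Clear().
		timer.expires_at(boost::posix_time::pos_infin);
	}

	timer.async_wait([](boost::system::error_code) {});
	io.run(); // Blocks until the process is killed; compare CPU time of both variants externally.
}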

Al2Klimov · Sep 07 '22 16:09

Boost/libc/kernel internals could have problems with this.

Please share a combination of these where I can actually observe this being an issue.

julianbrost · Sep 08 '22 08:09

tried running icinga2 with the patch below, which also stores a flag when the condition variable is notified and checks it before returning from Wait()

Confirmed!

yhabteab · Sep 08 '22 12:09

Please share a combination of these where I can actually observe this being an issue.

Just build and run the container:

https://github.com/Al2Klimov/poc-asio-inf-timer-wastes-cpu

Al2Klimov · Sep 12 '22 10:09

The mystery goes on.

I'm unable to see any difference using that Docker container:

+ ./runfor.sh 30 /waste
+ P=7
+ exec /waste
+ sleep 30
+ kill 7
/compare.sh: line 5:     7 Terminated              ./runfor.sh 30 /waste

real	0m30.047s
user	0m0.133s
sys	0m0.134s
+ ./runfor.sh 30 /waste not.
+ P=10
+ exec /waste not.
+ sleep 30
+ kill 10
/compare.sh: line 6:    10 Terminated              ./runfor.sh 30 /waste not.

real	0m30.047s
user	0m0.141s
sys	0m0.147s

So I even went ahead and set up a CentOS 7 VM; no excessive CPU usage there either:

# time ./waste 
^C

real	1m29.769s
user	0m0.280s
sys	0m0.313s

That's a fully updated CentOS 7 with:

  • Kernel: 3.10.0-1160.76.1.el7.x86_64
  • g++: g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
  • Boost: 1.69.0-2.el7

julianbrost · Sep 12 '22 14:09

Uhm...

$ docker run --rm -it poc-asio-inf-timer-wastes-cpu
+ ./runfor.sh 30 /waste
+ P=7
+ exec /waste
+ sleep 30
+ kill 7
/compare.sh: line 5:     7 Terminated              ./runfor.sh 30 /waste

real	0m30.057s
user	0m0.281s
sys	0m0.246s
+ ./runfor.sh 30 /waste not.
+ P=10
+ exec /waste not.
+ sleep 30
+ kill 10
/compare.sh: line 6:    10 Terminated              ./runfor.sh 30 /waste not.

real	0m30.043s
user	0m0.173s
sys	0m0.165s
$

Al2Klimov · Sep 12 '22 14:09

~/Workspace/poc-asio-inf-timer-wastes-cpu (master ✔) docker run --rm -it poc-asio-inf-timer-wastes-cpu     
+ ./runfor.sh 30 /waste
+ P=6
+ exec /waste
+ sleep 30
+ kill 6
/compare.sh: line 5:     6 Terminated              ./runfor.sh 30 /waste

real    0m30.070s
user    0m0.245s
sys     0m0.220s
+ ./runfor.sh 30 /waste not.
+ P=9
+ exec /waste not.
+ sleep 30
+ kill 9
/compare.sh: line 6:     9 Terminated              ./runfor.sh 30 /waste not.

real    0m30.069s
user    0m0.225s
sys     0m0.275s

yhabteab · Sep 12 '22 15:09

Uhm...

user	0m0.281s
sys	0m0.246s

[...]

user	0m0.173s
sys	0m0.165s

Is that the difference you'd expect to see here? I'd have expected one to keep a core busy for most of the time or something like that.

julianbrost · Sep 12 '22 16:09

I've never claimed that thing is completely broken. (Only half broken.)

Al2Klimov · Sep 13 '22 07:09

Looks better now!

As I just learned, this was not an isolated test of this PR but rather in combination with #9513. I don't think it's surprising that there's a reduction in CPU usage if the number of generated/processed Redis queries is reduced significantly.

This doesn't provide evidence that this PR has a significant benefit, and we're down to "maybe using this might use a few CPU instructions less on some version combination" speculation here.

julianbrost · Sep 13 '22 12:09