AsioConditionVariable#Wait(): don't unnecessarily waste CPU time
by letting the timer expire, in #Clear(), at a time which could theoretically be reached, so that a #Wait()ing coroutine actually sleeps.
Please have a look at whether this helps in addition to #9513.
Note: if it (or another large timestamp) does help, the still-open Redis connections aren’t a problem anymore.
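For illustration, a minimal sketch of the idea (the concrete date and the free-standing function are placeholders of mine, not necessarily the code in this PR):

// Illustrative only: give the timer a far-future but finite deadline instead
// of pos_infin, so the timer backend gets a concrete wake-up time.
// The date below is a placeholder, not the value this PR uses.
#include <boost/asio/deadline_timer.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

void ClearSketch(boost::asio::deadline_timer& timer)
{
    // Before: timer.expires_at(boost::posix_time::pos_infin);
    timer.expires_at(boost::posix_time::ptime(boost::gregorian::date(2100, 1, 1)));
}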
Isn't the basic claim of this PR that AsioConditionVariable is basically broken and returns immediately even if it should wait? If that was the case, I'd expect way higher CPU utilization all the time, especially as it's also used by JsonRpcConnection. So I think there must be more to it. Is this specific to some Boost version or something like that?
I don’t see an alternative to this PR in both cases (specific and not specific).
But this needs more information. I can't follow that claim. What system were you observing this issue on? I tried running icinga2 with the patch below, which also stores a flag if the condition variable was notified and checks it before returning from Wait() (if you want to try it yourself, leave out the log messages; they aren't really helpful, there's just a lot of them), and in my tests, it never returned without the flag being set first.
Patch
diff --git a/lib/base/io-engine.cpp b/lib/base/io-engine.cpp
index c00bd7819..712b8a9ef 100644
--- a/lib/base/io-engine.cpp
+++ b/lib/base/io-engine.cpp
@@ -132,17 +132,23 @@ AsioConditionVariable::AsioConditionVariable(boost::asio::io_context& io, bool i
void AsioConditionVariable::Set()
{
m_Timer.expires_at(boost::posix_time::neg_infin);
+ m_Notified = true;
+ Log(LogInformation, "AsioConditionVariable") << "set " << this;
}
void AsioConditionVariable::Clear()
{
m_Timer.expires_at(boost::posix_time::pos_infin);
+ m_Notified = false;
+ Log(LogInformation, "AsioConditionVariable") << "clear " << this;
}
void AsioConditionVariable::Wait(boost::asio::yield_context yc)
{
boost::system::error_code ec;
m_Timer.async_wait(yc[ec]);
+ Log(LogInformation, "AsioConditionVariable") << "wait " << this << " returned";
+ VERIFY(m_Notified);
}
void Timeout::Cancel()
diff --git a/lib/base/io-engine.hpp b/lib/base/io-engine.hpp
index 684d3ac39..7aca9ba06 100644
--- a/lib/base/io-engine.hpp
+++ b/lib/base/io-engine.hpp
@@ -159,6 +159,7 @@ public:
private:
boost::asio::deadline_timer m_Timer;
+ bool m_Notified {false};
};
/**
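To make the verification idea above concrete outside of Icinga, here is a simplified, self-contained stand-in of mine. It mirrors the Set()/Clear()/Wait() logic and the m_Notified check from the patch, but it is not the actual io-engine.cpp:

// Simplified stand-in for AsioConditionVariable with the reviewer's
// m_Notified flag: Set() is triggered from a plain timer callback and
// Wait() must only return after Set() ran, otherwise the assert fires.
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <cassert>
#include <iostream>

class CondVar
{
public:
    explicit CondVar(boost::asio::io_context& io) : m_Timer(io)
    {
        m_Timer.expires_at(boost::posix_time::pos_infin);
    }

    void Set()
    {
        // Cancels any pending async_wait(), waking the waiting coroutine.
        m_Timer.expires_at(boost::posix_time::neg_infin);
        m_Notified = true;
    }

    void Clear()
    {
        m_Timer.expires_at(boost::posix_time::pos_infin);
        m_Notified = false;
    }

    void Wait(boost::asio::yield_context yc)
    {
        boost::system::error_code ec;
        m_Timer.async_wait(yc[ec]);
        assert(m_Notified); // the check the patch above adds via VERIFY()
    }

private:
    boost::asio::deadline_timer m_Timer;
    bool m_Notified{false};
};

int main()
{
    boost::asio::io_context io;
    CondVar cv(io);

    boost::asio::spawn(io, [&](boost::asio::yield_context yc) {
        cv.Wait(yc);
        std::cout << "Wait() returned after Set()\n";
    });

    // Notify after one second from a plain timer callback.
    boost::asio::deadline_timer notifier(io);
    notifier.expires_from_now(boost::posix_time::seconds(1));
    notifier.async_wait([&](boost::system::error_code) { cv.Set(); });

    io.run();
}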
To answer your very first question: no, wrong claim. The right claim (of mine):
To actually sleep until a timer expires, the timer needs to know exactly when it expires. (Positive) infinity is not exact. Boost/libc/kernel internals could have problems with this. Any early returns (hopefully) happen just internally and (as you said about CPU utilisation) are not instant.
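For what it's worth, the claim can be reduced to a minimal program of mine (hypothetical, not the PoC linked further down): a coroutine waits on a deadline_timer whose expiry is pos_infin, and the process is then watched with top(1); passing any argument switches to a finite far-future deadline for comparison.

// Hypothetical reduction of the claim: a single coroutine waits on a timer
// whose deadline is pos_infin. If the infinite deadline is handled poorly by
// Boost/libc/kernel internals, the process should show CPU usage while
// "sleeping"; otherwise it should sit idle. Run with an argument to use a
// finite far-future deadline (placeholder date) instead, for comparison.
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

int main(int argc, char** argv)
{
    boost::asio::io_context io;
    boost::asio::deadline_timer timer(io);

    if (argc > 1) {
        timer.expires_at(boost::posix_time::ptime(boost::gregorian::date(2100, 1, 1)));
    } else {
        timer.expires_at(boost::posix_time::pos_infin);
    }

    boost::asio::spawn(io, [&timer](boost::asio::yield_context yc) {
        boost::system::error_code ec;
        timer.async_wait(yc[ec]); // never completes; observe CPU with top(1)
    });

    io.run();
}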
Boost/libc/kernel internals could have problems with this.
Please share a combination of these where I can actually observe this being an issue.
tried running icinga2 with the patch below, which also stores a flag if the condition variable was notified and checks it before returning from Wait()
Confirmed!
Please share a combination of these where I can actually observe this being an issue.
Just build and run the container:
https://github.com/Al2Klimov/poc-asio-inf-timer-wastes-cpu
The mystery goes on.
I'm unable to see any difference using that Docker container:
+ ./runfor.sh 30 /waste
+ P=7
+ exec /waste
+ sleep 30
+ kill 7
/compare.sh: line 5: 7 Terminated ./runfor.sh 30 /waste
real 0m30.047s
user 0m0.133s
sys 0m0.134s
+ ./runfor.sh 30 /waste not.
+ P=10
+ exec /waste not.
+ sleep 30
+ kill 10
/compare.sh: line 6: 10 Terminated ./runfor.sh 30 /waste not.
real 0m30.047s
user 0m0.141s
sys 0m0.147s
So I even went on and set up a CentOS 7 VM, also no excessive CPU usage there:
# time ./waste
^C
real 1m29.769s
user 0m0.280s
sys 0m0.313s
That's a fully updated CentOS 7 with:
- Kernel: 3.10.0-1160.76.1.el7.x86_64
- g++: g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
- Boost: 1.69.0-2.el7
Uhm...
$ docker run --rm -it poc-asio-inf-timer-wastes-cpu
+ ./runfor.sh 30 /waste
+ P=7
+ exec /waste
+ sleep 30
+ kill 7
/compare.sh: line 5: 7 Terminated ./runfor.sh 30 /waste
real 0m30.057s
user 0m0.281s
sys 0m0.246s
+ ./runfor.sh 30 /waste not.
+ P=10
+ exec /waste not.
+ sleep 30
+ kill 10
/compare.sh: line 6: 10 Terminated ./runfor.sh 30 /waste not.
real 0m30.043s
user 0m0.173s
sys 0m0.165s
$
~/Workspace/poc-asio-inf-timer-wastes-cpu (master ✔) docker run --rm -it poc-asio-inf-timer-wastes-cpu
+ ./runfor.sh 30 /waste
+ P=6
+ exec /waste
+ sleep 30
+ kill 6
/compare.sh: line 5: 6 Terminated ./runfor.sh 30 /waste
real 0m30.070s
user 0m0.245s
sys 0m0.220s
+ ./runfor.sh 30 /waste not.
+ P=9
+ exec /waste not.
+ sleep 30
+ kill 9
/compare.sh: line 6: 9 Terminated ./runfor.sh 30 /waste not.
real 0m30.069s
user 0m0.225s
sys 0m0.275s
Uhm...
user 0m0.281s sys 0m0.246s[...]
user 0m0.173s sys 0m0.165s
Is that the difference you'd expect to see here? I'd have expected one to keep a core busy for most of the time or something like that.
I've never claimed that thing is completely broken. (Only half broken.)
Looks better now!
As I just learned, this was not an isolated test of this PR but rather one run in combination with #9513. I don't think it's surprising that there's a reduction in CPU usage if the number of generated/processed Redis queries is reduced significantly.
This doesn't provide evidence that this PR has a significant benefit, and we're down to "maybe using this might use a few CPU instructions less on some version combination" speculation here.