Modifies converse scheduler to prioritize NodeGroup messages
Modifies the converse scheduler's getNextMessage so nodeGroup messages can run with higher priority than local messages. Closes #3674. As it is, nodeGroup messages are not checked until all local and regular Charm queue (prio Q) messages have been checked, which causes issues when the application uses nodeGroup messages in the hope that some PE will attend to them quickly. This change makes getNextMessage check the nodeGroup queue every 2^nodeGrpFreq iterations with high priority, in addition to its usual check after exhausting the local queues (except the task Q). This commit has not been tested at all, but I am pushing it so others can help me test/fix it.
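For illustration, a minimal sketch of the intended logic, not the actual Converse code: `dequeue_local()`, `dequeue_sched()`, `dequeue_nodegroup()`, and `NODE_GRP_FREQ` are hypothetical stand-ins for the real queue accessors and the nodeGrpFreq knob.

```cpp
/* Hedged sketch of the idea behind the getNextMessage change (not the
 * actual Converse implementation). The queue accessors below are
 * placeholders for the real per-PE and node-level queues. */
#include <cstddef>

#define NODE_GRP_FREQ 4   /* assumed value: check the nodegroup Q every 2^4 = 16 calls */

/* Placeholder accessors: each would pop from the corresponding Converse
 * queue and return NULL when that queue is empty. */
static void *dequeue_local(void)     { return NULL; }  /* PE-local FIFO        */
static void *dequeue_sched(void)     { return NULL; }  /* prioritized (prio) Q */
static void *dequeue_nodegroup(void) { return NULL; }  /* node-level queue     */

void *getNextMessage(void)
{
    static unsigned long iteration = 0;   /* per-scheduler-loop counter */
    void *msg;

    /* New behavior: every 2^nodeGrpFreq iterations, look at the nodegroup
     * queue first, so some PE on the node picks those messages up quickly. */
    if ((++iteration & ((1UL << NODE_GRP_FREQ) - 1)) == 0) {
        if ((msg = dequeue_nodegroup()) != NULL)
            return msg;
    }

    /* Usual order: local FIFO, then the prioritized scheduler queue. */
    if ((msg = dequeue_local()) != NULL)
        return msg;
    if ((msg = dequeue_sched()) != NULL)
        return msg;

    /* Existing behavior is kept: the nodegroup queue is still checked once
     * the local queues (task Q excluded) are exhausted. */
    return dequeue_nodegroup();
}
```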
One thing to worry about is whether this change causes performance degradation by making the scheduler check the node queue too often (depending on whether the check is expensive, even for an empty queue, because of locking). It'd be nice if someone were to run a performance regression test.
@ericjbohm @ericmikida @ZwFink review please.
I thought this was already merged. @ZwFink @ericjbohm .. will you please take a look?
@lvkale this is the code we discussed during the meeting today
Sorry for the spam, but where is the meeting usually announced?
Should I consider all the older unresolved comments here as acceptably resolved?
In order for this to be mergeable it should be modified to no longer be a draft and the reviewer comments should be addressed and resolved. @lvkale
> One thing to worry about is whether this change causes performance degradation by making the scheduler check the node queue too often (depending on whether the check is expensive, even for an empty queue, because of locking). It'd be nice if someone were to run a performance regression test.
Was that performance analysis done? If so, what were the results?
Results of a benchmark that has one chare sending 10 messages to itself 500,000 times (SMP mode, higher is better). The entry method just increments a counter to track whether it should send another 10 messages to itself, so it should stress the queueing system. This result shows an overhead of 0.44% averaged over 50 runs, which translates to an overhead of a few nanoseconds per message. Any realistic small task is on the order of a few microseconds per message. I think we should pay more attention to the fact that the scheduler can only churn through ~1.2 million local entry methods per second than to the addition of a few nanoseconds of overhead.
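For reference, a hedged sketch of what such a self-send micro-benchmark might look like; the module, class names, and constants below are made up and are not the actual test code used for these numbers.

```cpp
/*
 * selfsend.ci (hypothetical interface file):
 *
 *   mainmodule selfsend {
 *     mainchare Main {
 *       entry Main(CkArgMsg *m);
 *     };
 *     chare Pinger {
 *       entry Pinger();
 *       entry void ping();
 *     };
 *   };
 */
#include "selfsend.decl.h"

static const int BATCH  = 10;      /* messages sent per round               */
static const int ROUNDS = 500000;  /* rounds, i.e. 5M messages in total     */

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    CProxy_Pinger::ckNew();          /* create the single benchmark chare   */
  }
};

class Pinger : public CBase_Pinger {
  int received, round;
  double start;
 public:
  Pinger() : received(0), round(0), start(CkWallTimer()) {
    sendBatch();
  }
  void sendBatch() {
    for (int i = 0; i < BATCH; ++i) thisProxy.ping();
  }
  void ping() {
    /* The entry method only bumps a counter; once a full batch has been
     * delivered it either starts the next round or reports and exits. */
    if (++received == BATCH) {
      received = 0;
      if (++round < ROUNDS) {
        sendBatch();
      } else {
        double rate = (double)BATCH * ROUNDS / (CkWallTimer() - start);
        CkPrintf("delivered %.0f msgs/s\n", rate);
        CkExit();
      }
    }
  }
};

#include "selfsend.def.h"
```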
Another possibly interesting point is that we run with 6 virtual nodes per physical node on our 192-cores-per-node machine because it gives better communication performance. This seems at odds with the model I think Sanjay described, in which intranode communication shouldn't be affected by the comm thread. Maybe there's a lot of locking going on?