Timeout used by park_timeout seems way too short, causes heavy CPU use

Open · hutch opened this issue 7 years ago • 13 comments

I have a small test program (implementing a component of a real application) using Bus that was idling at about 100-110% CPU utilization (1+ cores of 8), which is excessive.

When I changed the timeouts used by the two calls to park_timeout to 1ms (Duration::from_millis(1)), rather than the 1µs used in the current official release, CPU utilization fell to about 3.5% under a fairly realistic, somewhat heavy real-world load (not just idle, and including some computation).

I think this 3.5% CPU utilization is pretty impressive. Nice library you've got there :-)

hutch avatar Aug 03 '16 11:08 hutch

I agree. The goal is to switch to adaptive backoff spinning, but I'd prefer to not implement it myself. I'm half-waiting for mvdnes/spin-rs#29 to be fixed, and then simply use the SpinWait that library would then expose. I don't think I'd want to change the static wait time to 1ms, as that means increasing the expected wait time for very busy applications fairly significantly. If you want to give writing an exponential back-off system a go though, I'd be happy to look at a PR!
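
Roughly, the kind of back-off I have in mind looks like the sketch below. This is not bus's actual code; the spin count, the starting timeout, and the 1ms cap are placeholder numbers:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::thread;
    use std::time::Duration;

    // Sketch: spin briefly, then park with an exponentially growing timeout.
    // `ready` stands in for "a slot has freed up" in bus's ring buffer.
    fn wait_until_ready(ready: &AtomicBool) {
        let mut spins = 0u32;
        let mut park = Duration::new(0, 1_000); // start at 1µs
        while !ready.load(Ordering::Acquire) {
            if spins < 100 {
                // Spinning first is cheap if the slot frees up almost immediately.
                spins += 1;
                std::hint::spin_loop();
            } else {
                // Otherwise park for increasingly long intervals, capped so a
                // missed wake-up never costs more than ~1ms of extra latency.
                thread::park_timeout(park);
                park = std::cmp::min(park * 2, Duration::new(0, 1_000_000));
            }
        }
    }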

Also, thanks :)

jonhoo avatar Aug 04 '16 10:08 jonhoo

Is this really fixable by backoff spinning? I mean, spinning burns CPU instead of properly giving up the thread's time slice.

I think the proper fix would be to use some sort of CondVar or other real parking/unparking support.

arthurprs avatar Aug 15 '16 14:08 arthurprs

Given that there aren't any locks, CondVar won't work. This is also the reason why we need to use park_timeout instead of just park (though note that we are using park/unpark), as there are races where the other peer won't be woken up. Notice that we're not using a spin lock; it's more like we are implementing CondVar ourselves. Spinning initially, and then parking for increasingly long times, is a pretty sensible approach (parking_lot does something similar).
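
For concreteness: std's Condvar::wait takes a MutexGuard, so it only works when the shared state sits behind a Mutex, which bus's doesn't. A minimal sketch of that lock-based pattern, just for contrast:

    use std::sync::{Condvar, Mutex};

    // Classic Mutex + Condvar wait: the guard is handed to wait(), which
    // atomically unlocks the mutex and blocks until notify is called.
    fn wait_for_flag(state: &(Mutex<bool>, Condvar)) {
        let (lock, cvar) = state;
        let mut ready = lock.lock().unwrap();
        while !*ready {
            ready = cvar.wait(ready).unwrap();
        }
    }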

jonhoo avatar Aug 22 '16 08:08 jonhoo

@hutch can you give it a try with the new SpinWait now in place?

jonhoo avatar Aug 29 '16 19:08 jonhoo

I think I have the high CPU usage problem with 1.1.0.

bbigras avatar Sep 01 '16 18:09 bbigras

@brunoqc Try 1.1.1?

jonhoo avatar Sep 01 '16 20:09 jonhoo

It seems better. The average is about 15% and it peaks at about 30%. With 1.1.0 I had peaks at about 70%.

Do you think the CPU usage could be even lower?

bbigras avatar Sep 01 '16 21:09 bbigras

Well, it depends what you're doing really. If you're sending and receiving in a tight loop, I'd expect it to hit 100%. If you're doing other compute-heavy work, I'd expect the overhead of bus to be marginal.

jonhoo avatar Sep 01 '16 21:09 jonhoo

Using bus 1.1.3 on a Raspberry Pi, each thread doing a recv() uses 15-35% CPU with the parking time set to 1 µs. If I set it to 1 ms, I have trouble measuring the usage because it's so low.

xobs avatar Mar 02 '17 06:03 xobs

@xobs I'm confused -- the default parking time in bus now is 100µs (thread::park_timeout(Duration::new(0, 100000)); 100000ns = 100µs). Your diff (https://github.com/xobs/bus/commit/c8bb287df261cf038ac431a71b416ae996325d1c) is using 100000000ns = 100000µs = 100ms.
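
A quick way to double-check the units (from_micros/from_millis are just the same values written explicitly):

    use std::time::Duration;

    fn main() {
        // bus's current default: 100,000 ns = 100 µs
        assert_eq!(Duration::new(0, 100_000), Duration::from_micros(100));
        // the value in the linked diff: 100,000,000 ns = 100 ms
        assert_eq!(Duration::new(0, 100_000_000), Duration::from_millis(100));
    }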

jonhoo avatar Mar 02 '17 16:03 jonhoo

@jonhoo You're right, I set it to 100ms instead of 1ms.

With the default 100µs value, a single instance of waiting in BusReader::recv() burns 16% according to "top -H". If I add more BusReaders, the program's usage goes up with each instance.

The accidental high latency in my own patch is actually fine for my purposes here, but I'll check what happens when I set it to 1 ms instead of 100 ms.

xobs avatar Mar 03 '17 02:03 xobs

With the default of 100 µs, the output of "top -H" looks like this with three listeners (each in a thread called B-Hook):

  688 pi        20   0   46568  13964   5780 S 16.2  1.4   0:04.48 B-Hook
  683 pi        20   0   46568  13964   5780 S 15.8  1.4   0:04.48 B-Hook
  685 pi        20   0   46568  13964   5780 S 15.5  1.4   0:04.49 B-Hook
  677 pi        20   0    8604   3380   2872 R  1.3  0.3   0:00.44 top

There are four cores on this system, so each thread gets its own core. The bus is idle, so they are burning that CPU simply polling for data.

If I correct my patch and set the poll time to 1 ms, the output looks like the following:

 1405 pi        20   0   40424   6864   5732 S  2.3  0.7   0:00.30 B-Hook
 1407 pi        20   0   40424   6864   5732 S  2.3  0.7   0:00.31 B-Hook
 1410 pi        20   0   40424   6864   5732 S  2.3  0.7   0:00.27 B-Hook
  677 pi        20   0    8604   3380   2872 R  1.0  0.3   0:03.42 top

Each instance of bus.recv() is still the busiest thing on the system, but it's much better now.

I'd like to be able to increase the value to 10 ms or even up to 100 ms, but 100 µs is definitely using too much power.

xobs avatar Mar 06 '17 04:03 xobs

So, the reason I'm hesitant to raise the timeout is that, as the comment above alludes to, the timeout dictates how long it can take a thread to notice that a free slot has opened up. To put it differently, with a timeout of 100ms, a reader or writer may, under particularly racy conditions, stay blocked on a read or write for up to 100ms after it should have unblocked and continued. That's a long time for performance-critical systems... I guess one thing to do here would be to make the timeout configurable per Bus, so you'd have some control over it (see the sketch below). @hutch @brunoqc thoughts about this?
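
Something along these lines is what I mean by configurable. To be clear, this is a hypothetical sketch and none of these names exist in bus today:

    use std::time::Duration;

    // Hypothetical builder: nothing here is part of bus's current API.
    // The point is just to let callers pick their own trade-off between
    // idle CPU/power use and worst-case wake-up latency.
    pub struct BusBuilder {
        len: usize,
        park_timeout: Duration,
    }

    impl BusBuilder {
        pub fn new(len: usize) -> Self {
            // Default to the current 100µs behaviour.
            BusBuilder { len, park_timeout: Duration::new(0, 100_000) }
        }

        pub fn park_timeout(mut self, d: Duration) -> Self {
            self.park_timeout = d;
            self
        }

        // build() would construct the Bus and thread `park_timeout` into the
        // internal wait loops in place of the hard-coded constant.
    }

A reader that cares more about power than latency could then ask for something like BusBuilder::new(10).park_timeout(Duration::from_millis(10)) and accept the extra wake-up delay.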

jonhoo avatar Mar 06 '17 04:03 jonhoo