
Bthread: Lots of CPU cost on `do_futex` when running lightweight tasks

Open mapleFU opened this issue 3 years ago • 14 comments

Describe the bug

Hi, we are using brpc and bthread as our RPC framework and runtime. Our tasks are lightweight: a typical request is handled by reading some data in memory and usually finishes in ~10ms.

On a 96-core CPU, we configured 106 bthread workers. When the workload consists of many lightweight requests (about 300K requests per second), pprof shows that do_futex takes about 25% of CPU runtime. bthread_worker_usage is only 15-20, bthread_signal_second is also high, and bthread_count is about 1600. Some of the pprof output is listed below:

bthread::TaskGroup::end_sched 18.09%
- steal_task 5.61%
- sched_to 12.07%
  -  ready_to_run  11.85%
    - do_futex 11.16% (call futex_wake)
      - _raw_spin_unlock_irqrestore 10.42%  

and:

bthread::TaskGroup::run_main_task 14.4%
- TaskGroup::wait_task 4.87%
  - steal_task 4.63%
- futex_wait 8.42% 

According to link, the time is attributed to _raw_spin_unlock_irqrestore because interrupts are off. Still, scheduling spends far more time here than we expected.

We guess that we create too many lightweight bthreads, and scheduling them notifies lots of TaskGroup workers. After changing the bthread worker count to 60 and restarting the server, the scheduling cost dropped a lot. But restarting all machines is troublesome for us, and we think that configuring the worker count to match hardware_concurrency should be suitable for all kinds of workloads.

How can we handle this problem? I found that bthread can only add workers dynamically (add_worker) but cannot remove spare workers, which would solve this problem easily. Using a bthread pool may help reduce the signaling and bthread scheduling, but writing a thread pool on top of fibers is really dirty work.

To Reproduce

Expected behavior

bthread can reduce the number of workers, or spends less time in do_futex when there are many lightweight tasks.

Versions

- OS: Linux 5.4
- Compiler: g++ 830
- brpc: 0.9.6
- protobuf: We use thrift 0.9

Additional context/screenshots

mapleFU avatar Mar 24 '22 08:03 mapleFU

Update: After hacking some brpc code and restarting the machine, we downsample the signals in bthread, and performance is better.

mapleFU avatar Mar 24 '22 11:03 mapleFU

> Update: After hacking some brpc code and restarting the machine, we downsample the signals in bthread, and performance is better.

What hacks did you make?

TousakaRin avatar Mar 28 '22 07:03 TousakaRin

> > Update: After hacking some brpc code and restarting the machine, we downsample the signals in bthread, and performance is better.
>
> What hacks did you make?

We added a gflag, FLAG_no_signal_sample, which can add NOSIGNAL to the bthread's flags.
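
As background for why NOSIGNAL reduces time in do_futex, here is a toy model of deferred wake-ups. It is only a sketch under assumptions: `ToyRunQueue` and all of its members are hypothetical names, it counts wake-ups instead of calling futex, and it is not brpc's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model only: counts wake-ups instead of calling futex.
// All names are hypothetical and do not come from brpc.
class ToyRunQueue {
public:
    // Enqueue a task id. With no_signal == true the wake-up is deferred.
    void push(int task_id, bool no_signal) {
        tasks_.push_back(task_id);
        if (no_signal) {
            ++pending_no_signal_;
            // Invariant of this toy: deferred tasks are a subset of queued tasks.
            assert(pending_no_signal_ <= tasks_.size());
        } else {
            wake();  // one wake-up also covers previously deferred tasks
        }
    }

    // Issue a single wake-up for all deferred tasks, if there are any.
    void flush() {
        if (pending_no_signal_ > 0) {
            wake();
        }
    }

    std::size_t wakeups() const { return wakeups_; }
    std::size_t pending_no_signal() const { return pending_no_signal_; }

private:
    void wake() {
        ++wakeups_;  // stands in for an expensive futex_wake call
        pending_no_signal_ = 0;
    }

    std::deque<int> tasks_;
    std::size_t pending_no_signal_ = 0;
    std::size_t wakeups_ = 0;
};
```

In this model, pushing 100 tasks with `no_signal == true` followed by one `flush()` costs one wake-up, while pushing them with `no_signal == false` costs 100. The price is that deferred tasks wait until something flushes, which is where the extra latency comes from.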

mapleFU avatar Mar 29 '22 07:03 mapleFU

Why is bthread_flush() placed in src/bthread/unstable.h?

Under what conditions is NOSIGNAL better? Is there any experimental data?

renguoqing avatar Mar 29 '22 08:03 renguoqing

> Why is bthread_flush() placed in src/bthread/unstable.h?
>
> Under what conditions is NOSIGNAL better? Is there any experimental data?

@renguoqing

I'm not a brpc committer, so I don't know why bthread_flush is marked unstable.

Our program calls bthread_flush under some conditions, so we think it is safe to use NOSIGNAL here. It may increase latency a little, but it works well in our program.

We made no_signal a gflag so that we can sample it, like:

if (FLAG_no_signal_sample != 1 && fastrand() < FLAG_no_signal_sample) {
   // mark nosignal
}

The default value of FLAG_no_signal_sample is 1, so in most cases it has no effect. We will adjust it until less time is spent in do_futex while latency stays low.
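
The sampling decision can be sketched in isolation as follows. This is a minimal sketch under assumptions, not the actual patch: `SignalSampler` is a hypothetical helper, `std::mt19937` stands in for `fastrand()`, and the rate is assumed to be the fraction of task starts that still signal (1.0 disables sampling, 0.01 signals about 1% of the time). The comparison in the real change may be written differently.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper, not part of brpc. `signal_sample_rate` is assumed to
// be the fraction of task starts that should still wake a worker:
// 1.0 = always signal, 0.01 = signal about 1% of the time.
class SignalSampler {
public:
    explicit SignalSampler(double signal_sample_rate, unsigned seed = 42)
        : rate_(signal_sample_rate), rng_(seed), dist_(0.0, 1.0) {
        assert(rate_ >= 0.0 && rate_ <= 1.0);
    }

    // Returns true when this task start should be marked "no signal".
    bool should_skip_signal() {
        if (rate_ >= 1.0) {
            return false;  // sampling disabled: never skip the signal
        }
        return dist_(rng_) >= rate_;  // the draw lies in [0, 1)
    }

private:
    double rate_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> dist_;
};
```

A caller would check `should_skip_signal()` once per task start and set the no-signal flag when it returns true. Tasks started that way still need a later flush, or a later signaled start, before an idle worker notices them.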

mapleFU avatar Mar 29 '22 09:03 mapleFU

The experimental data depends on the workload. We run a very large number of lightweight tasks here, so we can downsample to 1%. If there are lots of CPU-bound tasks, we think 1 is fine.

mapleFU avatar Mar 29 '22 09:03 mapleFU

@mapleFU This is quite a good practice. Can you contribute a PR? That might help more people.

wwbmmm avatar Mar 29 '22 09:03 wwbmmm

> > Why is bthread_flush() placed in src/bthread/unstable.h? Under what conditions is NOSIGNAL better? Is there any experimental data?
>
> @renguoqing
>
> I'm not a brpc committer, so I don't know why bthread_flush is marked unstable.
>
> Our program calls bthread_flush under some conditions, so we think it is safe to use NOSIGNAL here. It may increase latency a little, but it works well in our program.
>
> We made no_signal a gflag so that we can sample it, like:
>
> if (FLAG_no_signal_sample != 1 && fastrand() < FLAG_no_signal_sample) {
>    // mark nosignal
> }
>
> The default value of FLAG_no_signal_sample is 1, so in most cases it has no effect. We will adjust it until less time is spent in do_futex while latency stays low.

Thank you, @mapleFU.

I also tried changing SIGNAL to NOSIGNAL in our program a few days ago, but performance became worse.

Our traffic is not as high as yours, so it may not be suitable for us.

renguoqing avatar Mar 29 '22 09:03 renguoqing

> @mapleFU This is quite a good practice. Can you contribute a PR? That might help more people.

Glad to contribute to brpc, but the code may be ugly. I'll try to submit it this weekend.

mapleFU avatar Mar 29 '22 11:03 mapleFU

I ran into the same case and did almost the same thing as @mapleFU described above. I guess we can do better.

  1. If there are already enough bthread workers to saturate the CPU, just skip the signal calls. More workers do not help.

  2. Maintain a handcrafted waiter linked list in brpc itself instead of relying on futex, although we still need futex to wake up pthreads. The benefit is that we can wake only the bthread workers that have just received a remote task (i.e., from a non-bthread-worker thread). No steal() is needed.

  3. NUMA awareness
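
The first idea could look roughly like the sketch below. Everything in it is an assumption for illustration: `WakeGate` and its methods are hypothetical names, not brpc APIs, and the running-worker count is treated as a hint rather than an exact value.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch, not brpc code: skip the wake-up when enough workers
// are already running to occupy every core.
class WakeGate {
public:
    explicit WakeGate(int num_cores) : num_cores_(num_cores) {
        assert(num_cores > 0);
    }

    // Called by a worker when it picks up a task.
    void worker_started_running() {
        running_.fetch_add(1, std::memory_order_relaxed);
    }

    // Called by a worker just before it parks itself.
    void worker_went_idle() {
        running_.fetch_sub(1, std::memory_order_relaxed);
    }

    // True when a newly queued task should wake a parked worker.
    bool should_wake() const {
        return running_.load(std::memory_order_relaxed) < num_cores_;
    }

private:
    const int num_cores_;
    std::atomic<int> running_{0};
};
```

Because the counter is read with relaxed ordering, a stale value can occasionally suppress a wake-up that was needed, so a real scheduler would still need a fallback such as a periodic flush or a re-check before a worker parks.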

JimChengLin avatar Apr 01 '22 09:04 JimChengLin

I have been thinking about these improvements for a while and may try to implement them in the next half year. Any help is appreciated!

JimChengLin avatar Apr 01 '22 09:04 JimChengLin

> > @mapleFU This is quite a good practice. Can you contribute a PR? That might help more people.
>
> Glad to contribute to brpc, but the code may be ugly. I'll try to submit it this weekend.

PR link, please? Thanks~

LostTong avatar Jun 09 '22 09:06 LostTong