doris
doris copied to clipboard
[Enhancement] better merge single_replica_load_rpc_service into brpc_service
Search before asking
- [X] I had searched in the issues and found no similar issues.
Description
better merge single_replica_load_rpc_service into brpc_service
refer to these to see what is single replica load
:
single replica load design
single replica implement
The code would be much clarified if there is only one rpc service. The reason why the original code uses two services is that rpc will time out under heavy workloads if use just one.
However, according to the design of brpc thread model, even if running in separated service thread pools, the underlying pthreads/cpu cores are shared by the two services. I think the reason why the two services solution works is only because it claims more worker pthreads, which can also be achieved by simply setting bigger --bthread_concurrency
gflag config.
So two services achieve nothing but messyness. We should merge them into one and identify&fix the root cause of the timeout problem.
Solution
step1. merge single_replica_load_rpc_service into brpc_service
step2. test the system with heavy load jobs to try to reproduce the rpc timeout problem
step3. make assumptions & proving
some of my guesses:
void PInternalServiceImpl::request_slave_tablet_pull_rowset(...) { ... worker_pool.offer([=]() { ... } ... }
if the worker pool is full (possible under heavy loads), the pthread will wait under std::condition_variable. If most of the pthreads in the pool are blocked, there would be few pthread to deal with rpc request and then timeout happens.
If it is true, we need to use bthread's condition variable instead of std::condition_variable. The former one will only yeild the underlying pthread to other rpc jobs, rather than blocking it.
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct