Using photon::semaphore on AWS EC2, `signal` latency is high.
I'm using photon::semaphore on AWS EC2 and found that the signal function can take 300~600 ms. I don't know what is happening or why signal takes so long. I'm seeking assistance in understanding the cause and in debugging strategies 🙏.
- How many vCPUs are you using to synchronize with this semaphore? How many cores does your EC2 instance have?
- Does this issue appear on other platforms or physical machines?
A semaphore signal consists of taking a lock plus an eventfd write (if the waiter is on another vCPU); a rough illustration of the mechanism follows.
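For illustration only (this is not Photon's actual implementation), the cross-OS-thread part of that wake-up is conceptually a state update under a lock followed by an eventfd write that the sleeping side is blocked on:

```cpp
// Illustrative sketch of waking another OS thread via eventfd -- not Photon code.
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    int efd = eventfd(0, 0);          // counter-style eventfd
    std::mutex lock;
    uint64_t sem_count = 0;

    std::thread waiter([&] {
        uint64_t v;
        read(efd, &v, sizeof(v));     // blocks until the other thread writes
        std::lock_guard<std::mutex> g(lock);
        printf("woken, count=%llu\n", (unsigned long long)sem_count);
    });

    {   // "signal": update the count under the lock...
        std::lock_guard<std::mutex> g(lock);
        ++sem_count;
    }
    uint64_t one = 1;
    write(efd, &one, sizeof(one));    // ...then wake the waiter

    waiter.join();
    close(efd);
    return 0;
}
```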
@beef9999 My vCPU count is 12, but the pod (deployed in k8s) is actually limited to 4 CPUs.
I get the vCPU count via photon::get_vcpu_num().
I didn't hit the same issue on other platforms such as Tencent Cloud.
@beef9999 Why my vCPU count is 12: I have three lists of Executors, each of length 4. They are used as the server pool, the client pool, and an internal-jobs pool.
@beef9999 "How many vCPUs are you using to synchronize with this semaphore?" I think it is 1. I create a semaphore, hand it to a std::thread, and then wait on it outside, roughly like the following (the project runs on a photon pool):

```cpp
// Simplified: AsyncRet holds a photon::semaphore, and _thread_pool is our own
// std::thread-based pool (perform() enqueues a lambda onto a worker thread).
AsyncRetSharedPtr A() {
    auto ret = std::make_shared<AsyncRet>();
    _thread_pool->perform([ret] {
        do_something();
        ret->semaphore.signal(1);   // signal from the worker std::thread
    });
    return ret;
}

int main() {
    AsyncRetSharedPtr a = A();
    a->semaphore.wait(1);           // this wait sometimes returns 300~600 ms after signal
}
```
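For reference, here is a minimal self-contained sketch of how the signal-to-wakeup gap could be measured with the same pattern (a std::thread that becomes a photon vCPU signals, the main vCPU waits). This is illustrative only, not the project's code, and it assumes the photon::init/fini and photon::semaphore APIs of recent branches:

```cpp
#include <photon/photon.h>
#include <photon/thread/thread.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
    photon::init(photon::INIT_EVENT_EPOLL, photon::INIT_IO_NONE);

    photon::semaphore sem(0);
    std::atomic<int64_t> signal_ns{0};

    std::thread worker([&] {
        // Make this OS thread a photon vCPU, like the Executor pools do.
        photon::init(photon::INIT_EVENT_EPOLL, photon::INIT_IO_NONE);
        std::this_thread::sleep_for(std::chrono::milliseconds(10)); // stand-in for do_something()
        auto t = std::chrono::steady_clock::now().time_since_epoch();
        signal_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t).count();
        sem.signal(1);
        photon::fini();
    });

    sem.wait(1);
    auto t = std::chrono::steady_clock::now().time_since_epoch();
    int64_t wake_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t).count();
    printf("signal -> wakeup gap: %.3f ms\n", (wake_ns - signal_ns.load()) / 1e6);

    worker.join();
    photon::fini();
    return 0;
}
```

If the gap printed by an isolated repro like this stays small, the 300~600 ms is more likely caused by scheduling or contention inside the real process than by the semaphore itself.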
@lucaspeng12138 Which branch and event engine are you using? How many CPUs does the EC2 instance have?
@lihuiba I'm using v0.7.1, because we found it has better performance than later versions.
EC2 CPU: (screenshot of the instance's CPU info was attached here)
@lucaspeng12138 Could you try branch 0.8 and see whether the huge latency of 300~600ms exists or not? We had a major revision of semaphore since that branch, which may have fixed the issue.
BTW, could you talk more about the performance advantage of 0.7?
I think the best practice is to set the number of vCPUs to be the same as the number of physical cores (of your EC2 instance). At least do not exceed it.
Since there is a spinlock in Photon's lock implementation, I'm not sure whether there will be performance issues when multiple OS threads compete on a small set of physical cores.
@beef9999 He has 48 cpus in the EC2 instance.
@lucaspeng12138 Do you use epoll or io_uring?
> My vCPU count is 12, but the pod (deployed in k8s) is actually limited to 4 CPUs.

@lucaspeng12138 4 CPUs in total, or 4 × 12?
> We had a major revision of semaphore since that branch

And we also had another revision of the thread scheduler, which improved the latency of cross-vCPU interrupts (thread wake-ups). So trying 0.8 is highly recommended.
> Do you use epoll or io_uring?

@lihuiba I'm using epoll.
@beef9999 The pod's CPU limit is 4; the vCPU count is 12 in total per pod.
> I think the best practice is to set the number of vCPUs to be the same as the number of physical cores (of your EC2 instance). At least do not exceed it.
> Since there is a spinlock in Photon's lock implementation, I'm not sure whether there will be performance issues when multiple OS threads compete on a small set of physical cores.

@beef9999 In our case, we have three pools, and it's difficult to use just one. I may try this.
> Could you try branch 0.8 and see whether the huge latency of 300~600ms exists or not? We had a major revision of semaphore since that branch, which may have fixed the issue.
> BTW, could you talk more about the performance advantage of 0.7?

@lihuiba The semaphore optimization is still in main.
@lucaspeng12138 You can try the main branch. And is there any way you can reduce your vCPU count?
Yes, I can decrease the vCPU count by reducing the thread pool size. I will try both of your suggestions: cherry-picking the semaphore optimization from main and reducing the vCPU count.
@lucaspeng12138 Maybe you can try Photon's WorkPool instead of a custom ThreadPool. A WorkPool is able to manage all vCPUs; it uses an MPMC queue to dispatch tasks to them.
So Photon's WorkPool aims to use all physical CPUs and has much better performance than an std::thread pool? And do I need to know how many physical CPUs there are and call photon::init that many times, right?
WorkPool calls photon::init for every vCPU it creates. You can think of it as a dispatcher of tasks (functions/lambdas) to vCPUs.
See this demo https://github.com/alibaba/PhotonLibOS/blob/main/thread/test/perf_workpool.cpp
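For a rough idea of the shape of the API, here is a sketch only; the constructor arguments and call() below are my reading of the docs, so please check the demo above and thread/workerpool.h for the exact signatures:

```cpp
#include <photon/photon.h>
#include <photon/thread/workerpool.h>
#include <cstdio>

int main() {
    photon::init(photon::INIT_EVENT_EPOLL, photon::INIT_IO_NONE);

    // 4 worker OS threads; the pool initializes each of them as a photon vCPU.
    // The last argument selects how worker photon threads are allocated per task.
    photon::WorkPool pool(4, photon::INIT_EVENT_EPOLL, photon::INIT_IO_NONE, 0);

    // Dispatch a task to one of the worker vCPUs and wait for it to complete.
    pool.call([] {
        printf("hello from a WorkPool vCPU\n");
    });

    photon::fini();
    return 0;
}
```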
> The pod's CPU limit is 4; the vCPU count is 12 in total per pod.

In case there are more vCPUs (12) than physical cores available (4), it is possible that spinlocks spend much more CPU time than usual. This is not specific to Photon, but inherent to spinlocks. I'm not sure whether this is the reason, but you could try giving the pod enough (12) CPU cores for testing and see whether it helps.
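One quick (and admittedly rough) sanity check for the over-subscription theory is to compare photon::get_vcpu_num() inside the service with the CPUs the process is actually allowed to run on. Note that Kubernetes CPU limits are usually enforced via CFS quota rather than a cpuset, so the affinity mask may still show the node's full core count; treat this as a hint, not proof:

```cpp
// How many CPUs can this process actually run on? Compare the result with
// photon::get_vcpu_num() reported inside the real service.
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        printf("CPUs in affinity mask: %d\n", CPU_COUNT(&set));
    return 0;
}
```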
I increased the pod's CPU request and limit to more than 12, but the issue still exists. I'm too busy to keep digging into the root cause right now; I will try your advice to patch in photon::semaphore from main and report the test results later.
> I increased the pod's CPU request and limit to more than 12, but the issue still exists.

So this rules out spinlocks as the cause.
> I'm too busy to keep digging into the root cause right now

Can you give a minimal example that demonstrates the issue?