[core] fix session deadlock (#2290)
https://github.com/signalwire/freeswitch/issues/2290
@seven1240 Teacher Du and I explained this issue in detail.
Unit-tests failed: https://public-artifacts.signalwire.cloud/drone/signalwire/freeswitch/1571/artifacts.html
@andywolk cdeveop has a system that left 3-10 deadlocked channels per day, which could be fixed by this patch.
Reviewing the patch: the read_frame parts look fine, since the message will be processed on the next read. But I'm worried about the write parts. What if this happens on a sendonly channel that never reads, or in an app that writes but never reads (I can't think of such an app off the top of my head, though)? Then the message will never be processed.
I think it would be better to find the root cause of the deadlock rather than work around it.
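For context on what "session deadlock" means in these reports: stuck channels like this usually come from a classic lock-ordering problem, where the media (read/write) path and the signaling path take two locks in opposite order. The following is a generic pthread sketch of that pattern, not FreeSWITCH code; the mutex and thread names are hypothetical and only illustrate the failure mode being discussed.

```c
/* Generic lock-ordering deadlock sketch -- NOT FreeSWITCH code.
 * "media_thread" stands in for the read/write path and
 * "signaling_thread" for the path delivering a message/re-INVITE;
 * each takes the two hypothetical locks in the opposite order. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t session_mutex = PTHREAD_MUTEX_INITIALIZER; /* hypothetical */
static pthread_mutex_t signal_mutex  = PTHREAD_MUTEX_INITIALIZER; /* hypothetical */

static void *media_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&session_mutex);
    usleep(100000);                        /* widen the race window */
    pthread_mutex_lock(&signal_mutex);     /* blocks forever: the other thread holds it */
    puts("media thread: got both locks");  /* never reached once deadlocked */
    pthread_mutex_unlock(&signal_mutex);
    pthread_mutex_unlock(&session_mutex);
    return NULL;
}

static void *signaling_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&signal_mutex);
    usleep(100000);
    pthread_mutex_lock(&session_mutex);    /* blocks forever: the other thread holds it */
    puts("signaling thread: got both locks");
    pthread_mutex_unlock(&session_mutex);
    pthread_mutex_unlock(&signal_mutex);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, media_thread, NULL);
    pthread_create(&b, NULL, signaling_thread, NULL);
    pthread_join(a, NULL);                 /* hangs: the "channel" is now stuck */
    pthread_join(b, NULL);
    return 0;
}
```

A workaround that defers the signaling work for the reading thread to pick up breaks this cycle only on channels that actually read, which is exactly the concern raised above about sendonly or write-only paths.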
Hello, any news about this commit?
I have an installation where 30-40 inbound and outbound calls get deadlocked daily, and the call center agent connected to the deadlocked call also gets stuck until FreeSWITCH is restarted or the extension leg of the call is hung up.
#2387 and #2390 did not resolve the deadlocks or the stuck call center agents.
This fix has stopped all deadlocks and, consequently, the stuck call center agents.
Same problem on my instance: 1.10.10 and 1.10.11 are affected, 1.10.9 is fine. Random calls get stuck and only a core restart helps.
This fix has resolved all deadlocks and stuck calls in our environment.
#2387 and #2390 did not fix the issue.
I can also report that this patch fixes a similar issue with stuck channels and stuck call center agents in my installation.
Why not make it pass the Unit Tests so that it can be merged? Many people seem to be experiencing this problem.
@jakubkarolczyk can you check it?
Can I bump this thread? This is a real problem and a lot of people are hitting it...
Yesterday on Office Hours I asked about this and the related tickets that exhibit a similar problem. BKW said that it is a much deeper problem than the proposed patch solves and that they are refactoring the code to solve everything at once. So it is a bigger effort than we thought, with no ETR.
Before I open another ticket like this, I am curious whether the issue I am seeing could be related to this one. I am getting a lot of stuck calls in this scenario: an inbound call runs a bg_api command to originate another call, and both calls join a conference. The bg_api-originated call gets stuck if it receives a re-INVITE. In all of the stuck calls I have, the re-INVITE comes immediately after the answer; maybe 1-4 RTP packets have been transmitted. In some cases the re-INVITE suggested another codec, in others it changed the media IP, but in all cases both call legs get stuck, that is, they keep showing up when you run "show calls". The first (inbound) leg ends and uuid_exists returns false for it, while the outbound leg returns true. I can run uuid_kill on it, but it does not go away; after the kill it reports "no such channel", yet "show calls" still displays them.
This does seem like something that started, or got much worse, after upgrading from 1.10.7: we were getting stuck calls before, but at a low rate, say 10/week versus 100/day now. Getting a carrier to change some behaviour reduced the re-INVITEs and seemed to directly reduce the number of dead calls.
Let me know if I should submit a separate report for this; I can provide more details there. A rough sketch of the call flow is below.
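To make the reported scenario concrete, here is a minimal sketch that drives a similar call flow over FreeSWITCH's event socket using the ESL C client library, assuming the default event-socket address and password. The gateway name, destination number, and conference name are placeholders, so treat it as an approximation of the reporter's setup rather than a confirmed reproducer; the stuck calls reportedly also require the carrier to send a re-INVITE right after the B-leg answers, which this snippet cannot force.

```c
/* Rough reproduction sketch using FreeSWITCH's ESL client library (libesl).
 * Gateway, number, conference name, and credentials are placeholders.
 * Build roughly as: gcc repro.c -lesl (with the ESL headers/library installed). */
#include <stdio.h>
#include <esl.h>

int main(void)
{
    esl_handle_t handle = {{0}};

    /* Connect to the event socket (default 127.0.0.1:8021, password "ClueCon"). */
    if (esl_connect(&handle, "127.0.0.1", 8021, NULL, "ClueCon") != ESL_SUCCESS) {
        fprintf(stderr, "cannot connect to the FreeSWITCH event socket\n");
        return 1;
    }

    /* Mirror the dialplan's bg_api step: originate a B-leg through the carrier
     * gateway and drop it straight into the conference the A-leg joins.
     * The stuck calls reportedly occur when the carrier re-INVITEs this leg
     * immediately after it answers. */
    esl_send_recv(&handle,
        "bgapi originate sofia/gateway/carrier/15551234567 &conference(repro-conf)\n\n");

    if (handle.last_sr_reply[0]) {
        printf("bgapi reply: %s\n", handle.last_sr_reply); /* e.g. "+OK Job-UUID: ..." */
    }

    esl_disconnect(&handle);
    return 0;
}
```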
Is there any progress/update related to this issue?