
[core] fix session deadlock (#2290)

Open cdevelop opened this issue 1 year ago • 14 comments

https://github.com/signalwire/freeswitch/issues/2290

cdevelop avatar Nov 05 '23 02:11 cdevelop

@seven1240 Teacher Du and I explained this issue in detail.

cdevelop avatar Nov 05 '23 02:11 cdevelop

Unit-tests failed: https://public-artifacts.signalwire.cloud/drone/signalwire/freeswitch/1571/artifacts.html

signalwire-ci[bot] avatar Nov 05 '23 02:11 signalwire-ci[bot]

@andywolk cdevelop has a system that leaves 3-10 deadlocked channels per day, which could be fixed by this patch.

Reviewing the patch, it looks like the read_frame parts are fine, as the message will be processed on the next read. But I'm worried about the write parts. What if this happens on a sendonly channel that never reads, or in an app that only writes and never reads (I can't think of such an app off the top of my head, though)? In that case the message would never be processed.

I think it would be best to find the root cause of the deadlock instead of relying on the workaround.
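
To make the concern about the write path concrete, here is a toy Python model. It is not FreeSWITCH code: `ToyChannel` and its methods are hypothetical names, and it assumes that a deferred message is queued and only drained when the channel performs a read. Under that assumption, a channel that only ever writes never processes the message.

```python
from collections import deque


class ToyChannel:
    """Toy model only: messages are deferred and drained on read."""

    def __init__(self):
        self.pending = deque()  # deferred messages waiting to be processed
        self.processed = []     # messages that were actually handled

    def receive_message(self, msg):
        # Assumption: the message is queued instead of handled immediately.
        self.pending.append(msg)

    def read_frame(self):
        # Assumption: only the read path drains the deferred queue.
        while self.pending:
            self.processed.append(self.pending.popleft())

    def write_frame(self):
        # In this model the write path never touches the queue.
        pass


# A channel that reads eventually processes the deferred message.
reader = ToyChannel()
reader.receive_message("m1")
reader.write_frame()
reader.read_frame()

# A write-only (sendonly-style) channel leaves the message pending forever.
sendonly = ToyChannel()
sendonly.receive_message("m1")
for _ in range(3):
    sendonly.write_frame()
```

In this sketch `reader.processed` ends up holding the message, while `sendonly.pending` still holds it after any number of writes.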

seven1240 avatar Nov 05 '23 03:11 seven1240

Hello, any news about this commit?

bferreirq avatar Feb 06 '24 10:02 bferreirq

I have an installation where 30-40 inbound and outbound calls are getting deadlocked on a daily basis, and the call center agent connected to the deadlocked call also gets stuck, until a restart or the extension leg of the call is hung up.

#2387 and #2390 did not resolve the deadlocks or the stuck call center agents.

This fix has stopped all deadlocks, and consequently there are no more stuck call center agents.

KerryRJ avatar Feb 28 '24 08:02 KerryRJ

> I have an installation where 30-40 inbound and outbound calls are getting deadlocked on a daily basis, and the call center agent connected to the deadlocked call also gets stuck, until a restart or the extension leg of the call is hung up.
>
> #2387 and #2390 did not resolve the deadlocks or the stuck call center agents.
>
> This fix has stopped all deadlocks, and consequently there are no more stuck call center agents.

Same problem on my instance. 1.10.11 and 1.10.10 are affected; 1.10.9 is OK. Random calls go dead and only a core restart helps.

televoicepl avatar Mar 12 '24 21:03 televoicepl

This fix has fixed all deadlocks and stuck calls in our environment.

#2387 and #2390 did not fix the issue.

shaunjstokes avatar Apr 04 '24 06:04 shaunjstokes

I can also report that this patch fixes a similar issue with stuck channels and stuck call center agents in my installation.

bfroemel avatar Apr 16 '24 08:04 bfroemel

Why not make it pass the Unit Tests so that it can be merged? Many people seem to be experiencing this problem.

boteman avatar Apr 28 '24 20:04 boteman

@jakubkarolczyk can you check it?

televoicepl avatar Apr 29 '24 22:04 televoicepl

Can I bump this thread? This is a real problem and a lot of people are affected by it...

gregoriusus avatar May 07 '24 20:05 gregoriusus

Yesterday on Office Hours I asked about this and the related tickets that exhibit a similar problem. BKW said that it is a much deeper problem than the proposed patch solves, and that they are refactoring the code to solve everything at once. So it is a bigger effort than we thought, and there is no ETR.

boteman avatar May 08 '24 17:05 boteman

Before I open another ticket like this, I am curious whether the issue I am seeing could be related to this one. I am getting a lot of stuck calls in the following scenario: an inbound call runs a bg_api command to originate another call, and both calls then join a conference. The originated call gets stuck if it receives a re-invite. In all of the stuck calls I have, the re-invite comes immediately after the answer, when maybe 1-4 RTP packets have been transmitted. In some cases the re-invite suggested another codec, and in others it changed the media IP, but in all cases both call legs get stuck.

Specifically, they still show up when you execute "show calls". What I have found is that the first (inbound) leg ends and uuid_exists returns false for it, while the outbound leg returns true. I can run uuid_kill on the outbound leg, but it does not go away; after the kill it results in "no such channel", yet "show calls" still displays them.

This does seem like something that started, or got worse, after upgrading from 1.10.7, where we were getting stuck calls but at a low rate: say 10 per week versus 100 per day now. Getting a carrier to change some behaviour reduced the re-invites and does seem to have directly reduced the number of dead calls.

Let me know if I should submit a different report for this; I can provide more details there.
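
The cross-check described above (calls still listed by "show calls" that uuid_exists no longer recognises) could be scripted roughly as follows. This is only a sketch: it assumes the listed UUIDs have already been collected and that there is some way to query uuid_exists for each one. `find_ghost_calls` and the `uuid_exists` callable are hypothetical names, not FreeSWITCH APIs, and the UUIDs are made-up sample data.

```python
def find_ghost_calls(listed_uuids, uuid_exists):
    """Return UUIDs that are still listed but no longer exist as channels.

    listed_uuids: UUIDs taken from a call listing (hypothetical input).
    uuid_exists:  callable returning True if the channel still exists
                  (hypothetical stand-in for a real uuid_exists query).
    """
    return [u for u in listed_uuids if not uuid_exists(u)]


# Illustrative data only: three listed legs, one of which is still live.
listed = ["leg-a", "leg-b", "leg-c"]
live = {"leg-b"}
ghosts = find_ghost_calls(listed, lambda u: u in live)
```

Here `ghosts` contains the legs that are listed but not live, which is the symptom to count per day when comparing versions.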

zooptwopointone avatar May 21 '24 05:05 zooptwopointone

Is there any progress/update related to this issue?

hhadzem avatar Jul 23 '24 09:07 hhadzem