ex_webrtc
Unstable Behaviour of Fly.io Deployment
I have a fly.io deployment of an audio streaming service largely adapted from live_ex_webrtc.
There's a heisenbug on both the publisher and player ends where sometimes they can start/join streams and sometimes that fails.
I can confirm from IO.inspects in my forks of ex_webrtc and ex_ice that when it fails, valid candidate pairs are absent in the checklist of the ICEAgents.
It's still unclear why this happens on fly.io but not locally, and how to fix it. Any help will be greatly appreciated.
For context, I'm running on two shared-cpu-1x@1024MB instances. Changing the machine specs didn't resolve the issue. Upgrading to a dedicated IPv4 address also didn't resolve it.
I'm happy to provide any other information that can help.
Hi @kingdomcoding, is your app deployed right now? Could you provide the URL? Also, could you deploy your app with debug logs enabled and capture them (especially those from ex_ice) when a connection fails? One more question: are you connecting via a VPN or some other non-standard network?
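For anyone following along, one way to get those debug logs is to lower the Logger level in the release config. This is only a sketch: it assumes ex_ice and ex_webrtc emit their diagnostics through Elixir's standard Logger at the `:debug` level, and it reuses the `FLY_APP_NAME` check so that only the Fly.io deployment becomes verbose.

```elixir
# config/runtime.exs (sketch, not taken from the app in this thread)
import Config

if System.get_env("FLY_APP_NAME") do
  # Assumption: the ICE-related messages are logged via Logger.debug,
  # so they only show up once the global level is :debug.
  config :logger, level: :debug
end
```

The resulting output can then be captured with `fly logs` while reproducing a failed connection.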
Hey @mickel8
The app is live, with the publisher page and the listener page and a sample of the logs
The behaviour with a VPN is unchanged
@kingdomcoding I assume that you use https://hexdocs.pm/ex_webrtc/ExWebRTC.ICE.FlyIpFilter.html?
From logs, it looks like we can't get a response to any connectivity check.
However, I cannot reproduce your error. In my case, sending and receiving from Chromium always works, and Firefox can always hear what Chromium sends. The only problem I noticed is that Firefox cannot transmit: the connection is established and packets are sent and received, but I cannot hear anything. However, I observe the same behaviour when using Google Meet, so this might be something on my side or on Firefox's side.
What browser do you use?
@mickel8 Yes, I use the FlyIpFilter in my runtime config, as per:
if System.get_env("FLY_APP_NAME") do
  config :solving_media, ice_ip_filter: &ExWebRTC.ICE.FlyIpFilter.ip_filter/1
end
I've tested with Chrome on Windows and Ubuntu.
I've not tried different server locations. I wonder if the ping time from my location could be a factor?
PS: I'm happy to create a separate public repo and fly deployment if that helps
Edit: I just tested Chrome on a Mac. It works fine there, but Windows and Ubuntu still fail.
@kingdomcoding so the problem only happens on Windows and Ubuntu right?
> PS: I'm happy to create a separate public repo and fly deployment if that helps
Let's do that. I would like to solve that problem as working Fly.io is one of our priorities
Regarding ping: I don't think that's a factor.
> @kingdomcoding so the problem only happens on Windows and Ubuntu right?
It appears so. But to be clear, it works sometimes and fails sometimes, which is the issue. In addition, I've noticed the same inconsistency on mobile browsers.
> PS: I'm happy to create a separate public repo and fly deployment if that helps
> Let's do that. I would like to solve that problem as working Fly.io is one of our priorities
Just realized LiveBroadcaster exhibits the same issue. Repo. Live
Hi, @mickel8
I wonder if there's good news on this issue, or if there's any way I can contribute to solving this
Or if there's an alternative in the elixir-webrtc/membrane world that I can explore to make progress on our app
@kingdomcoding sorry for the lack of response :( We have some priority work to do and I haven't had time to debug this further :/ The only way forward is to analyze the debug logs and try to catch the problem. Unfortunately, that requires deep knowledge of the ICE protocol :/
You can also try to deploy your app on a bare machine according to: https://hexdocs.pm/ex_webrtc/bare.html
Thanks! We'll pursue both and see what's possible
@kingdomcoding thanks! Please, keep us updated :) We will get back to this issue ASAP
Hey, @mickel8
After comparing more logs, I now know that in the successful case a conn check response is received, which isn't received in the unsuccessful case. I also know that the conn check response came from the IP and port of a remote ExICE.Candidate of type srflx.
My hunch is that this inconsistent behaviour might come from the ICE server the app depends on. I suspect that the app's selective connection might be caused by the (selective?) availability of the Google ICE server. Please confirm if I'm thinking in the right direction.
My attempt at resolving this was to add stun:stun.cloudflare.com:3478 to the default stun:stun.l.google.com:19302. It worked for a while, broke again, then worked again.
Are there reliable STUN servers you can recommend to eliminate this and check if this is the root of the problem?
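For reference, a sketch of how more than one STUN server could be supplied. It builds the list of `%{urls: ...}` maps that `ExWebRTC.PeerConnection.start_link/1` accepts under its `ice_servers` option. The module name `IceServersSketch` and the `ICE_SERVER_URLS` environment variable are made up for illustration; they are not part of ex_webrtc or of the app discussed here.

```elixir
defmodule IceServersSketch do
  @moduledoc "Illustrative only: hypothetical helper, not part of ex_webrtc."

  @default_urls ["stun:stun.l.google.com:19302", "stun:stun.cloudflare.com:3478"]

  # Reads a comma-separated list of STUN/TURN URLs from a (hypothetical) env var.
  def from_env(var \\ "ICE_SERVER_URLS"), do: var |> System.get_env() |> parse()

  # Falls back to the two servers mentioned in this thread when nothing is set.
  def parse(nil), do: to_servers(@default_urls)

  def parse(csv) when is_binary(csv) do
    urls =
      csv
      |> String.split(",")
      |> Enum.map(&String.trim/1)
      |> Enum.reject(&(&1 == ""))

    case urls do
      [] -> to_servers(@default_urls)
      _ -> to_servers(urls)
    end
  end

  defp to_servers(urls), do: Enum.map(urls, fn url -> %{urls: url} end)
end

# Possible usage (requires the ex_webrtc dependency):
# {:ok, pc} = ExWebRTC.PeerConnection.start_link(ice_servers: IceServersSketch.from_env())
```

Keeping the list in configuration makes it easy to swap servers between deployments while testing, without claiming that the STUN server is actually the cause.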
PS: I've now seen the connection fail on macOS too
> My hunch is that this inconsistent behaviour might come from the ICE server the app depends on. I suspect that the app's selective connection might be caused by the (selective?) availability of the Google ICE server. Please confirm if I'm thinking in the right direction.
I don't think so. In every case, Google's STUN server responds correctly and we are able to gather an srflx candidate on the server side :(
I don't see much difference between these cases except for the IP addresses that are used. Might it be that your hosts are connected to different networks during tests? In particular, can the failing host be behind a symmetric NAT? You can check it here: https://www.checkmynat.com/
> I don't think so. In every case, Google's STUN server responds correctly and we are able to gather an srflx candidate on the server side :(
I'm still building my mental model for how WebRTC and ex_webrtc work. At the moment, my understanding is that the server sends a conn check to the stun server and receives a response which is successful or not. In my logs, handle_conn_check_success_response only gets called in the successful case.
Am I missing something in my understanding?
> In particular, can the failing host be behind a symmetric NAT? You can check it here: https://www.checkmynat.com/
The failing hosts are behind a Port Restricted Cone
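As a rough illustration of what such a NAT check looks at: if the same local socket is seen under the same public IP and port by several different STUN servers, the mapping is endpoint-independent (the "cone" family, which includes Port Restricted Cone); if the public port changes per destination, the mapping is endpoint-dependent, which is what is usually called symmetric. The sketch below only classifies addresses that were already collected. `NatMappingSketch` is a hypothetical name, and the classification says nothing about the NAT's filtering behaviour.

```elixir
defmodule NatMappingSketch do
  @moduledoc "Illustrative only: classifies NAT mapping from observed srflx addresses."

  # `mapped` is a list of {ip_tuple, port} pairs, each reported by a different
  # STUN server for the SAME local socket.
  def classify(mapped) when length(mapped) < 2, do: :unknown

  def classify(mapped) do
    case Enum.uniq(mapped) do
      # Every server saw the same public address: cone-style mapping.
      [_single] -> :endpoint_independent
      # The public address differs per destination: symmetric-style mapping.
      _ -> :endpoint_dependent
    end
  end
end
```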
First, we send a STUN binding request to the STUN server to learn our public IP address. This operation always succeeds (look for "new srflx candidate" in the logs). We then send this public IP to the other side. Once we also receive some IP addresses from the other side, we start performing conn checks, which sometimes fail.
What is the NAT type of the successful hosts? Also, could you check whether our demos work for you?
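To make the first step above more concrete, here is a minimal sketch of a STUN binding request and the parsing of its response, following the RFC 5389 wire format. It is not how ex_ice implements gathering: `StunSketch` and its functions are hypothetical names, only IPv4 XOR-MAPPED-ADDRESS is handled, and there are no retransmissions.

```elixir
defmodule StunSketch do
  @moduledoc "Illustrative only: hypothetical module, not part of ex_ice or ex_webrtc."

  @magic_cookie 0x2112A442

  # Binding request: type 0x0001, empty body, magic cookie, 12-byte transaction id.
  def binding_request(<<tid::binary-size(12)>>) do
    <<0x0001::16, 0::16, @magic_cookie::32, tid::binary>>
  end

  # Accepts only a binding success response (0x0101) carrying our transaction id.
  def parse_response(
        <<0x0101::16, len::16, @magic_cookie::32, tid::binary-size(12),
          attrs::binary-size(len)>>,
        expected_tid
      )
      when tid == expected_tid do
    find_xor_mapped(attrs)
  end

  def parse_response(_packet, _expected_tid), do: {:error, :unexpected_message}

  # XOR-MAPPED-ADDRESS (0x0020), IPv4 family (0x01): un-XOR port and address.
  defp find_xor_mapped(<<0x0020::16, 8::16, _reserved::8, 0x01::8, xport::16, xaddr::32, _::binary>>) do
    port = Bitwise.bxor(xport, 0x2112)
    <<a, b, c, d>> = <<Bitwise.bxor(xaddr, @magic_cookie)::32>>
    {:ok, {{a, b, c, d}, port}}
  end

  # Skip any other attribute; values are padded to a multiple of 4 bytes.
  defp find_xor_mapped(<<_type::16, len::16, rest::binary>>) do
    padded = len + rem(4 - rem(len, 4), 4)

    case rest do
      <<_value::binary-size(padded), tail::binary>> -> find_xor_mapped(tail)
      _ -> {:error, :malformed_attribute}
    end
  end

  defp find_xor_mapped(_), do: {:error, :no_xor_mapped_address}

  # One-shot query over UDP; returns {:ok, {ip, port}} or {:error, reason}.
  def query(host, port \\ 3478, timeout \\ 2_000) do
    tid = :crypto.strong_rand_bytes(12)
    {:ok, socket} = :gen_udp.open(0, [:binary, active: false])

    try do
      with :ok <- :gen_udp.send(socket, String.to_charlist(host), port, binding_request(tid)),
           {:ok, {_ip, _port, packet}} <- :gen_udp.recv(socket, 0, timeout) do
        parse_response(packet, tid)
      end
    after
      :gen_udp.close(socket)
    end
  end
end
```

Calling `StunSketch.query("stun.l.google.com", 19302)` from a machine should return the public address that would be advertised as the srflx candidate. The conn checks that follow use the same STUN message framing, but they are sent directly between the two peers' candidates rather than to the STUN server.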
https://elixir-webrtc.org/#demos