livekit
livekit copied to clipboard
LiveKit runs into timeouts after a Redis restart
Describe the bug
Following the update from Livekit v1.5.1 to v1.5.2, we're seeing LiveKit timeouts after a Redis restart. We run a three node Redis Sentinel setup in GKE where sometime the Redis nodes restart and trigger a failover. Sentinel quickly elects a new leader but somehow LiveKit is broken after this. Our non-production environments run on Spot VMs and hence we have frequent Redis failovers causing the problem to surface on a regular basis.
Server
- Version:
v1.5.2 - Environment: stage (often), production (a few times), both on GKE
- Redis Sentinel (we can share the deployment YAMLs if needed)
To Reproduce Steps to reproduce the behavior:
- Restart Redis
- Do NOT restart LiveKit
- Start a room and add participants
Expected behavior LiveKit correctly connects to the new Redis master following a failover and continues to work.
Logs
After the Redis failover I can immediately see this error in the LiveKit logs:
redis: 2024/02/08 02:29:42 pubsub.go:168: redis: discarding bad PubSub connection: EOF
A few hours later when someone tries to use a room, it looks like this:
{
"roomID": "RM_XunAaDQd6iJd",
"ts": 1707379375.325662,
"room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
"level": "info",
"selectedNodeID": "ND_DMp9k8o6H5Hc",
"msg": "selected node for room",
"caller": "service/roomallocator.go:138",
"logger": "livekit"
}
{
"insertId": "5ofmrytnt5twkxxt",
"jsonPayload": {
"participant": "",
"error": "request timed out",
"caller": "routing/localrouter.go:107",
"ts": 1707379378.3283205,
"connID": "CO_Qnr3ddALsLFw",
"logger": "livekit",
"room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
"msg": "could not handle new participant",
"stacktrace": "github.com/livekit/livekit-server/pkg/routing.(*LocalRouter).StartParticipantSignalWithNodeID\n\t/workspace/pkg/routing/localrouter.go:107\ngithub.com/livekit/livekit-server/pkg/routing.(*RedisRouter).StartParticipantSignal\n\t/workspace/pkg/routing/redisrouter.go:167\ngithub.com/livekit/livekit-server/pkg/service.(*RoomService).CreateRoom\n\t/workspace/pkg/service/roomservice.go:96\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoomProtobuf.func2\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1477\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoomProtobuf\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1478\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoom\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1339\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).ServeHTTP\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1290\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2514\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2136\ngithub.com/livekit/livekit-server/pkg/service.(*APIKeyAuthMiddleware).ServeHTTP\n\t/workspace/pkg/service/auth.go:99\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/rs/cors.(*Cors).ServeHTTP\n\t/go/pkg/mod/github.com/rs/[email protected]/cors.go:316\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2938\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:2009",
"level": "error"
}
{
"insertId": "4r6lf1ntp06ervkq",
"jsonPayload": {
"service": "RoomService",
"method": "CreateRoom",
"msg": "API RoomService.CreateRoom",
"logger": "livekit.api",
"caller": "service/twirp.go:124",
"level": "info",
"request": "name:\"2d85f134-4b18-4ad3-a2c1-2b56262d5761\"",
"code": "deadline_exceeded",
"status": "408",
"room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
"ts": 1707379378.3284426,
"error": "request timed out",
"duration": 3.006614456
}
This seems to be the same issue as #2440.
BTW: You can ping me in the LiveKit Community Slack. I'm happy to share more logs and setup details.
Thanks for the report. I think this is fixed by https://github.com/livekit/psrpc/pull/80. this will be released in 1.5.3
Hi @davidzhao thanks,
I have tested the fix, actually it seems to resolve our issue.
Since it's going to be released in 1.5.3
Do you have an idea when the version will be available ?
Regards
I'm actually facing the same problem.
@davidzhao, could you specify which versions of Ingress and Egress address this issue? It seems that these services are also unable to reconnect after a Redis restart.
@kannonski @cruizba the fixes have been released in 1.5.3. Do you want to give it a try and see if that resolves your issues?
We (@kannonski and me) can no longer reproduce the issue after the update to 1.5.3.
Thanks for the confirmation @gysel !