livekit icon indicating copy to clipboard operation
livekit copied to clipboard

LiveKit runs into timeouts after a Redis restart

Open gysel opened this issue 1 year ago • 3 comments

Describe the bug

Following the update from Livekit v1.5.1 to v1.5.2, we're seeing LiveKit timeouts after a Redis restart. We run a three node Redis Sentinel setup in GKE where sometime the Redis nodes restart and trigger a failover. Sentinel quickly elects a new leader but somehow LiveKit is broken after this. Our non-production environments run on Spot VMs and hence we have frequent Redis failovers causing the problem to surface on a regular basis.

Server

  • Version: v1.5.2
  • Environment: stage (often), production (a few times), both on GKE
  • Redis Sentinel (we can share the deployment YAMLs if needed)

To Reproduce Steps to reproduce the behavior:

  1. Restart Redis
  2. Do NOT restart LiveKit
  3. Start a room and add participants

Expected behavior LiveKit correctly connects to the new Redis master following a failover and continues to work.

Logs

After the Redis failover I can immediately see this error in the LiveKit logs:

redis: 2024/02/08 02:29:42 pubsub.go:168: redis: discarding bad PubSub connection: EOF

A few hours later when someone tries to use a room, it looks like this:

{
      "roomID": "RM_XunAaDQd6iJd",
      "ts": 1707379375.325662,
      "room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
      "level": "info",
      "selectedNodeID": "ND_DMp9k8o6H5Hc",
      "msg": "selected node for room",
      "caller": "service/roomallocator.go:138",
      "logger": "livekit"
}
{
    "insertId": "5ofmrytnt5twkxxt",
    "jsonPayload": {
      "participant": "",
      "error": "request timed out",
      "caller": "routing/localrouter.go:107",
      "ts": 1707379378.3283205,
      "connID": "CO_Qnr3ddALsLFw",
      "logger": "livekit",
      "room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
      "msg": "could not handle new participant",
      "stacktrace": "github.com/livekit/livekit-server/pkg/routing.(*LocalRouter).StartParticipantSignalWithNodeID\n\t/workspace/pkg/routing/localrouter.go:107\ngithub.com/livekit/livekit-server/pkg/routing.(*RedisRouter).StartParticipantSignal\n\t/workspace/pkg/routing/redisrouter.go:167\ngithub.com/livekit/livekit-server/pkg/service.(*RoomService).CreateRoom\n\t/workspace/pkg/service/roomservice.go:96\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoomProtobuf.func2\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1477\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoomProtobuf\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1478\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).serveCreateRoom\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1339\ngithub.com/livekit/protocol/livekit.(*roomServiceServer).ServeHTTP\n\t/go/pkg/mod/github.com/livekit/[email protected]/livekit/livekit_room.twirp.go:1290\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2514\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2136\ngithub.com/livekit/livekit-server/pkg/service.(*APIKeyAuthMiddleware).ServeHTTP\n\t/workspace/pkg/service/auth.go:99\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/rs/cors.(*Cors).ServeHTTP\n\t/go/pkg/mod/github.com/rs/[email protected]/cors.go:316\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\t/go/pkg/mod/github.com/urfave/negroni/[email protected]/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2938\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:2009",
      "level": "error"
}
{
    "insertId": "4r6lf1ntp06ervkq",
    "jsonPayload": {
      "service": "RoomService",
      "method": "CreateRoom",
      "msg": "API RoomService.CreateRoom",
      "logger": "livekit.api",
      "caller": "service/twirp.go:124",
      "level": "info",
      "request": "name:\"2d85f134-4b18-4ad3-a2c1-2b56262d5761\"",
      "code": "deadline_exceeded",
      "status": "408",
      "room": "2d85f134-4b18-4ad3-a2c1-2b56262d5761",
      "ts": 1707379378.3284426,
      "error": "request timed out",
      "duration": 3.006614456
}

gysel avatar Feb 08 '24 10:02 gysel

This seems to be the same issue as #2440.

BTW: You can ping me in the LiveKit Community Slack. I'm happy to share more logs and setup details.

gysel avatar Feb 08 '24 10:02 gysel

Thanks for the report. I think this is fixed by https://github.com/livekit/psrpc/pull/80. this will be released in 1.5.3

davidzhao avatar Feb 10 '24 07:02 davidzhao

Hi @davidzhao thanks,
I have tested the fix, actually it seems to resolve our issue.
Since it's going to be released in 1.5.3
Do you have an idea when the version will be available ?

Regards

kannonski avatar Feb 12 '24 16:02 kannonski

I'm actually facing the same problem.

@davidzhao, could you specify which versions of Ingress and Egress address this issue? It seems that these services are also unable to reconnect after a Redis restart.

cruizba avatar Feb 22 '24 19:02 cruizba

@kannonski @cruizba the fixes have been released in 1.5.3. Do you want to give it a try and see if that resolves your issues?

davidzhao avatar Feb 23 '24 06:02 davidzhao

We (@kannonski and me) can no longer reproduce the issue after the update to 1.5.3.

gysel avatar Feb 27 '24 09:02 gysel

Thanks for the confirmation @gysel !

davidzhao avatar Feb 27 '24 19:02 davidzhao