Shards with no live consumers

Open relistan opened this issue 4 years ago • 5 comments

We're running Gokini with the Benthos integration you wrote (kinesis_balanced). Most of the time everything works fine. After a while, however, we keep ending up with shards that are not being consumed. It appears to be something in the GetLease() code. We're still trying to track down the issue, but I wanted to get it on the project's radar.

At first we thought it was a problem of more than one consumer process stepping on each other's toes, but it happens even with only a single consumer.

The symptom is shards in DynamoDB marked as Closed: false whose LeaseTimeout is quite a while in the past. See attached screenshot.

[Screenshot: Screen Shot 2020-08-19 at 9 54 43 AM]

relistan avatar Aug 19 '20 12:08 relistan

Can you please export the dynamodb table and send it to me?

If you restart benthos does it fix the problem or do the shards remain locked but not processing?

patrobinson avatar Aug 20 '20 06:08 patrobinson

@patrobinson Sorry, I can't export the table: this is a production system and we had to get it back up and running. Yes, restarting fixes it. This has happened a few times, so it's not a one-off. The only remaining columns were the sequence IDs, if I recall correctly, and I also took a screenshot of those at the time:

[Screenshot: Screen Shot 2020-08-19 at 9 54 49 AM]

relistan avatar Aug 20 '20 09:08 relistan

Hi @relistan

Have you observed any errors in the logs such as Error renewing lease?

I've got some time now, so I can try to replicate this myself.

patrobinson avatar Aug 30 '21 12:08 patrobinson

Reading through the code, I think this could happen if GetRecords returns an unrecoverable error.

https://github.com/patrobinson/gokini/blob/master/consumer.go#L364-L376

At that point, I think we should panic rather than leave ourselves in a bad state.

I've released a beta version with this fix (https://github.com/patrobinson/gokini/releases/tag/v0.2.0-beta) and I'll give it a whirl later this week.

patrobinson avatar Aug 30 '21 12:08 patrobinson

@patrobinson We stopped running it shortly after I opened this. We switched to the other Benthos Kinesis consumer and hard-pinned workers to shards. That's less than ideal, but it worked. So, unfortunately, I have no better info for you.

relistan avatar Aug 30 '21 13:08 relistan