avro: don't cache temporary errors
Currently, if we get an error from the registry, we cache it forever. We could instead inspect the error and cache the result only if it is not marked as temporary.
We use Avro with the registry in several Go microservices in production. We have been hit by this bug several times: the same error message is logged every time a pod tries to consume a message from a topic. The pod never recovers, because the error is cached forever. I have tried to fix this problem in this PR:
https://github.com/heetch/avro/pull/127
The idea is to cache errors from fetching the schema for at most one minute. Caching them at all keeps the registry from being overloaded (which could be the cause of the problem in the first place), while the one-minute limit lets the pod recover once the underlying problem (network issues, a temporarily missing schema, etc.) has been fixed.
I have kept the permanent caching of schema decoding errors, since I can't think of any case where they could be fixed without either upgrading the Avro decoder or creating a new schema.
We are currently running this PR in all our Avro Go services.