cortex
Ring state will be inconsistent between memory and consul after a CAS error
The change in memory state is made before updating Consul, and no attempt is made to revert the former if the latter fails:
https://github.com/cortexproject/cortex/blob/a87c25fd994a2eec5ab5af0a920cc529b17c0030/pkg/ring/lifecycler.go#L710-L711
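One possible shape for a fix is to remember the previous in-memory state and restore it when the KV update fails. The following is only a minimal sketch of that idea, not the Cortex code: `State`, `KV`, `failingKV` and this `Lifecycler` are hypothetical stand-ins invented for illustration, and the real lifecycler has locking and ring descriptors that are omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical stand-in for the ring's ingester state.
type State int

const (
	Pending State = iota
	Joining
	Active
	Leaving
)

// KV is a hypothetical, minimal view of a KV store client.
// It is NOT the Cortex kv.Client interface.
type KV interface {
	CAS(key string, update func(current State) State) error
}

// failingKV simulates a store that rejects every CAS attempt.
type failingKV struct{}

func (failingKV) CAS(string, func(State) State) error {
	return errors.New("failed to CAS collectors/ring")
}

// Lifecycler holds the in-memory copy of the state (hypothetical type).
type Lifecycler struct {
	state State
	kv    KV
}

// ChangeState updates memory first, as described in the issue, but restores
// the previous value when the KV update fails, so memory and store agree.
func (l *Lifecycler) ChangeState(next State) error {
	prev := l.state
	l.state = next
	err := l.kv.CAS("collectors/ring", func(State) State { return next })
	if err != nil {
		l.state = prev // revert the in-memory change
		return fmt.Errorf("ChangeState: %w", err)
	}
	return nil
}

func main() {
	l := &Lifecycler{state: Joining, kv: failingKV{}}
	err := l.ChangeState(Active)
	fmt.Println("error:", err)
	fmt.Println("still JOINING in memory:", l.state == Joining)
}
```

Whether reverting is actually the right behaviour is a separate question; the sketch only shows that the two copies of the state can be kept in step.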
I noticed this because I got this log message:
level=warn ts=2020-09-09T19:59:32.324235593Z caller=grpc_logging.go:55 duration=15.010918473s method=/cortex.Ingester/TransferChunks err="Transfer: ChangeState: failed to CAS collectors/ring" msg="gRPC\n"
That's coming from here: https://github.com/cortexproject/cortex/blob/f27cef893d92e38395de0504f922231fc15bb7d8/pkg/ingester/transfer.go#L204
The defer in that function should then log "TransferChunks failed" and go back to PENDING state, but I don't see that log. That is explained by this line checking the in-memory state, which had already been changed to ACTIVE before the CAS failed: https://github.com/cortexproject/cortex/blob/f27cef893d92e38395de0504f922231fc15bb7d8/pkg/ingester/transfer.go#L185
(Also odd: metrics show it did go to ACTIVE state)
@pstibrany explained the last point: the next heartbeat will save the in-memory state to Consul.
So, maybe all we need is a better check in Ingester.transfer()?
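To make the sequence of events concrete, here is a small simulation of what seems to have happened: the CAS fails, the in-memory state is already ACTIVE, the deferred cleanup is skipped, and the next heartbeat writes ACTIVE to the store. This is a sketch under assumptions, not the Cortex implementation: `flakyKV`, `ingester`, `changeState`, `transfer` and `heartbeat` are hypothetical names, and the store is a plain in-process struct rather than Consul.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	Pending State = "PENDING"
	Joining State = "JOINING"
	Active  State = "ACTIVE"
)

// flakyKV is a hypothetical in-process stand-in for Consul. It rejects the
// first `failures` writes and accepts all later ones.
type flakyKV struct {
	stored   State
	failures int
}

func (kv *flakyKV) Put(s State) error {
	if kv.failures > 0 {
		kv.failures--
		return errors.New("failed to CAS collectors/ring")
	}
	kv.stored = s
	return nil
}

type ingester struct {
	state State // in-memory copy of the ring state
	kv    *flakyKV
	log   []string
}

// changeState mirrors the behaviour reported in the issue: memory is updated
// first and is not restored when the store update fails.
func (i *ingester) changeState(s State) error {
	i.state = s
	return i.kv.Put(s)
}

// transfer models only the final step of an incoming transfer. The deferred
// cleanup consults the in-memory state, as the issue describes.
func (i *ingester) transfer() error {
	defer func() {
		if i.state == Active {
			return // memory claims success, so the failure path is skipped
		}
		i.log = append(i.log, "TransferChunks failed")
		i.state = Pending
	}()
	return i.changeState(Active)
}

// heartbeat writes whatever is in memory to the store.
func (i *ingester) heartbeat() error { return i.kv.Put(i.state) }

func main() {
	ing := &ingester{state: Joining, kv: &flakyKV{stored: Joining, failures: 1}}

	err := ing.transfer()
	fmt.Println("transfer error:", err)
	fmt.Println("memory:", ing.state, "store:", ing.kv.stored, "log lines:", len(ing.log))

	_ = ing.heartbeat()
	fmt.Println("store after heartbeat:", ing.kv.stored)
}
```

In this model the caller sees an error, yet nothing is logged by the cleanup and the store ends up ACTIVE after the heartbeat, which matches the log message and the metrics reported above.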
What do you think should happen? How do you suggest modifying the check in Ingester.transfer()?
Changing state doesn't seem appropriate. As of now, the only possible transition from ACTIVE state is to LEAVING state, and I don't think that's the correct answer.
Other possibilities seem even worse (going back to PENDING/JOINING), because the transfer has already finished.