
Helm deployment throws 500 errors after AKS update

dhcode opened this issue 1 year ago · 3 comments

We are using the Helm chart v3.5.4 of eclipse/ditto for our deployment on an Azure Kubernetes Service (AKS) cluster. Each Ditto service we use (policies, things, thingsSearch, gateway) runs with 2 instances and a pod disruption budget of 1, so no service is ever completely unavailable.

Almost every time there is an AKS update and the nodes are recreated one after another, Ditto no longer answers requests correctly and returns status 500.

In the logs we see errors like this:

Received DittoRuntimeException during enforcement or forwarding to target actor, telling sender: DittoInternalErrorException [message='There was a rare case of an unexpected internal error.', errorCode=internalerror, httpStatus=HttpStatus [code=500, category=SERVER_ERROR], description='Please contact the service team or your administrator.'

To fix it, we scale all deployments of the Ditto services down and back up again; then it works again.

But I would expect Ditto to heal itself when pods are removed and added.

Is there a setting to improve this behavior or do others have that issue, too?
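For reference, our per-service disruption budget looks roughly like the following sketch (the names and label selectors here are placeholders, not the actual values generated by the eclipse/ditto Helm chart):

```yaml
# Sketch of the per-service disruption budget.
# Names and labels are placeholders; the real selectors come from the Helm chart.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ditto-gateway-pdb
spec:
  maxUnavailable: 1          # at most one of the two replicas may be evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: ditto-gateway
```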

dhcode avatar Jun 04 '24 09:06 dhcode

We thought it would get better after this issue was solved: https://github.com/eclipse-ditto/ditto/issues/1839 But it still occurs.

dhcode avatar Jun 04 '24 09:06 dhcode

@dhcode Hi, long time not seen :)

Are you sure that the k8s pods are shut down gracefully, so that the JVM receives a SIGTERM and can perform a proper cluster shutdown?

E.g. the things service logs (on service update when the "old" version is stopped):

INFO Initiated coordinated shutdown; gracefully shutting down ...
...
(30-60 seconds later)
INFO Graceful shutdown completed.
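Since the coordinated shutdown can take 30-60 seconds, the pod's termination grace period must be at least that long, and the SIGTERM must actually reach the JVM (e.g. not get swallowed by a wrapping shell). A hedged sketch of the relevant pod spec fields — the values and paths below are illustrative, not the Helm chart's defaults:

```yaml
# Illustrative pod spec fields for a graceful Ditto shutdown.
# Values and paths are examples only; check your chart's actual settings.
spec:
  terminationGracePeriodSeconds: 120   # comfortably above the 30-60 s coordinated shutdown
  containers:
    - name: things
      # Using `exec` (or running the JVM directly as PID 1) ensures SIGTERM
      # is delivered to the JVM instead of to an intermediate shell:
      command: ["/bin/sh", "-c", "exec java -jar /opt/ditto/service.jar"]
```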

On EKS I once had the issue that a k8s network policy was configured which immediately cut all traffic to/from pods once they were stopped - so the leaving cluster node never got the chance to tell the other cluster nodes that it was "gracefully leaving" the cluster. That led to situations similar to what you describe, and often to an inconsistent cluster state.
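If a NetworkPolicy is in play, it must keep allowing the cluster-internal remoting traffic between the Ditto pods, including while they terminate. A minimal sketch — the port 2551 and the labels are assumptions, so verify them against your actual deployment:

```yaml
# Sketch: allow cluster remoting traffic between Ditto pods.
# Port 2551 and the label selector are assumptions; verify against your setup.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ditto-remoting
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: ditto
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: ditto
      ports:
        - port: 2551
```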

I have no recent experience with AKS - but on AWS EKS our Ditto updates currently run completely smoothly (using the Helm chart), even with load on the cluster.

thjaeckle avatar Jun 04 '24 11:06 thjaeckle

Thanks for the hint. I checked the logs and I do find Initiated coordinated shutdown; gracefully shutting down ...

Sometimes, though, it does not show Graceful shutdown completed.

But the last recorded message, seen about 5 seconds after the shutdown was started, is: thing: [68] of the entities in shard [0] not stopped after [5 seconds]. Maybe the handOffStopMessage [org.eclipse.ditto.internal.utils.cluster.StopShardedActor] is not handled?

So it seems like the shutdown is not always graceful.

dhcode avatar Jun 04 '24 14:06 dhcode