[Bug]: Istio sidecar enabled service deployments fail with 500 unknown error

Open coder3101 opened this issue 2 months ago • 4 comments

I have a Kubernetes deployment named worker that binds two services and one workflow. The workflow calls out to those services as needed. I installed a single-node Restate deployment via the Helm chart and registered the worker's service endpoint over HTTP/2.

Restate runs in its own namespace, restate, which does not have Istio injection enabled. The worker, however, is deployed in the apps namespace with Istio injection enabled, and sometimes the workflows get stuck with a 500 Internal unknown failure.

{"timestamp":"2025-10-06T08:55:49.530229Z","level":"WARN","fields":{"message":"Invocation error, retrying in 123ms 821µs 962ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_1iHPRZSagNZt2Aq7eKhB3QndPKJ9mbMrux","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T08:55:49.889796Z","level":"WARN","fields":{"message":"Invocation error, retrying in 220ms 772µs 436ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_1iHPRZSagNZt2Aq7eKhB3QndPKJ9mbMrux","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T08:55:50.360251Z","level":"WARN","fields":{"message":"Invocation error, retrying in 493ms 180µs 211ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_1iHPRZSagNZt2Aq7eKhB3QndPKJ9mbMrux","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T08:55:51.094533Z","level":"WARN","fields":{"message":"Invocation error, retrying in 873ms 325µs 400ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_1iHPRZSagNZt2Aq7eKhB3QndPKJ9mbMrux","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}

On the service deployment side I don't see the request, nor do I find any logs in the Istio sidecar or anywhere else. Maybe I missed something?

After disabling sidecar injection on the worker, I no longer see those error lines, so I suspect a networking issue caused by Istio in this setup.

Restate: v1.5.0

coder3101 avatar Oct 06 '25 10:10 coder3101

@coder3101 are you able to invoke your services without the Istio sidecar? If yes, then it sounds like the Istio sidecar injection prevents the Restate server from reaching the service deployment. You could check whether you can reach the service deployment from the Restate server pod. The service deployment exposes a /health endpoint which you could query (e.g. via curl http://worker.<namespace>.svc.<cluster-domain>:9080/health --http2-prior-knowledge).

If it is indeed an Istio sidecar problem, then I would suggest consulting the Istio documentation or reaching out to the Istio community for information on how to set things up with Istio.
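Because the failures only show up some of the time, a single successful curl may not say much. A rough sketch of repeating the same check is below. It simply shells out to curl with the `--http2-prior-knowledge` flag from the command above, so it assumes it runs somewhere that has both Python and curl and the same network path as the Restate server pod (that is an assumption, the server image may not ship Python). The function name `probe_health` and its parameters are hypothetical.

```python
import subprocess
import time


def probe_health(url, attempts=20, interval_s=0.5, run=subprocess.run, sleep=time.sleep):
    """Query `url` with curl `attempts` times; return a list of (attempt, outcome) failures."""
    failures = []
    for attempt in range(1, attempts + 1):
        result = run(
            [
                "curl", "--silent", "--output", "/dev/null",
                "--write-out", "%{http_code}",   # print only the HTTP status code
                "--max-time", "5",               # give up on a hung request after 5s
                "--http2-prior-knowledge", url,  # cleartext HTTP/2, as in the check above
            ],
            capture_output=True,
            text=True,
        )
        status = result.stdout.strip()
        # A non-zero curl exit code (e.g. connection refused) or a non-200 status is a failure.
        if result.returncode != 0 or status != "200":
            failures.append((attempt, f"exit={result.returncode} http={status or 'n/a'}"))
        if attempt < attempts:
            sleep(interval_s)
    return failures


# Example call (URL is a placeholder, substitute your own service address):
# failures = probe_health("http://worker.<namespace>.svc.<cluster-domain>:9080/health")
# print(failures or "all probes returned 200")
```

Note that /health is a short request. If every probe passes while invocations still fail, that points away from basic reachability and toward something specific to the invocation requests themselves, which this sketch does not exercise.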

tillrohrmann avatar Oct 06 '25 10:10 tillrohrmann

It does work eventually. The connectivity issue seems intermittent, and in some cases I found the workflow was backing off for hours before eventually progressing. I will try these steps and let you know. Thanks.

coder3101 avatar Oct 06 '25 10:10 coder3101

Another thing to note is that between service invocations, some 5/10MB of data is transferred and journaled in Restate (but I don't think it has any effect, since this also works without the sidecar, apart from higher memory usage on Restate).

I also ran the health check from the Restate pod against the worker and could not get it to fail.

I have no name!@restate-0:/$ curl -v http://worker-service.apps.svc.cluster.local/health --http2-prior-knowledge
*   Trying 10.108.76.235:80...
* Connected to worker-service.apps.svc.cluster.local (10.108.76.235) port 80 (#0)
* h2h3 [:method: GET]
* h2h3 [:path: /health]
* h2h3 [:scheme: http]
* h2h3 [:authority: worker-service.apps.svc.cluster.local]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x55966edccd20)
> GET /health HTTP/2
> Host: worker-service.apps.svc.cluster.local
> user-agent: curl/7.88.1
> accept: */*
> 
< HTTP/2 200 
< x-restate-server: restate-sdk-rust/0.7.0
< date: Mon, 06 Oct 2025 11:17:44 GMT
< x-envoy-upstream-service-time: 10
< server: istio-envoy
< x-envoy-decorator-operation: worker-service.apps.svc.cluster.local:80/*
< 
* Connection #0 to host worker-service.apps.svc.cluster.local left intact

Meanwhile, I could still see the invocation failures in the Restate logs:

{"timestamp":"2025-10-06T11:05:57.764785Z","level":"WARN","fields":{"message":"Invocation error, retrying in 63ms 631µs 610ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_13MWoZ6VgupV6NtiEfch7Lx5mYqkStlNMR","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T11:05:58.035224Z","level":"WARN","fields":{"message":"Invocation error, retrying in 128ms 643µs 840ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_13MWoZ6VgupV6NtiEfch7Lx5mYqkStlNMR","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T11:05:58.361175Z","level":"WARN","fields":{"message":"Invocation error, retrying in 230ms 276µs 82ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_13MWoZ6VgupV6NtiEfch7Lx5mYqkStlNMR","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}
{"timestamp":"2025-10-06T11:05:58.854790Z","level":"WARN","fields":{"message":"Invocation error, retrying in 443ms 152µs 953ns.","error":"Got a (transient) error from the service while processing invocation.\n[500 Internal] unknown.\n","restate.error.code":"RT0007","restate.invocation.id":"inv_13MWoZ6VgupV6NtiEfch7Lx5mYqkStlNMR","restate.invocation.target":"BucketController/get_input"},"target":"restate_invoker_impl"}

And here is how it eventually passed after 2s:

Image

I will take this discussion to the Istio community as well, but for now disabling sidecar injection seems to fix the issue.

coder3101 avatar Oct 06 '25 11:10 coder3101

Are there any other server logs related to the warnings you are seeing? You could set the log level to debug (https://docs.restate.dev/server/monitoring/logging#log-filter) to get more information.

tillrohrmann avatar Oct 06 '25 11:10 tillrohrmann