SCYTALE: fanout operation canceled or timed out

Open krshna-dtdl opened this issue 1 year ago • 6 comments

What could be the possible reason for this error?

logger.Error("fanout operation canceled or timed out", zap.Int("statusCode", http.StatusGatewayTimeout), zap.Any("url", original.URL), zap.Error(fanoutCtx.Err()))

We are getting this intermittently at SCYTALE and are not sure of the reason.

https://github.com/xmidt-org/webpa-common/blob/cdd32a26087537693accfe44520e6d7995391b86/xhttp/fanout/handler.go#L328

krshna-dtdl avatar Apr 25 '24 06:04 krshna-dtdl

That error indicates that the operation to talk to a set of talarias has either been canceled by scytale, or a subset of talarias are taking too long to respond, leading to a timeout.

denopink avatar Apr 25 '24 11:04 denopink

So is it talaria to scytale delay or Router to talaria delay?

Also, how can we overcome this problem? @denopink

krshna-dtdl avatar Apr 25 '24 11:04 krshna-dtdl

So is it talaria to scytale delay or Router to talaria delay?

Talaria's response to the request from scytales is taking too long.

Also, how can we overcome this problem? @denopink

As you already know, there isn't a magic solution to this because it depends on your infra and how your xmidt cluster is set up.

Assuming your infra and xmidt cluster are set up correctly, one easy place to start may be increasing your talaria compute (VM size or total talaria count), in case your current talaria(s) are overwhelmed by the load from connected devices and incoming requests from scytale(s) and other services.

closing this ticket for now because xmidt cluster tuning is out of this ticket's scope.

feel free to open up another ticket if you encounter any new issues relating to this.

denopink avatar Apr 25 '24 12:04 denopink

Hello @denopink. We don't see any resource spike in talaria or scytale, and the message is being received by talaria. As a next step, we are checking where the response is being delayed, since that delay is what leads to this error.

Sachin4403 avatar Apr 29 '24 07:04 Sachin4403

@Sachin4403 continuing the conversation from #1046.

recap:

I suggested the following:

A timeout means talaria's response to the request from scytales is taking too long.

Have you confirmed talaria is receiving the request from scytale? If not, try running your talaria server with a DEBUG logging level.

If you're seeing logs indicating talaria is receiving the requests from scytale, then feel free to post the logs here.

But, if you don't see any logs indicating talaria is receiving the requests from scytale, then something in your stack is causing the timeout before it reaches talaria. At that point, I would suggest reaching out to your infra team/provider or ops running the testing environment for help figuring out the issue on your end.

You mentioned you were going to post some logs:

@denopink Yes the request went to talaria

At talaria we are getting: Could not process device request

context canceled

https://github.com/xmidt-org/talaria/blob/c122b182c6f9a27783908d1d48f106469982b5ce/WRPHandler.go#L77

Will post more logs soon

I just read your latest comment.

I would also recommend checking the logs of your device (whatever is connected to talaria) to see whether the device is taking too long to respond to talaria.

Feel free to post your talaria and test device logs here. I may be able to give you better insight using those logs.

denopink avatar Apr 29 '24 15:04 denopink

@Sachin4403 is this still an issue? Otherwise, I'm going to close this. 🙂

denopink avatar May 11 '24 20:05 denopink

Hello @denopink

Feel free to close this, as we can see that the WRP requests are reaching the device. As a next step, we are checking on the device side to find the root cause.

Sachin4403 avatar May 13 '24 06:05 Sachin4403

Sounds good, best of luck.

denopink avatar May 13 '24 10:05 denopink