StackExchange.Redis Proposal: Improved Timeout (and maybe others) Exception Help Page(s)

Right now we have an error page linked from timeout errors, pointing to https://stackexchange.github.io/StackExchange.Redis/Timeouts

There is one issue with this: the link breaks if the repository ever moves, but we can solve that with a custom domain for GitHub Pages.

But there's another issue: it could be way better. Something we did at Stack Overflow with our logging infrastructure was to capture the error details into JSON for any error that occurred (via StackExchange.Exceptional). This isn't specific to that library though: the Exception.Data collection on the root Exception object in .NET is populated with this data, and every key is prefixed with Redis-. Let's take an example error message:

Timeout performing UNLINK (1000ms), next: GET MYKEY, inst: 13, qu: 0, qs: 39, aw: False, rs: ReadAsync, ws: Idle, in: 255, in-pipe: 0, out-pipe: 24068, serverEndpoint: myserver.local:18477, mc: 1/1/0, mgr: 10 of 10 available, clientName: DUUB-SRV-04, IOCP: (Busy=265,Free=735,Min=30,Max=1000), WORKER: (Busy=10,Free=32757,Min=30,Max=32767), v: 2.2.50.36290 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

The timeout doc explains what compact labels like inst mean (they are kept short to limit the exception message length). The data keys in the exception are more descriptive (you can see these in code today): for example, inst is "OpsSinceLastHeartbeat" and qu is "Queue-Awaiting-Write". Most users don't get these names, and even if they did, the names alone wouldn't be helpful to nearly as many people as they could be.
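To make the label-to-name idea concrete, here is a rough sketch (in Python purely for brevity; the library itself is .NET) of pulling the compact labels out of a message like the one above. Only the two mappings mentioned in this post are included; `LABEL_NAMES`, `parse_timeout_message` and `EXAMPLE` are hypothetical names for illustration, not anything the library exposes.

```python
import re

# Only these two label -> name pairs come from the discussion above;
# any further entries would have to be taken from the library source.
LABEL_NAMES = {
    "inst": "OpsSinceLastHeartbeat",
    "qu": "Queue-Awaiting-Write",
}

# The example message from this post (help-link note included).
EXAMPLE = (
    "Timeout performing UNLINK (1000ms), next: GET MYKEY, inst: 13, qu: 0, "
    "qs: 39, aw: False, rs: ReadAsync, ws: Idle, in: 255, in-pipe: 0, "
    "out-pipe: 24068, serverEndpoint: myserver.local:18477, mc: 1/1/0, "
    "mgr: 10 of 10 available, clientName: DUUB-SRV-04, "
    "IOCP: (Busy=265,Free=735,Min=30,Max=1000), "
    "WORKER: (Busy=10,Free=32757,Min=30,Max=32767), v: 2.2.50.36290 "
    "(Please take a look at this article for some common client-side issues "
    "that can cause timeouts: "
    "https://stackexchange.github.io/StackExchange.Redis/Timeouts)"
)


def parse_timeout_message(message: str) -> dict:
    """Split the 'label: value' pairs out of a timeout message."""
    # Drop the trailing "(Please take a look ...)" note, if present.
    body = message.split(" (Please take a look", 1)[0]
    stats = {}
    # Split on commas that are not inside parentheses, so that
    # "IOCP: (Busy=265,Free=735,...)" stays in one piece.
    for part in re.split(r",\s*(?![^()]*\))", body):
        label, sep, value = part.partition(": ")
        if not sep:
            continue  # the leading "Timeout performing UNLINK (1000ms)"
        # Use the descriptive name where we know it, else keep the label.
        stats[LABEL_NAMES.get(label, label)] = value
    return stats


stats = parse_timeout_message(EXAMPLE)
```

Values are kept as strings here; a real version would read the Redis- prefixed entries from Exception.Data instead of re-parsing the message text.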

What if we did better?

My idea is to put a richer link in that error message. Something like https://somedomain.tld/errors/timeout#{"OpsSinceLastHeartbeat":13,"Queue-Awaiting-Write":0...}, where we wouldn't even send the data to the server: it would live entirely in the URL hash, which the browser keeps to itself (your endpoint need not be included, though knowing if it's a cloud provider and such could be useful). If we had this, JavaScript in the page could make use of it (all via GitHub Pages in this same repo, so we don't drift).
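A minimal sketch of that round trip, assuming the stats are serialized as compact JSON and percent-encoded into the fragment. It is written in Python only to keep it short; the producing side would be the .NET library and the consuming side would be JavaScript reading `location.hash`. `BASE_URL` is the placeholder domain from above, and `build_help_url` / `read_stats` are hypothetical helper names.

```python
import json
from urllib.parse import quote, unquote, urlsplit

# Placeholder domain from the proposal; not a real endpoint.
BASE_URL = "https://somedomain.tld/errors/timeout"


def build_help_url(stats: dict) -> str:
    """Append the stats as compact JSON in the URL fragment."""
    payload = json.dumps(stats, separators=(",", ":"))
    # Everything after '#' stays in the browser; it is not part of the
    # HTTP request, so the server never sees these values.
    return BASE_URL + "#" + quote(payload, safe="")


def read_stats(url: str) -> dict:
    """Stand-in for what the page-side script would do with the hash."""
    fragment = urlsplit(url).fragment
    return json.loads(unquote(fragment)) if fragment else {}


url = build_help_url({"OpsSinceLastHeartbeat": 13, "Queue-Awaiting-Write": 0})
decoded = read_stats(url)
```

Percent-encoding keeps the link intact when it is copied out of a log file or pasted into a chat client, at the cost of being a bit less readable than raw JSON.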

Let's take the exception above. For each stat, we could give a description and a red/green hint of "that's good" or "here's why that might be a problem...". For example, in that message we see a few things: we have a lot of data in the outbound pipe, and we could point that out to the user. Even more importantly, look at those 265 busy IOCP threads. That's a lot going on! The app server is likely overloaded, and that'll easily lead to timeouts (not handling things in time). We could show an entire section about this: why it happens, common things to check for (e.g. sync over async), pointers to some resources on using async, etc.

Overall: we have a big timeout page today that isn't read as much as we'd like, and it's also a lot of info. What if we distilled it down and showed you the most relevant pieces based on the bits of data we already have available in the exception message? From my view, that's what we're already doing when a user files an issue...but it's typically @mgravell or I or a few select others parsing that data and advising. A lot of it could be automated, which would both get users answers quicker (often without even filing an issue) and in turn let us focus on better improvements for everyone. We'd be scaling the data -> advice pipeline for so many common cases we see.

What do others think about such a URL with JSON (or base64 if needed) replacing the existing timeout URL in error messages, to give you more tailored advice?
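As a rough illustration of the red/green idea, the page logic could be a small table of rules over the decoded stats. This sketch is in Python for brevity (the real page would be JavaScript), and both thresholds are assumptions chosen to match the reasoning above: more busy threads than the pool minimum, and any data sitting in the outbound pipe. `parse_pool` and `hints` are hypothetical names.

```python
import re


def parse_pool(value: str) -> dict:
    """Turn '(Busy=265,Free=735,Min=30,Max=1000)' into a dict of ints."""
    return {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", value)}


def hints(stats: dict) -> list:
    """Return (stat, level, text) tuples. Thresholds are illustrative guesses."""
    results = []
    for pool in ("IOCP", "WORKER"):
        counts = parse_pool(stats.get(pool, ""))
        if not counts:
            continue  # stat missing from the message
        busy, minimum = counts.get("Busy", 0), counts.get("Min", 0)
        if busy > minimum:  # assumed rule: busy above the pool minimum
            results.append((pool, "red",
                            f"{busy} busy threads vs. a minimum of {minimum}: "
                            "the app server may be overloaded "
                            "(check for sync over async)."))
        else:
            results.append((pool, "green", "Thread pool looks fine."))
    out_pipe = int(stats.get("out-pipe", 0))
    if out_pipe > 0:  # assumed rule: anything still queued is worth flagging
        results.append(("out-pipe", "red",
                        f"{out_pipe} waiting in the outbound pipe."))
    else:
        results.append(("out-pipe", "green", "Nothing waiting to be sent."))
    return results


example = {
    "IOCP": "(Busy=265,Free=735,Min=30,Max=1000)",
    "WORKER": "(Busy=10,Free=32757,Min=30,Max=32767)",
    "out-pipe": "24068",
}
levels = {name: level for name, level, _ in hints(example)}
```

For the example message this flags IOCP and out-pipe and leaves WORKER green, which lines up with the manual read above. Each rule could then link to the matching section of the existing timeout page.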

Oct 21 '21 15:10 NickCraver

The idea seems good to me.

I can see someone not following the URL for fear of giving out personal data or the like, so maybe include both links, something like:

For more info on this problem:

general: https://somedomain.tld/errors/timeout

your: https://somedomain.tld/errors/timeout#{"OpsSinceLastHeartbeat":13,"Queue-Awaiting-Write":0...}

or something like that, so they can choose.

The base64 part makes sense and would probably be cleaner from a data transfer perspective. But again, even if we are only talking about devs, there is a huge variety among them, and for some an encoded blob may make it less obvious what data they would be sending. Still, maybe something to consider.
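To show the trade-off, here is a small sketch of the base64 variant, assuming URL-safe base64 over the same compact JSON (Python just for illustration; `encode_fragment` and `decode_fragment` are hypothetical names). The result is safe to paste anywhere, but nobody can tell at a glance what is in it.

```python
import base64
import json


def encode_fragment(stats: dict) -> str:
    """Compact JSON -> URL-safe base64, with '=' padding stripped."""
    raw = json.dumps(stats, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_fragment(fragment: str) -> dict:
    """Reverse of encode_fragment; restores the padding first."""
    padded = fragment + "=" * (-len(fragment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


stats = {"OpsSinceLastHeartbeat": 13, "Queue-Awaiting-Write": 0}
fragment = encode_fragment(stats)
roundtrip = decode_fragment(fragment)
```

The page could decode the fragment and show the plain values back to the user, which might ease some of the "what am I sending?" concern.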

Oct 21 '21 17:10 jodydonetti

PS: I just read your "You're literally running the library code in your app, are you not trusting it at that point? It's also not passing sensitive data in either case" tweet. On one hand it makes sense; on the other, people may think of precedents of unwanted "diagnostics collection" that happened in the past, even if in totally unrelated scenarios.

Btw, for me personally, I would click on that link for sure.

Oct 21 '21 17:10 jodydonetti

Totally agree. The timeouts page has gotten kinda big and overwhelming for some users (myself included). Would be great if it could point us to the problematic areas and suggested solutions automatically.

Oct 21 '21 20:10 mendoncaftw

But you understand why you got the timeout?

Nov 02 '21 07:11 Ohad29

But you understand why you got the timeout?

NO :) It is still not clear what is timing out. Is it a problem with my VM and my code, OR a Redis instance performance problem? There is a lot of information in the timeout message, and it is useless.

Nov 19 '21 12:11 dariusdev