Review limit error metrics & error codes more generally

Open mattheworiordan opened this issue 7 years ago • 18 comments

See https://support.ably.io/solution/articles/3000082364-explanation-all-metrics-within-rate-limit-errors. I've created this in response to a customer hitting a max channel rate limit, see https://support.ably.io/solution/articles/3000075165-is-there-a-maximum-number-of-channels-per-connection-.

@paddybyers @SimonWoolf can one of you please:

  • [ ] Confirm that the metrics I have listed are correct. I had to guess the metric IDs.
  • [ ] Please review https://support.ably.io/solution/folders/3000012310 to see if we have articles for the error codes relating to common rate limit issues. If there are any missing, please post them here and we can ask @tomczoink to help.

┆Issue is synchronized with this Jira Task by Unito

mattheworiordan avatar Sep 22 '18 13:09 mattheworiordan

Confirm that the metrics I have listed are correct

  • channels.maxRate should be channel.maxRate (as within a single channel)

  • For acct-wide limits, currently the message you get when you try to publish doesn't include a metric, it just says eg Maximum account-wide instantaneous messages rate exceeded, that seemed friendlier. But I can easily make it Maximum account-wide instantaneous rate limit exceeded; metric = messages.maxRate or something, if you'd prefer people to have a metric they can look up.

  • reactor limits are missing: 'reactor.httpEvent.maxRate', 'reactor.amqp.maxRate' (assuming I make the change above)

  • "For example, you have hit an instantaneous per second rate limit, then we will only reject new published messages for that second, and the block will be removed in the following second interval": this isn't really correct. For things like per-channel and per-connection limits, the block interval is 6 seconds long. For account-wide rate limits there isn't a block interval; instead, suppression is done on a rolling probabilistic basis (see the 'instantaneous limits' section of https://support.ably.io/support/solutions/articles/3000079684-understanding-account-limit-notifications-within-email-alerts-or-in-your-dashboard ).

SimonWoolf avatar Sep 26 '18 08:09 SimonWoolf

Thanks @SimonWoolf

Yeh, I was being a bit lazy with my per second rate limit description, also trying to keep it simple to understand, but it's wrong so I will update and be less lazy

mattheworiordan avatar Sep 26 '18 10:09 mattheworiordan

@SimonWoolf re: my request for common error codes, just got this from a customer 30 minutes ago:

Thanks for contacting, we’ve took a look at the current situation. https://www.ably.io/accounts/1391/notifications states we’ve hit the apiRequests limit, not the messages limit. Also, the error code given is not documented: https://help.ably.io/error/40115

Would it be possible to please get that list of the most common error codes so that we can create appropriate solution articles?

mattheworiordan avatar Sep 26 '18 12:09 mattheworiordan

Would it be possible to please get that list of the most common error codes so that we can create appropriate solution articles?

what list of most common error codes?

SimonWoolf avatar Sep 26 '18 12:09 SimonWoolf

See task above in this issue:

Please review https://support.ably.io/solution/folders/3000012310 to see if we have articles for the error codes relating to common rate limit issues. If there are any missing, please post them here and we can ask @tomczoink to help.

mattheworiordan avatar Sep 26 '18 14:09 mattheworiordan

...So now I'm looking at this, they're not very consistent. Per-connection publish rate limit is 42911 (or 42921 if fatal), and acct-wide rate limits are also 42911, but most other rate limits (e.g. per-channel) are 42910.
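
For reference, the codes listed above can be written down as a small lookup. This is only a sketch of what the comment describes: the descriptions are paraphrased, the helper names are hypothetical, and it assumes 42921 is the only fatal variant (it is the only one mentioned here).

```python
# Rate limit codes as described in the comment above (descriptions paraphrased).
LEGACY_RATE_LIMIT_CODES = {
    42910: "rate limit exceeded (most limits, e.g. per-channel)",
    42911: "per-connection publish or account-wide rate limit exceeded (nonfatal)",
    42921: "per-connection publish rate limit exceeded (fatal)",
}


def describe_legacy_code(code: int) -> str:
    """Return a description for a known rate limit code, or a generic fallback."""
    return LEGACY_RATE_LIMIT_CODES.get(code, "unrecognised code")


def is_fatal(code: int) -> bool:
    # Assumption: 42921 is the only fatal code, since it is the only one named above.
    return code == 42921
```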

ISTM we should either just use the same code for all rate limits (well, two, one fatal and one nonfatal), or have a different code for every different rate limit. WDYT? @mattheworiordan @paddybyers

SimonWoolf avatar Sep 27 '18 13:09 SimonWoolf

It depends on whether or not we want to be able to link to different help docs in those cases. Or perhaps use a limited number of codes, and exploit the functionality of being able to construct an href based on information beyond just the code.

If we do go for a different code in each case, then it should be a whole new family starting at 10000 or something.

paddybyers avatar Sep 27 '18 17:09 paddybyers

Well, I think it's better to have unique error codes for logical groupings of limits so that we can write simple articles to address each problem. For example, how to address message.maxRate problems (fan-out) is very different from how to address tokenRequest hourly limits. Equally, having different codes for all limits could be hard.

Please can you suggest some natural groupings and we'll get @tomczoink to write up articles we can build on.

mattheworiordan avatar Oct 02 '18 01:10 mattheworiordan

OK, so let's have a completely new series of codes that get sent along with a 429 statusCode for different rate limits.

10000-11999 - rate limit errors
  • 10000-10999 - hard limit errors - ie the attempted operation was rejected or modified as a result of the rate limit
  • 11000-11999 - warning limit codes - the attempted operation was permitted, but the current usage is close to hitting a hard limit.

The individual codes will be consistent in the 10xxx and 11xxx series.

Then let's break it down by functional area:

100xx: generic/unspecified
  • 1001x: instantaneous limit
  • 1002x: hourly/monthly limit
  • 1003x: payload/request size limit

101xx: message limits
  • 1011x: instantaneous limit
  • 1012x: hourly/monthly limit
  • 1013x: size limit

102xx: connection limits
  • 1021x: instantaneous limit
  • 1023x: other limit (eg max attachments)

103xx: channel limits
  • 1031x: instantaneous limit
  • 1033x: other limit (eg max presence)

104xx: request (api request/ token request) limits
  • 1041x: instantaneous limit
  • 1042x: hourly/monthly limit
  • 1043x: size limit

105xx: reactor limits
  • 1051x: instantaneous limit
  • 1052x: hourly/monthly limit
  • 1053x: size limit
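
As a rough illustration of how a client or support tool might read a code under this proposal, here is a minimal Python sketch. It assumes the digits are used positionally, as the breakdown suggests: the leading 10/11 gives hard vs warning, the third digit gives the functional area and the fourth digit gives the kind of limit. The names `decode`, `AREAS` and `KINDS` are hypothetical, and none of this describes an implemented scheme.

```python
from typing import NamedTuple

# Third digit: functional area, following the proposed breakdown.
AREAS = {
    0: "generic/unspecified",
    1: "message",
    2: "connection",
    3: "channel",
    4: "request",
    5: "reactor",
}

# Fourth digit: kind of limit. Kind 3 is "size" for some areas and "other"
# (eg max attachments, max presence) for connections and channels.
KINDS = {
    1: "instantaneous limit",
    2: "hourly/monthly limit",
    3: "size or other limit",
}


class RateLimitCode(NamedTuple):
    severity: str  # "hard" (10xxx) or "warning" (11xxx)
    area: str
    kind: str


def decode(code: int) -> RateLimitCode:
    """Split a proposed 10000-11999 rate limit code into its parts."""
    if not 10000 <= code <= 11999:
        raise ValueError(f"{code} is outside the proposed 10000-11999 range")
    severity = "hard" if code < 11000 else "warning"
    area = AREAS.get((code // 100) % 10, "unassigned")
    kind = KINDS.get((code // 10) % 10, "unspecified")
    return RateLimitCode(severity, area, kind)
```

For example, `decode(10110)` gives a hard message instantaneous limit, and `decode(11420)` gives the matching warning for a request hourly/monthly limit.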

paddybyers avatar Oct 02 '18 09:10 paddybyers

I like this. Would it be possible to agree on keeping the error codes in one place (a bit like what we did with https://github.com/ably/ably-ruby/pull/171/commits/7df0916b3759f7ab9067cb2ccdfa982112bac970) so that once implemented, we could ask @tomczoink to create pages for each error code? I realise it's a lot of support articles, but they will be quite generic with slight variations and links to larger articles.

mattheworiordan avatar Oct 02 '18 10:10 mattheworiordan

Would it be possible to agree on keeping the error codes in one place

You mean like in ably-common as now, or something different?

paddybyers avatar Oct 02 '18 10:10 paddybyers

Would it be possible to agree on keeping the error codes in one place (a bit like what we did with ably/ably-ruby@7df0916)

Well yes. I was thinking about how realtime handles it, but of course that's not really all that important if we update ably-common. So ignore me if it's going to go into ably-common.

mattheworiordan avatar Oct 02 '18 10:10 mattheworiordan

One concern, anyone who is catching errors now that relate to rate limits, and backing off requests, will now have broken behaviours. We could email every customer, but it's not ideal...

mattheworiordan avatar Oct 02 '18 10:10 mattheworiordan

One concern, anyone who is catching errors now that relate to rate limits, and backing off requests, will now have broken behaviours. We could email every customer, but it's not ideal...

The statusCode will be unchanged, so if they really had written their error-handling code correctly, they'll react in some appropriate manner when they get that statusCode even if the specific code isn't recognised.

I don't think that issue should stop us from doing it properly.

paddybyers avatar Oct 02 '18 10:10 paddybyers

The statusCode will be unchanged, so if they really had written their error-handling code correctly, they'll react in some appropriate manner when they get that statusCode even if the specific code isn't recognised.

Ok, seems fair enough. What status code do we unilaterally use for rate limiting, and for what other operations will we see that error code? We should write a support article on backing off if you hit a rate limit with some sample code as an example. So I need that info first.

mattheworiordan avatar Oct 02 '18 11:10 mattheworiordan

429 - too many requests. We only return this in the case of rate-limiting.

We ought to check whether or not this can also be returned by the router or ELB in situations where a client might want to react differently. Even so, in these cases they will get either an empty code, or a generic 42900 code.
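
A minimal sketch of what backing off on the statusCode alone could look like, in Python. It is not Ably SDK code: `operation` is a hypothetical callable that performs one request and returns a `(status_code, code)` pair, where `code` may be `None` when the response carries no Ably error code. The retry policy (exponential backoff with full jitter) and its default values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, Optional[int]]  # (HTTP statusCode, error code or None)


def call_with_backoff(
    operation: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry `operation` while it reports statusCode 429.

    The decision is based only on the statusCode, so an unrecognised,
    empty or generic 42900 error code is still treated as a rate limit.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status_code, code = operation()
        if status_code != 429:
            return status_code, code
        if attempt < max_attempts - 1:
            # Exponential backoff, capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    # Still rate limited after the final attempt; hand the last response back.
    return status_code, code
```

The `sleep` parameter exists only so the delay can be stubbed out in tests. A caller would wrap its own publish or REST request in `operation`.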

paddybyers avatar Oct 02 '18 12:10 paddybyers

See https://ably-real-time.slack.com/archives/C030APSH3/p1541097699184400

A somewhat related discussion on the assumed meaning of error codes based on existing errors. Whoever is writing the support article has no way of knowing if an error code could have additional meanings, so will often write an article focussed on the problem they are aware of, unaware that the same error code may have a completely different meaning.

mattheworiordan avatar Nov 01 '18 18:11 mattheworiordan

@mattheworiordan happy for this to be closed now we've got a collection of error codes + articles for any popular ones?

tomczoink avatar Jan 21 '20 11:01 tomczoink