Review limit error metrics & error codes more generally
See https://support.ably.io/solution/articles/3000082364-explanation-all-metrics-within-rate-limit-errors. I've created this in response to a customer hitting a max channel rate limit, see https://support.ably.io/solution/articles/3000075165-is-there-a-maximum-number-of-channels-per-connection-.
@paddybyers @SimonWoolf can one of you please:
- [ ] Confirm that the metrics I have listed are correct. I had to guess the metric IDs.
- [ ] Please review https://support.ably.io/solution/folders/3000012310 to see if we have articles for the error codes relating to common rate limit issues. If there are any missing, please post them here and we can ask @tomczoink to help.
> Confirm that the metrics I have listed are correct

- `channels.maxRate` should be `channel.maxRate` (as within a single channel)
- For acct-wide limits, currently the message you get when you try to publish doesn't include a metric, it just says eg `Maximum account-wide instantaneous messages rate exceeded`, that seemed friendlier. But I can easily make it `Maximum account-wide instantaneous rate limit exceeded; metric = messages.maxRate` or something, if you'd prefer people to have a metric they can look up.
- reactor limits are missing: 'reactor.httpEvent.maxRate', 'reactor.amqp.maxRate' (assuming I make the change above)

> For example, you have hit an instantaneous per second rate limit, then we will only reject new published messages for that second, and the block will be removed in the following second interval

This isn't really correct. For things like per-channel and per-connection limits, the block interval is 6 seconds long. For account-wide rate limits there isn't a block interval, instead suppression is done on a rolling probabilistic basis (see 'instantaneous limits' section of https://support.ably.io/support/solutions/articles/3000079684-understanding-account-limit-notifications-within-email-alerts-or-in-your-dashboard )
Thanks @SimonWoolf
Yeah, I was being a bit lazy with my per-second rate limit description, and was also trying to keep it simple to understand, but it's wrong, so I will update it and be less lazy.
@SimonWoolf re: my request for common error codes, just got this from a customer 30 minutes ago:
> Thanks for contacting, we’ve took a look at the current situation. https://www.ably.io/accounts/1391/notifications states we’ve hit the apiRequests limit, not the messages limit. Also, the error code given is not documented: https://help.ably.io/error/40115
Would it be possible to please get that list of the most common error codes so that we can create appropriate solution articles?
> Would it be possible to please get that list of the most common error codes so that we can create appropriate solution articles?
what list of most common error codes?
See task above in this issue:
> Please review https://support.ably.io/solution/folders/3000012310 to see if we have articles for the error codes relating to common rate limit issues. If there are any missing, please post them here and we can ask @tomczoink to help.
...So now I'm looking at this, they're not very consistent. Per-connection publish rate limit is 42911 (or 42921 if fatal), and acct-wide rate limits are also 42911, but most other rate limits (e.g. per-channel) are 42910.
ISTM we should either just use the same code for all rate limits (well, two, one fatal and one nonfatal), or have a different code for every different rate limit. WDYT? @mattheworiordan @paddybyers
It depends on whether or not we want to be able to link to different help docs in those cases. Or perhaps use a limited number of codes, and exploit the functionality of being able to construct an href based on information beyond just the code.
If we do go for a different code in each case, then it should be a whole new family starting at 10000 or something.
Well, I think it's better to have unique error codes for logical groupings of limits so that we can write simple articles to address each problem. For example, how to address message.maxRate problems (fan-out) is very different from how to address tokenRequest hourly limits. Equally, having different codes for all limits could be hard.
Please can you suggest some natural groupings and we'll get @tomczoink to write up articles we can build on.
OK, so let's have a completely new series of codes that get sent along with a 429 statusCode for different rate limits.
10000-11999 - rate limit errors

- 10000-10999 - hard limit errors - ie the attempted operation was rejected or modified as a result of the rate limit
- 11000-11999 - warning limit codes - the attempted operation was permitted, but the current usage is close to hitting a hard limit.
The individual codes will be consistent in the 10xxx and 11xxx series.
Then let's break it down by functional area:
- 100xx: generic/unspecified
  - 1001x: instantaneous limit
  - 1002x: hourly/monthly limit
  - 1003x: payload/request size limit
- 101xx: message limits
  - 1011x: instantaneous limit
  - 1012x: hourly/monthly limit
  - 1013x: size limit
- 102xx: connection limits
  - 1021x: instantaneous limit
  - 1023x: other limit (eg max attachments)
- 103xx: channel limits
  - 1031x: instantaneous limit
  - 1033x: other limit (eg max presence)
- 104xx: request (api request / token request) limits
  - 1041x: instantaneous limit
  - 1042x: hourly/monthly limit
  - 1043x: size limit
- 105xx: reactor limits
  - 1051x: instantaneous limit
  - 1052x: hourly/monthly limit
  - 1053x: size limit
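To illustrate how the proposed numbering could be decoded, here is a minimal Python sketch. It assumes the scheme exactly as proposed in this thread (not anything that has been implemented): the thousands digit gives hard vs warning, the hundreds digit gives the functional area, and the tens digit gives the kind of limit. The names `classify`, `RateLimitCode`, `AREAS` and `KINDS` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Functional areas (hundreds digit), as listed in the proposal above.
AREAS = {
    0: "generic/unspecified",
    1: "message",
    2: "connection",
    3: "channel",
    4: "request",
    5: "reactor",
}

# Kinds of limit (tens digit). The proposal calls the 3x slot a "size limit"
# for some areas and an "other limit" for connections and channels.
KINDS = {
    1: "instantaneous",
    2: "hourly/monthly",
    3: "size/other",
}


@dataclass(frozen=True)
class RateLimitCode:
    code: int
    severity: str  # "hard" (10xxx) or "warning" (11xxx)
    area: str
    kind: str


def classify(code: int) -> RateLimitCode:
    """Decode a code from the proposed 10000-11999 rate limit range."""
    if not 10000 <= code <= 11999:
        raise ValueError(f"{code} is outside the proposed rate limit range")
    severity = "hard" if code < 11000 else "warning"
    offset = code % 1000                 # same layout in both series
    area = AREAS.get(offset // 100, "unassigned")
    kind = KINDS.get((offset // 10) % 10, "unspecified")
    return RateLimitCode(code, severity, area, kind)
```

Under these assumptions, `classify(10110)` describes a hard, instantaneous message limit, and `classify(11420)` describes a warning for an hourly/monthly request limit. Existing codes such as 42911 fall outside the range and raise `ValueError`.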
I like this. Would it be possible to agree on keeping the error codes in one place (a bit like what we did with https://github.com/ably/ably-ruby/pull/171/commits/7df0916b3759f7ab9067cb2ccdfa982112bac970) so that once implemented, we could ask @tomczoink to create pages for each error code? I realise it's a lot of support articles, but they will be quite generic with slight variations and links to larger articles.
> Would it be possible to agree on keeping the error codes in one place

You mean like in ably-common as now, or something different?

> Would it be possible to agree on keeping the error codes in one place (a bit like what we did with ably/ably-ruby@7df0916)
Well yes. I was thinking about how realtime handles it, but of course that's not really all that important if we update ably-common. So ignore me if it's going to go into ably-common.
One concern, anyone who is catching errors now that relate to rate limits, and backing off requests, will now have broken behaviours. We could email every customer, but it's not ideal...
> One concern, anyone who is catching errors now that relate to rate limits, and backing off requests, will now have broken behaviours. We could email every customer, but it's not ideal...
The statusCode will be unchanged, so if they really had written their error-handling code correctly, they'll react in some appropriate manner when they get that statusCode even if the specific code isn't recognised.
I don't think that issue should stop us from doing it properly.
> The statusCode will be unchanged, so if they really had written their error-handling code correctly, they'll react in some appropriate manner when they get that statusCode even if the specific code isn't recognised.
Ok, seems fair enough. What status code do we unilaterally use for rate limiting, and for what other operations will we see that error code? We should write a support article on backing off if you hit a rate limit with some sample code as an example. So I need that info first.
429 - too many requests. We only return this in the case of rate-limiting.
We ought to check whether or not this can also be returned by the router or ELB in situations where a client might want to react differently. Even so, in these cases they will get either an empty code, or a generic 42900 code.
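As a starting point for the back-off sample code mentioned above, here is a rough Python sketch of a client that reacts to the 429 statusCode alone, so it keeps working if the specific code changes, is empty, or is the generic 42900. It is not based on any Ably client library API: `ApiError`, `backoff_delay` and `call_with_backoff` are hypothetical names, and the retry count and delay values are arbitrary assumptions.

```python
import random
import time

RATE_LIMIT_STATUS = 429  # the statusCode discussed above


class ApiError(Exception):
    """Hypothetical error type standing in for whatever a client library raises."""

    def __init__(self, status_code, code=None):
        super().__init__(f"statusCode={status_code} code={code}")
        self.status_code = status_code
        self.code = code  # may be None or unrecognised; deliberately not inspected


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_backoff(operation, max_attempts=5, sleep=time.sleep):
    """Run operation(), retrying with backoff only when it fails with a 429."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ApiError as err:
            # Decide on the statusCode only; give up on the final attempt.
            if err.status_code != RATE_LIMIT_STATUS or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

A caller would wrap the rate-limited call, for example `call_with_backoff(lambda: publish(message))`, where `publish` is a placeholder for the real operation. Errors with any other statusCode are re-raised immediately rather than retried.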
See https://ably-real-time.slack.com/archives/C030APSH3/p1541097699184400
A somewhat related discussion on the assumed meaning of error codes based on existing errors. Whoever is writing the support article has no way of knowing if an error code could have additional meanings, so will often write an article focussed on the problem they are aware of, unaware that the same error code may have a completely different meaning.
@mattheworiordan happy for this to be closed now we've got a collection of error codes + articles for any popular ones?