Smarter throttling

Open romanych opened this issue 10 years ago • 25 comments

We are running multiple products and environments under a single Exceptionless account on the large plan.

Sometimes one of the products in one of the environments starts sending a lot of exception logs, and the system then throttles all exceptions from all products, which is a problem.

Also, the throttling is not really throttling: it stops receiving any errors at all until the hour completes.

We would highly appreciate it if:

  • throttling were applied at the project level
  • we received an e-mail notification when throttling is applied
  • 1-5% of errors were still delivered even under throttling

Currently we create temporary accounts to troubleshoot such cases, which happen once every few months.

romanych avatar Jun 23 '15 14:06 romanych

@romanych:

  • How would you want it applied at the project level? If you divided your plan limit by your project count, it would leave only a few events per project (some projects will have fewer events than others).
  • This would be really nice to have and I think we should do this.
  • What do you do when you go over your plan limit completely? Throttling on a per-hour basis at least stops large amounts of data from killing your monthly limit.

@ejsmith feedback?

niemyjski avatar Jun 23 '15 21:06 niemyjski

Yeah, I'm really not sure how this would work. I don't want people to have to configure this and I don't want to have a ton of options. So I'm not sure how it would work. Gotta think about it.

ejsmith avatar Jun 23 '15 23:06 ejsmith

Yeah, we need to figure out a better way to handle this.

niemyjski avatar Jun 29 '15 15:06 niemyjski

@romanych can you provide some feedback on my questions above?

niemyjski avatar Jun 30 '15 16:06 niemyjski

Almost done with a pretty solid suggestion. Will do my best to send it tomorrow.


romanych avatar Jun 30 '15 16:06 romanych

@niemyjski, @ejsmith let me share my thoughts on this.

There is a maximum number of events per month per account, and this number depends on the pricing plan.

Exceptionless customers (like we are) are mostly interested in two things:

  • Stacks
  • Occurrences

We are using Projects to logically group stacks. So I am really interested to:

  • know when particular stacks start receiving so many occurrences that my plan could be exceeded
  • see all other occurrences that are happening on other stacks - they can help to troubleshoot the issue
  • see whether a fix that we rolled out has helped to solve the issue and no new occurrences are arriving

These thoughts lead me to propose implementing throttling at the stack level, not the project or account level. When I say throttling, I mean setting a rate at which occurrences can be logged.

To be more precise, I would suggest adding the following properties to a stack:

  • AcceptRate (0.0001 - 1) - the share of occurrences per stack that should be saved. 1 by default; recalculated when throttling is applied
  • ApproxOccurrences(timespan) - a function that returns the approximate number of occurrences during the given timespan. If throttling was applied and only X occurrences were saved and counted, this function helps estimate the scale of the incident.

When a user is viewing a stack that is under throttling, it should show:

Exception occurrences are throttled now. The approximate number is ApproxOccurrences(timespan).

I haven't gone deep into the Exceptionless event logging system, so I am not suggesting an implementation.

That was the easy part; the hard part is how and when to calculate AcceptRate.

Let's define T as the period at which some job (e.g. a Throttler) checks all occurrences and decides what should be done with AcceptRate:

  • decrease it (apply throttling)
  • increase it (remove throttling)
  • leave it unchanged

According to my observations, you have a very similar job with T = 1 hour which sets AcceptRate to 0 at the account level.

I would suggest setting T = 15 minutes.

I would also like to define some functions:

  • MaxThroughput(minutes, account) = minutes * account.MONTH_EVENTS_LIMIT / (30 * 24 * 60) - the maximum number of events that can be accepted during the given minutes without exceeding the plan limit
  • OccurrencesReceived(minutes, scope) - how many events were submitted during the given minutes. The scope can be an account, project, or stack
  • OccurrencesSaved(minutes, scope) - how many events were actually saved.

It's easier to pseudo-code the algorithm.

For simplification, let's say each stack can consume up to 25% of the monthly limit.

foreach (account) {
  if (OccurrencesReceived(T, account) > MaxThroughput(T, account)) {
    account.isUnderThrottling = true;
    DecreaseAcceptRate(account);
  } else if (account.isUnderThrottling) {
    TryIncreaseAcceptRate(account);
  }
}

function DecreaseAcceptRate(account) {
  foreach (stack in account.stacks) {
    if (OccurrencesReceived(T, stack) > 0.25 * MaxThroughput(T, account)) {
      stack.throttled = true;
      stack.acceptRate = 0.25 * MaxThroughput(T, account) / OccurrencesReceived(T, stack);
    }
  }
}

function TryIncreaseAcceptRate(account) {
  foreach (stack in account.stacks.throttled) {
    stack.acceptRate = MIN(1, 0.25 * MaxThroughput(T, account) / OccurrencesReceived(T, stack));
  }
}

There are some obvious problems:

  • the fixed 25% allocation per stack
  • what happens if different stacks spike in different periods

Nevertheless, I hope my idea makes sense and that you will get some inspiration from it.

romanych avatar Jun 30 '15 20:06 romanych

Thanks for your great feedback, I'll reread it a few times over the next few days and think about it as well. Just some feedback off the top of my head.

You can see these trends today on the error dashboard. There is a list of the most frequent stacks and you can view trending data over time. Granted, there are things we want to do to improve on this.

Currently we are throttling organization wide. I think the finest grain of control we could reasonably go to would be throttling at the project level. We have an http handler that throttles events based on the api key (project level) without even reading the stream or processing anything. If we did it during the event pipeline, that would add a serious amount of overhead to the system (we'd be queueing the event to disk, then deserializing it and processing it via a job, then running it through a pipeline just to see if that specific instance is throttled).

  • know when particular stacks start receiving so many occurrences that my plan could be exceeded
    • I'd like to get notifications for this as well as when you are being throttled. Can you create a new issue for this?

I like your idea of having max throughput be 15-minute based. It would be nice to have a 1.5x rate per 15-minute period, and if a project hits 75% of the rate in 5 minutes we throttle just that project. Thoughts? I think we need to keep the logic simple because then it's easy to understand/test and it's going to be lightning fast (not slow things down).
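
Purely as an illustration of that check (none of these names exist in Exceptionless; projectEventsPer15Min is assumed to be the project's proportional share of the plan for a 15-minute window), it could look something like:

using System;

public static class ProjectThrottleCheck
{
    // Hypothetical per-project burst check for a 15-minute window with a 1.5x allowance.
    public static bool ShouldThrottle(int eventsThisWindow, TimeSpan windowElapsed, double projectEventsPer15Min)
    {
        double windowLimit = projectEventsPer15Min * 1.5;  // allow bursting to 1.5x the proportional rate

        // Throttle early if 75% of the window allowance is used within the first 5 minutes.
        if (windowElapsed <= TimeSpan.FromMinutes(5) && eventsThisWindow >= 0.75 * windowLimit)
            return true;

        return eventsThisWindow >= windowLimit;
    }
}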

niemyjski avatar Jun 30 '15 21:06 niemyjski

So the problem with this is that the majority of cost associated with accounts is the bandwidth and initial processing of the request. Our current policy is that we will throttle your account if you're going over your plan, but if you are going significantly over your plan and incurring a lot of overage / cost then we would ask you to upgrade your plan. If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money. i.e. if it doesn't affect them, then why should they bother trying to fix it?

Does this make sense?

ejsmith avatar Jun 30 '15 23:06 ejsmith

To your original feedback, I think sending an email letting you know when we throttle your account and including data showing you why it got throttled would be really good.

Also, what we are doing currently is throttling per hour in order to give you more of a sampling of events over the course of time. Maybe we need to change this up to use smaller windows so that there aren't long periods of time with no events.

ejsmith avatar Jun 30 '15 23:06 ejsmith

If we make this change then it's really just encouraging people to not worry about the fact that they are sending us a lot of data and costing us money. i.e. if it doesn't affect them, then why should they bother trying to fix it?

@ejsmith valid statement. So if a stack is causing trouble, the penalty must be more serious than just throttling at the stack level.

@niemyjski, would it be possible to have project-level throttling but with a rate limit instead of rejecting all events? I think it would perform quite well and allow better control. For instance, if a project has a sampling rate of 0.1 in a period and 100 errors were accepted, it means that ~1000 events were sent. If 1000 is less than the allowed throughput, then the accept rate can be set back to 1; otherwise it can be adjusted. I think that is better than starting each period T by collecting errors and then stopping once the allowed throughput is exceeded.
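
A minimal sketch of that back-calculation (names are made up; this is just to make the arithmetic above concrete):

using System;

public static class SamplingMath
{
    // Back-calculate the true event volume from the sampled count, then pick the next accept rate.
    public static double NextAcceptRate(int acceptedCount, double currentRate, double allowedThroughput)
    {
        double estimatedSent = acceptedCount / currentRate;       // e.g. 100 accepted at rate 0.1 => ~1000 sent
        if (estimatedSent <= allowedThroughput)
            return 1.0;                                           // back under the limit, stop sampling
        return Math.Min(1.0, allowedThroughput / estimatedSent);  // otherwise keep only the allowed share
    }
}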

Also, it would be a good idea to decrease the amount of traffic sent to you (you are paying for it as well). Let me share a case we had yesterday.

One of our components processes a queue and aggregates data into Redis. Redis was down for 30 minutes (memory exceeded, it rebooted, went into read-only mode, and finally recovered). This component is multi-threaded and deployed to 5 instances. Each thread started generating exceptions.

I would love to hear how you would solve this issue and make sure that at least some of the exceptions are still persisted in Exceptionless.

Going further, I could imagine that in this case we could sample the errors to send on the client side. I don't know where you define the stack (client side or server side), therefore I don't know whether it would be stack- or project-level throttling. To me it looks like the realtime config feature you recently delivered could help handle this. What are your thoughts?

romanych avatar Jul 01 '15 05:07 romanych

@ejsmith what do you think?

@all, I think that in a case like this where your Redis queue went down, maybe our client side code should be more aggressive at removing duplicates. We have all the data in our handler and could return a custom header for how much of your limit is used up and then get really aggressive. Thoughts on this? But at the same time... there is a major issue and things are going to get throttled for an hour (it would probably be fixed during this time). I'm also thinking that having project-level throttling may introduce complexities, but it might be worth a look.

Yes, you could always write a plugin as well to disable error submission via our client configuration.

niemyjski avatar Jul 01 '15 14:07 niemyjski

@romanych we currently throttle you for those exact reasons, so that one event like your Redis server going down doesn't eat up all of your events for the month. I think we can improve this though.

We could send a status code back to the client to tell it that the account is currently throttled (which it already does), but also include a sampling rate that it should use. So then the client would take that sampling rate and if it was 0.1 then it would only send 1 out of 10 events that it gets.
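
Roughly, the client side of that could be as simple as the sketch below (the server-provided rate is the idea from the comment above; the names and shape of the API here are made up):

using System;

public class ThrottledSampler
{
    // Client-side sketch: apply whatever sampling rate the server last sent back.
    private readonly Random _random = new Random();
    private double _samplingRate = 1.0;

    public void OnSubmissionResponse(bool throttled, double? samplingRateFromServer)
    {
        _samplingRate = throttled ? (samplingRateFromServer ?? 0.1) : 1.0;
    }

    public bool ShouldSend()
    {
        return _random.NextDouble() < _samplingRate;  // at 0.1, roughly 1 in 10 events gets sent
    }
}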

We actually used to calculate stack signatures on the client side, but the problem was that it made the clients too complicated to implement and also the calculation of the signature would get out of sync because people didn't update their clients. So now we try to make the clients as dumb as possible so that it will be really easy for people to implement clients in other platforms.

Here is what I am thinking:

  1. During the throttled period, still accept a sampling of errors. This sampling rate would have to be dynamically calculated based on velocity and plan limits (see the sketch after this list). I think this is doable because we keep a counter of the current overage count for the throttling window and we know the plan limit.
  2. Send a throttled status back to the client along with the current sampling rate that the client should apply.
  3. Change throttling to be at the project level and use a setting to control what percentage of events a specific project should use. Percentages can add up to more than 100% across all projects in an org, and by default we would set this to 100% for each project so that it wouldn't affect the default behaviour. Expose this percentage in the manage project UI so that users could override this behaviour and keep a specific project from dominating the account. Maybe label it: "Maximum percentage of plan events this project can use"
  4. Have the clients include the number of events they have discarded due to throttling since their last submission. The idea is to get the client to discard events during throttling without sending them to us at all to reduce costs. But the problem is that the user doesn't know how many events are being thrown away due to throttling. If the client sends this, then we can increment our counters by the value and we would then have an accurate representation of the true volume of events and how many of them are being thrown away.
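
A minimal sketch of the dynamic calculation mentioned in point 1, assuming the window's overage counter and plan limit are at hand (names are illustrative only):

using System;

public static class SamplingRateCalculator
{
    // Server-side sketch: derive a sampling rate from the current usage for the
    // throttling window versus the window's plan limit.
    public static double Calculate(long eventsInWindow, long windowLimit)
    {
        if (eventsInWindow <= windowLimit)
            return 1.0;                                                // under the limit, accept everything
        return Math.Max(0.01, (double)windowLimit / eventsInWindow);   // floor so a trickle always gets through
    }
}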

I think this would help to still get a sampling of events even while the account is throttled, but I still think it would be extremely likely that 1 type of event would dominate the account. It's hard to balance keeping the clients simple with the need to get a good sampling of the events. Maybe we could do a very simple version of stacking on the client by just hashing the event type and maybe the error types.
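
For that very simple client-side stacking, a first cut might be nothing more than hashing the two values (purely illustrative; this is not the server's stack signature logic):

using System;
using System.Security.Cryptography;
using System.Text;

public static class SimpleClientStacking
{
    // Naive client-side "stack" key: hash the event type plus the error type.
    public static string GetStackKey(string eventType, string errorType)
    {
        using (var sha = SHA1.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(eventType + "|" + errorType));
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}

During a throttled period the client could then keep, say, the first few events per key and discard the rest.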

Thoughts?

ejsmith avatar Jul 01 '15 16:07 ejsmith

@ejsmith I like your ideas

romanych avatar Jul 01 '15 19:07 romanych

@ejsmith I heard from an end user today, and each time they've had an issue they were throttled or had reached their limit when they went and looked in Exceptionless. They understand the limit, but we need to work on getting email notifications when you are throttled.

niemyjski avatar Jul 13 '15 14:07 niemyjski

Do you have any plans to work on this? We were throttled yesterday and were unable to identify why. We were forced to upgrade without seeing any spike in the UI.

romanych avatar Nov 25 '15 14:11 romanych

@romanych this is something we need to work on, and we could also use some help implementing it. We've spent the last few months working on performance and stability, and we think we are done working in that area as of this week. Would you mind sending me an in-app message and I'll take a look into your account with you.

niemyjski avatar Nov 25 '15 14:11 niemyjski

We may also make this smarter by throttling by product version: https://github.com/exceptionless/Exceptionless/issues/156

niemyjski avatar Nov 25 '15 14:11 niemyjski

@stephenwelsh commented on Nov 17, 2015: In our scenario we have 1000+ clients that install our product (typically a desktop app), and although we have an in-place upgrade capability, it's optional and user driven. Therefore, after a while we have a number of old installations, with issues that have since been resolved, submitting irrelevant exception reports. Essentially it is out of our control to upgrade the older instances, but given that the newer releases have resolved the issues, the submissions from older releases become less relevant.

Therefore we think it would be appropriate to control the clients' submissions with a project-level setting that enables/disables the client based on its version. For example:

Set the Project Configuration Setting: EnableVersion=4.2

In our application, check the ‘EnableVersion’ setting from the Exceptionless client once registered; if the current application version is older (i.e. 4.1), then disable submissions. If the current version is the same or newer, then submit.
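
A sketch of that client-side check (the EnableVersion setting name comes from the comment above; how the setting is actually retrieved from the client configuration is left out and assumed):

using System;

public static class VersionGate
{
    // Decide whether this installation should submit events, given the minimum
    // version configured on the project.
    public static bool ShouldSubmitEvents(string enableVersionSetting, Version currentAppVersion)
    {
        if (string.IsNullOrEmpty(enableVersionSetting))
            return true;  // no gate configured, keep submitting

        return Version.TryParse(enableVersionSetting, out Version minVersion)
            && currentAppVersion >= minVersion;
    }
}

With EnableVersion=4.2, a 4.1 installation would stop submitting while 4.2 and newer installations keep submitting.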

In our situation the version number is enough for us to control which submissions are more or less relevant; however, I would imagine there may be other criteria that would be valuable to leverage for enabling/disabling submissions.

niemyjski avatar Jan 12 '16 17:01 niemyjski

I think we could implement the version:latest functionality and send the latest version down to the client in a header. Then we could get really smart and even allow you to turn off old clients. I know in one of our products we can get hammered with older errors that are no longer relevant.

niemyjski avatar Jan 12 '16 17:01 niemyjski

We just talked about this some more and will be updating this issue with specifics but we want to do some kind of sampling per project by sending down a header to the clients.

niemyjski avatar Jan 14 '16 20:01 niemyjski

Current thoughts are this.

  1. Add support for an EventRate header that gets returned from the event post API. This would tell the client to limit its events to X per minute. The client would then use sampling to get a dispersed set of events while trying to keep its rate to what it has been told (see the sketch after this list).
  2. Projects will have a setting to control what percent of the org's plan limit the project should take up.
  3. The server would take the various knowledge it has and give intelligent event rates back to the clients, using the overall org limit and project percentage as well as knowing how many clients are reporting events for this project. It could maybe even detect that one of the clients is sending the vast majority of the events and limit that one differently than the other clients. Knowing which client is which is probably an issue, since clients could be behind a proxy and all coming from the same IP.
  4. Allow the client to send a header value containing a count of the number of events it has thrown away. This number will be incremented on the project so that users can see how many events are being thrown out.
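
A rough client-side sketch for points 1 and 4, assuming a hypothetical EventRate value (events per minute) has already been parsed from the response header:

using System;

public class EventRateLimiter
{
    // Enforce an events-per-minute cap taken from the assumed EventRate header,
    // and count what gets discarded so it can be reported back to the server.
    private readonly int _maxEventsPerMinute;
    private int _sentThisMinute;
    private int _discardedSinceLastSubmission;
    private DateTime _windowStart = DateTime.UtcNow;

    public EventRateLimiter(int maxEventsPerMinute)
    {
        _maxEventsPerMinute = maxEventsPerMinute;
    }

    public int DiscardedSinceLastSubmission => _discardedSinceLastSubmission;

    // Returns true if the event may be sent now; otherwise records it as discarded.
    public bool TrySend()
    {
        if (DateTime.UtcNow - _windowStart >= TimeSpan.FromMinutes(1))
        {
            _windowStart = DateTime.UtcNow;
            _sentThisMinute = 0;
        }

        if (_sentThisMinute < _maxEventsPerMinute)
        {
            _sentThisMinute++;
            return true;
        }

        _discardedSinceLastSubmission++;
        return false;
    }
}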

ejsmith avatar Jan 14 '16 20:01 ejsmith

That looks good as a general approach for limiting over-rate submissions; some thoughts:

  • It should be disabled by default on the server side
  • A strong warning should be included in the server project to explain that valid submissions may be suppressed
  • Ideally the event rate should be more granular than just the total for the project, e.g. if the rate was lowest for the event types with the highest submissions (or based on client IP addresses with the highest submission counts), something like that. Then, if one particular event type or client starts to flood, rare events may still get through

stephenwelsh avatar Jan 14 '16 21:01 stephenwelsh

We also have a pull request which will help out quite a bit on client side deduping: https://github.com/exceptionless/Exceptionless.Net/pull/71

niemyjski avatar Feb 29 '16 12:02 niemyjski

I would also like this. Thank you.

ahmet8282 avatar Mar 29 '16 14:03 ahmet8282

Merged from #212:

Current throttling calculation: (total monthly events / hours in month) * 5

New throttling calculation: (events left in month / hours left in month) * 5

This would keep us from throttling accounts at the end of the month that haven't used a lot of their plan up. People feel cheated when we throttle and they still have a lot of events left in the month.

We can potentially calculate this rate once a day for each org depending on how expensive it is to calculate.
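
As a rough illustration of the difference (all numbers made up): with a 250,000-event plan, 720 hours in the month, and 200,000 events still unused with 72 hours left:

// Illustrative numbers only, not actual Exceptionless code.
int monthlyLimit = 250_000;
int hoursInMonth = 30 * 24;   // 720
int eventsLeft = 200_000;
int hoursLeft = 72;

double currentHourlyLimit = (double)monthlyLimit / hoursInMonth * 5;  // ~1,736 events/hour
double newHourlyLimit = (double)eventsLeft / hoursLeft * 5;           // ~13,889 events/hour

Under the new calculation, an account that has barely used its plan gets a much higher hourly allowance near the end of the month, which is exactly the "feel cheated" case described above.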

niemyjski avatar Apr 06 '16 13:04 niemyjski