
Fix optimization issues with webhooks

Open kvangent opened this issue 7 years ago • 17 comments

Description

Fixes a bunch of issues with sending webhooks:

  • Stops RM from shooting itself in the foot with connection pooling
  • Fixes bug with frame intervals
  • Adds logging when a request's status code is bad (4XX or higher)
  • Increases default timeouts to more reasonable amounts (5.0s)
  • Increases the default cache size to reduce duplicates (to 10000)

Motivation and Context

Currently RM causes a bunch of problems for the receiving webhook servers (aka PokeAlarm) by prematurely breaking connections and wasting resources on reestablishing them. This causes errors such as the notorious 'broken pipe' error (errno 10053 on Windows and errno 53 on Linux) and decreases PA's performance. Fixing them is a win for both our user bases.

How Has This Been Tested?

I have not tested. I'll leave this to your user-base to verify the changes work correctly.

Screenshots (if appropriate):

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.

kvangent avatar Apr 08 '18 01:04 kvangent

👎 Sneaky, @kvangent.

I've got this strange feeling right now. As a community host, I'd call it disappointment. As a fellow dev, disappointment. If I were your parent, also disappointment. As just a regular person, disappointment in how you're choosing to handle this.

With the RM team in your PMs, as well as a handful of RM regulars on your own team who know we're very accessible, you had better options if your concern was really the users and writing correct code.

The phrasing of your PR description is intentionally vague and undescriptive, leaving out the details to make sure that the non-technical reader or the reader who doesn't manually verify your code (with understanding of python, requests and urllib) doesn't understand that your claims are wrong. Merging this would reduce people's performance. Worst of all: you know better, but you're choosing not to tell them.

And then you don't even include the conversations we've had about this where I've explained to you why you're wrong. At least I consider myself happy that you didn't PR this to RocketMap, as you undoubtedly knew we'd reject it.

Here's what you really changed:

  1. Introduced duplicate code and a left-over, duplicate, half-broken comment because of PEBKAC (user error, your mistake). (screenshot)
  2. Reduced performance by not re-using the underlying TCP connection. (At least this would be the result if you would actually have removed connection pooling, which you didn't.)
  3. Reduced performance by retrieving response content when we really shouldn't. Using the response headers without response content is much more efficient.
  4. Added one line of logging when the response code is not 200 OK.

All the above is just deleting work we've previously done on RM to make webhooks more efficient. Here's what else you changed:

  1. Increased webhook LFU cache size default to 10k instead of 2.5k. OK, sure. No performance-changing edit unless the client is sending too many messages and the host hasn't configured it properly. But it's sensible to assume they might not fully understand it, and to increase the default.
  2. Changed integer division on wh_frame_interval to float division. Good fix, though this old code would only cause intervals to be incapable of using anything less than a full second as an interval setting (e.g. 5500ms would turn into 5s rather than 5.5s). Not exactly a bug that "decreases PokeAlarm performance".
  3. Increased default timeout to 5s instead of 2s. That's an insanely ridiculous timeout. Most web platforms or APIs are built to respond in a handful of milliseconds - BossLand's hashing service sometimes even responds in less than one millisecond (microseconds!) when I send it a hashing request from my server. Yet yours needs 5 whole seconds? That means you personally believe your tool is incapable of handling anything more than one per 5 seconds, otherwise it introduces lag? If that's so, then you should inform your users and stop recommending it for production use.

by prematurely breaking connections and wasting resources on reestablishing connections

It can't. RM closes a connection when it has received the response headers, which is all it needs. It reduces resource usage (you're removing that optimization) and never re-establishes a connection since it already received the 200 OK response. Retries only happen when the webhook server (e.g. PA) didn't reply, which happens if PA already can't handle the load.

Good research and changes start by making a list of all the author's assumptions and testing them properly one by one, to either validate or invalidate them so we can build changes we know are true.

With your PR, you only have to start with your first sentence:

Currently RM causes a bunch of problems for the receiving webhook servers

Yet the PA alternatives handle tens of thousands of separate messages per second without any hiccup or complaint. The devkat hashing server handles the same amount of requests and uses a pub/sub model so it has to wait for the worker's reply, which PA doesn't have to do.

Ah, but then I read on:

(aka PokeAlarm)

Yes, PokeAlarm and its architecture might indeed be the issue. Make sure to let us know if you can't handle the load, and we'll make sure to inform all users of this PA limitation so they can make an informed decision rather than having to butcher their performance and timeouts just to make it "bearable" for PA.

And next time we do benchmarks for PA on our own spare time, and we send you ideas for performance improvements, don't reply with a half-assed "no they won't improve performance", to then implement them yourself a day later because they did improve performance, like you did last time.

sebastienvercammen avatar Apr 08 '18 09:04 sebastienvercammen

@kvangent Thank you for contributing to this fork, it is appreciated. Your code will undergo review and test.

@sebastienvercammen and everyone else, please keep it civil in the future. We appreciate the code review feedback but there is no need for personal attacks. I see that words were chosen carefully above but you know what I mean.

tomballgithub avatar Apr 08 '18 19:04 tomballgithub

@tomballgithub Sabotaging repositories or people's setups because of incompetence isn't funny.

My comments aren't nice, but they aren't supposed to be. Do you think it's "nice" of someone to ignore good advice, to submit code that degrades all users' performance and to phrase things to be intentionally deceptive just because they're too stubborn to consider that their own code is what needs fixing?

We've tried the "be nice to people first instead of trying to be right" approach. We were ignored or rejected (and then our advice got implemented while those same people said to our face that we're wrong), and end users end up taking the short end of the stick because they're never informed (PA performance problems & their cause, PMSF security holes, chancity MitMing his "customers" and baiting them into a crypto pump-and-dump scam).

The whole of the Pokémon Go development community is filled with scummy opportunists only looking out for themselves and their own gain.

We really wish it wasn't all like this, but it is. And I'm done "being nice first".

If people's feelings get hurt, perhaps they shouldn't be submitting bad code, ignoring good advice, or trying to scam people in the first place.

Time for everyone to take a look at themselves first.

That's my code review: this is trash, and it's time to start acknowledging it.

sebastienvercammen avatar Apr 08 '18 22:04 sebastienvercammen

I'm really mystified how these tools haven't moved to a proper message bus at this point, instead of rolling their own webhooks.

jmslagle avatar Apr 09 '18 01:04 jmslagle

@jmslagle Not advocating for any specific tool (I don't use any of them), but PoracleJS (https://github.com/kartuludus/poraclejs) uses a message queue (RabbitMQ), and novabot (https://github.com/novskey/novabot) reads directly from the database.

sebastienvercammen avatar Apr 09 '18 10:04 sebastienvercammen

It took me a while to compose a response to this. I’m slightly stunned by the absurdity of it all. To drop to this kind of personal attack and baseless accusation feels out of character, even for you. Several of your accusations don't really make sense to me - I’m not really sure how I stand to gain from 'sabotaging' repos. I’ve double-checked, but none of my changes include any secret bitcoin mining or anything.

@sebastienvercammen - Given the level of hostility in your posts, I get the feeling that you feel you have been attacked personally. I’m not really sure why this is - I don’t think anything in my PR has implicated you or attacked your character in any way. There was a problem, I diagnosed it, and I recommended changes to fix it. This in no way amounted to an attack on your character, and I'm sorry if you feel otherwise.

The mature adult response in this situation is to politely leave comments stating why you feel that changes will have a negative effect. In an ideal world, you would leave tangible reasoning or some evidence to back up your conclusions. However, you clearly felt that a long rant lacking any real technical depth was more appropriate. I guess that's your prerogative.

So let me assure you, and the other users of this PR - I’m not trying to pull a fast one on anyone. I’ve already approached one of the devs of this fork about making these changes, and, like mature adults, they made comments and we discussed them before agreeing on a set of changes that make sense. The reason I didn’t leave a long technical explanation is because my overview is accurate and matches the deeper technical explanation I’ve already presented to them.

So let’s leave all this drama behind, and talk about the changes themselves.

your claims are wrong. Merging this would reduce people's performance. Worst of all: you know better, but you're choosing not to tell them.

Dude, I’m comfortable with my claim that these changes will lead to significant performance improvements. But hey, you don’t need to take me at my word - I’ve got sources and reasoning. Let’s take a look at what my changes actually did and the motivation behind them (again, all stuff I’ve already shared with devs of this fork).

But before we start, let’s take a quick moment to briefly cover two concepts (in case some readers aren’t familiar with them, but still want to follow along).

Connection Pooling - When you want to send a request to a server, there is a bunch of underlying work required before you can actually send any data. You have to create a socket and do a bunch of handshaking - all before you can actually send the data. By keeping a ‘pool’ of connections, you can reuse them over and over and save on the overhead. (Note: doing this is beneficial to both sender and receiver, because the connection requires work from both parties.) Python’s great Requests module does this for you almost automatically - you just need to use a Session object to send your requests.
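The pooling idea can be seen one layer down with only the standard library (a sketch with a throwaway local server standing in for the webhook receiver - not RM's actual code; Requests' Session does the equivalent via urllib3): one TCP connection, one handshake, several requests.

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive, so the socket survives between requests

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
for _ in range(3):  # three requests, but only one TCP handshake
    conn.request("POST", "/wh", body=b"{}")
    resp = conn.getresponse()
    resp.read()  # drain the body so the socket can be reused
    assert resp.status == 200
conn.close()
server.shutdown()
```

With `requests`, the loop body would simply be `session.post(url, json=payload)` on a shared `requests.Session()`.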

Webhook Frames - If you are sending several requests consistently in a short period of time, you can group them together and send them at once (again to reduce overhead). A good analogy is sending letters - it’s easier to send a big envelope with 100 letters than to send 100 letters all on their own. When implemented correctly, the sender saves up data over a certain time period (say 500ms) and sends it all at once.
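As a rough sketch of the envelope idea (hypothetical helper names, not RM's actual implementation): collect everything queued during one frame interval, then send the whole batch in a single request.

```python
import time
from queue import Empty, Queue

def collect_frame(outbox, frame_interval=0.5):
    """Gather everything queued during one frame interval into one batch."""
    frame = []
    deadline = time.monotonic() + frame_interval
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            frame.append(outbox.get(timeout=remaining))
        except Empty:
            break  # nothing more arrived before the deadline
    return frame

outbox = Queue()
for i in range(3):
    outbox.put({"pokemon_id": i})

frame = collect_frame(outbox, frame_interval=0.05)
assert len(frame) == 3  # one webhook request instead of three
```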

Now, with the basics in mind, let’s move onto the changes:

(Disclaimer: I’m going to ignore the unintentional duplication of code - that was indeed just an accident. You may also find additional mistakes such as spelling or grammar throughout this post. I'm a human, and thus prone to error. I'm confident I would have eventually looked at the diff and noticed the mistake, but thank you Seb for pointing it out.)

  1. I removed the "Connection": "close" header from being sent with every request. This header tells the receiving server to close the underlying TCP connection once it has sent its response. You can’t reuse a connection if you immediately close it after the first request. https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html https://stackoverflow.com/questions/7823007/setting-request-header-connection

  2. I disabled response streaming and connection closing before a response was read. Typically, an application waits for the entire response to be loaded before processing. If you stream the response, Requests lets you start processing it before the entire response has arrived.

If you check the Requests documentation (link below), it has this to say about streaming: “Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.”

So, you can’t reuse a connection until you’ve read all the data. Instead of reading the rest of the response, RM’s current code just closes the connection (you also can’t reuse a connection once it’s closed - in case anyone’s wondering). This can also create problems for the responding server, which may still be trying to send its response. Since RM has closed the connection while PA is trying to write to it, PA now has to deal with errno 10053 (which happens when someone tries to write to a connection that’s already been closed).

http://docs.python-requests.org/en/master/user/advanced/

  3. I fixed an integer division error that rounded down the webhook frame interval. As Seb stated, this means that 5500ms actually becomes 5s. Since the default setting is 500ms, it actually becomes 0s. This means there is no wait time between requests - every single one is sent on its own. And, because of the first two bugs, this forces the sender (and, incidentally, the receiver) to create a new connection for every single new scan that is found. Since the typical setup often includes many RM instances pointed at a single PA instance, this means PA is drowned in connection attempts. (If you missed the earlier bit about how expensive connections can be: this is the part that “decreases PokeAlarm performance”.)
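The arithmetic of the bug is easy to check. Under Python 2 semantics, `/` on two integers behaves like `//` below, so any sub-second interval collapses to zero:

```python
# Integer division floors the result, disabling batching for sub-second intervals.
wh_frame_interval_ms = 500                    # RM's default setting
assert wh_frame_interval_ms // 1000 == 0      # old behavior: 500ms -> 0s
assert wh_frame_interval_ms / 1000.0 == 0.5   # fixed: 500ms -> 0.5s
assert 5500 // 1000 == 5                      # Seb's example: 5500ms -> 5s
assert 5500 / 1000.0 == 5.5                   # fixed: 5500ms -> 5.5s
```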

  4. I increased the size of the webhook cache from 2500 to 10000. Previously, RM remembered the last 2500 things it sent, and only sent a new thing if it hadn’t already been sent before. Now RM remembers 10000 things, which means fewer things get sent twice. I’m not going to spend a bunch of time here, since Seb seems to agree that increasing the default is sensible. The ideal number for this value would be roughly the highest number of scans one expects to encounter in an hour.

(An even better improvement would be a time-based cache, but I was going for low-hanging fruit here.)
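A minimal sketch of the dedup idea (simplified to an LRU-style cache with hypothetical names - RM's actual cache is LFU, but the sizing trade-off is the same): if the cache is smaller than the volume of distinct events in flight, old entries get evicted and duplicates slip through.

```python
from collections import OrderedDict

class DedupCache:
    """Simplified LRU-style sketch of a webhook dedup cache."""

    def __init__(self, maxsize=10000):
        self.maxsize = maxsize
        self._seen = OrderedDict()

    def seen_before(self, key):
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency
            return True
        self._seen[key] = True
        if len(self._seen) > self.maxsize:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False

# A tiny cache shows how undersizing causes duplicate sends:
cache = DedupCache(maxsize=2)
assert cache.seen_before("encounter-1") is False  # new: send it
assert cache.seen_before("encounter-1") is True   # duplicate: skip it
cache.seen_before("encounter-2")
cache.seen_before("encounter-3")                  # evicts encounter-1
assert cache.seen_before("encounter-1") is False  # resent: cache was too small
```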

  5. I increased the default webhook timeouts from 1s to 5s (I believe Seb incorrectly states 2s above, but it doesn’t really matter). This means that when a request is sent, RM will now wait 5s for a response instead of 1s. It’s not just the time that PA takes to process the data - it’s the time for the request to be sent and processed, and for a response to be sent and processed.

The first problem with Seb’s argument is that he’s incorrectly implying that an application is somehow solely responsible for the overall response time. The truth is that latency has many factors. Since most of our users are hobbyists and likely not sysadmins, they probably aren’t running production-grade equipment, so it’s an unfair and inaccurate comparison to make. Something as simple as an unreliable connection (like maybe wifi?) or a throttled connection could easily cause a transient error leading to increased latency.

The second problem with Seb’s argument is that he doesn’t seem to understand that requests can be handled asynchronously (this means the server can take turns working on multiple requests without finishing them completely). If RM is performing correctly, it will send one request every 500ms, which will create 2 connections a second. If for some reason the 1st one does take the entire 5 seconds to respond, it should still finish just in time for the 11th one to use the connection. This means that at most with this setting change, RM will only be using 10 connections at a time (the default pool size for RM is 25). The correct solution to webhook queue build up is NOT to lower timeouts - it is to increase the pool size until there are enough connections open at once.
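The connection count in that paragraph is just Little's law (in-flight requests = arrival rate × latency), using the figures quoted in this thread rather than measurements:

```python
# Back-of-the-envelope check of the "10 connections at a time" claim.
requests_per_second = 2        # one webhook frame every 500ms
worst_case_latency_s = 5       # the proposed read timeout
in_flight = requests_per_second * worst_case_latency_s
assert in_flight == 10         # fits comfortably in the default pool size of 25
```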

Finally, just to add some additional context: Firefox uses a default timeout of 30 seconds for connections. For PA, the Telegram and Discord defaults are both set to 5s. When we set them lower, users reported issues and dropped notifications. A default timeout of 5s is not only not absurd, but in line with common practice: https://stackoverflow.com/questions/13582409/http-client-timeout-and-server-timeout

  6. I added a check on the response code, and log if it’s not what’s expected. (Seb is correct that in the PR’s current form it’s only checking for 200, but I plan to improve this shortly.) This is more of a quality-of-life improvement, and doesn't really affect performance in a significant way.

There it is - the incredibly long, detailed explanation of why I’ve recommended these changes. It’s not sneaky, or a con (I'm still not sure what my motive for either would be). If my changes are really 'trash', I'm more than open to hearing the reasons behind that assessment.

Back to @sebastienvercammen - I will, however, say this: you seem to be under the impression that all the other leaders in the community are causing problems. Have you heard the saying, “if you have problems with everyone around you, perhaps you are in fact the problem”? Perhaps if you are sick of being treated like the bad guy, you should stop acting like it. (If you are unclear: unwarranted hostility is typically a ‘bad guy move’.)

You are welcome to argue with the changes on a technical level, but I’ll kindly ask you to refrain from any more personal attacks in this PR, or from any criticism not related to this PR from a technical perspective. This fork belongs to neither of us, and it is not the appropriate place for us to continue to air grievances.

This includes any imagined slights you have over your suggestions on PA’s performance. If you (or anyone else) feel you have real suggestions that can improve PA’s performance, I would strongly encourage you to open an issue on PA’s GitHub so that we can discuss it in a more appropriate setting. I've always been more than happy to accept feedback.

kvangent avatar Apr 09 '18 14:04 kvangent

@kvangent

Have you heard the saying - “if you have problems with everyone around you, perhaps you are in fact the problem?”. Perhaps if you are sick of being treated like the bad guy, you should stop acting like it.

My problem isn't that I feel I'm a "bad guy". I put in a lot of effort to do right by people and I've spent a lot of time talking to a lot of people privately to resolve an issue which I felt was due to miscommunication. I'm pretty happy about things.

However, I'm sick of a small handful of people like yourself who twist words to condemn another person (in this case, me) for a certain thing while simultaneously doing that thing themselves. It's a hypocrite's dream tactic: "You're a bad person for making this personal! Yet here's this quote that explicitly does the same thing!" In the pogo dev community, it's also always the same people.

You mention absurdity, but there is nothing more absurd than the mental gymnastics you've gone through to make that personal attack and simultaneously convince yourself that I'm the one making unnecessary personal attacks.

You seem to be under the impression that all the other leaders in the community are causing problems.

You're welcome to tell me which one of my claims is wrong or never happened:

  • PA performance problems
  • PMSF security holes
  • chancity MitMing his "customers" and baiting them into a crypto pump-and-dump scam

However, you clearly felt that a long rant lacking any real technical depth was more appropriate.

Except we've given you all technical reasoning beforehand, weeks ago. But you choose to ignore our advice. Just like the time before: you tell us we're wrong, ignore our advice, but implement some of the changes anyway the day after because we were right, then you somehow end up on this PR blaming me for never including technical reasoning. By your own words, you're a hypocrite.

But to get back to the technical arguments...

Typically, an application waits for the entire response to be loaded before processing.

Depends entirely on implementation. Not a real argument. It's like saying "I implement it by consuming the full response content, so I'm right".

By that logic: we implemented it to only read the response headers and ignore any response content because the content is irrelevant (RocketMap never uses any response content for webhook messages, it only needs to know that it arrived). "Since we implemented it that way, we're right."

If you check the Requests documentation (link below), it has this to say about streaming: “Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.”

You quote pieces of documentation without quoting them in full:

Requests cannot release the connection back to the pool unless you consume all the data or call Response.close.

So why do you think we call Response.close?

If you consume the content, urllib3 itself handles the underlying connection and requests releases the connection to the pool. If you enable streaming and only use the response headers, you can skip the response content but must release the connection to the pool manually by calling .close (per Requests' implementation).
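The headers-only pattern can be sketched with just the standard library (a throwaway local server stands in for PA; in Requests terms this corresponds to `stream=True` plus `Response.close()` - an illustration, not RM's actual code): the status line and headers arrive first, so the client can confirm delivery and drop the socket without ever downloading the body.

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = b'{"detail": "irrelevant to the sender"}'
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("POST", "/wh", body=b"{}")
resp = conn.getresponse()
status = resp.status  # status line + headers have arrived; that's all we need
resp.close()          # drop the socket instead of draining the body
conn.close()
server.shutdown()
assert status == 200
```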

RM’s current code just closes the connection (you also can’t reuse a connection if it’s closed - in case anyone’s wondering).

Ah, here we're getting to the important part.

So why didn't you include the details about how we got here in the first place? Or why we even added read timeouts (although they're always good practice, RM didn't have one a while ago)?

  • PokeAlarm adding infinite amounts of delay and infinitely growing the queue of unprocessed messages because it can't keep up.
  • PokeAlarm taking several hours to respond to a request (yes, users were having queues so large because PokeAlarm wasn't replying to any request in a timely manner that it took over an hour to receive a response, causing all Pokémon to be expired).
  • This infinite growth caused a never-ending cycle of using more and more and more resources both in PokeAlarm and RocketMap (and all services tied to RocketMap) since RM needs to use resources for every pending request to PA.

"Infinite" is a pretty high amount.

And worse: during our PA benchmarks (with one of your team members by the way, it wasn't just me) we noticed that once PA hit its processing limit, it slowed down to processing ~5 requests per second.

You built a process that can't scale and that can't gracefully handle its own limits.

Users continuously hopping into RM's #help channel for problems they don't understand (all related to PA being unable to process the requests), and telling us PokeAlarm's team (on your Discord) is telling them "it's a RocketMap problem".

So put yourself in our position. We're RocketMap, and we're noticing that third party projects are having performance issues, don't have a proper messaging architecture, aren't gracefully handling connections (and rejecting new ones when their service can't handle the load, compounding the problem) and aren't listening when we're telling them about the problem (your response back then was "RocketMap has issues").

We added limits to webhook message concurrency, reworked parallel requests, added timeouts, and started forcefully handling connections because we couldn't trust third parties.

Your reaction to these changes back then: "RocketMap has issues, it's not PokeAlarm, you need to increase the timeout to 5 seconds or higher if you're having problems". Your "support" consisted of telling your users to keep increasing their timeouts on RocketMap, re-introducing the problem of the infinite growth that those timeouts solved in the first place.

This means that when a request is sent, RM will now wait 5s for a response instead of 1s. It’s not just the time that PA takes to process the data - it’s the time for the request to be sent, processed, and a response to be sent and processed.

  1. BossLand takes microseconds to respond to a hashing request. You're confirming that you need up to 5 seconds to do less work (since you don't need to do any processing, a proper messaging architecture queues the message and replies 200 OK to the client, letting background workers do the processing work).
  2. Even if you disregard BossLand, PokeAlarm alternatives have no issues whatsoever, even when processing a number of messages that are orders of magnitude higher than what PokeAlarm handles.
  3. Your wording about timeouts is misleading. RocketMap has a 1-second connect timeout and a 1-second read timeout. It will wait up to 2 seconds: one second for the server to accept the connection, another second for the response to be received. But even 1 second is high - most users run PokeAlarm on the same machine as RocketMap, so you're telling us PA needs more than 5000 to 10000 milliseconds to receive a localhost connection and submit a message onto its queue.

The second problem with Seb’s argument is that he doesn’t seem to understand that requests can be handled asynchronously. If for some reason the 1st one does take the entire 5 seconds to respond, it should still finish just in time for the 11th one to use the connection. This means that at most with this setting change, RM will only be using 10 connections at a time. The correct solution to webhook queue build up is NOT to lower timeouts.

Congrats, another personal attack.

Here's what you don't understand:

  1. PA is incapable of performing under regular load, needing up to 5 - 10 seconds for a single request (and infinitely adding more delay, so no timeout = no limit = PA consumes the world).
  2. PA doesn't respond to requests on time, starving RM's connection pool.
  3. Instead of fixing PA's response time, PA keeps yelling that RM needs to just "increase its connection pool size and timeouts!"

Increasing the connection pool size leads to more and more resource starvation.

PA needs up to 10 full seconds (or even 5 seconds if we ignore the connect timeout) to receive a JSON POST message and submit it to a queue. That's absurd.

When a farmer has a milk cow that no longer produces any milk, he doesn't give it more food, starving his budget (both financial and food) on this one cow who isn't going to improve the livelihood of his family anymore. The farmer kills the cow and eats it.

RM adding hard timeouts = RM killing the terminally ill cow called PokeAlarm so it stops eating all of our users' server resources.

All PA alternatives are still running perfectly fine and have no issues whatsoever.

Finally, just to add some additional context: Firefox uses a default timeout of 30 seconds for connections. For PA, Telegram and Discord defaults are both set to 5s.

Apples and oranges. Firefox is a client browser - it needs to accommodate the average non-technical user and bad web servers. End users would blame Firefox if it didn't wait at least 30 seconds, but in the technical world it's absurd.

PA, Telegram and Discord are all non-time-sensitive public APIs with millions of requests. Intentionally buying and configuring servers to have a minimum acceptable delay makes business sense because it reduces cost by building systems to respond within that timeframe. You'll notice that their systems don't take more and more time to process your requests for every message: they're built to be consistent.

All that doesn't apply to PA.

PA is a single, tiny process that needs to handle only a single instance's load on localhost. It can't, and it's blaming others for it.

Whether you set 1s, 5s or 100000s timeout, PA will break if it can't keep up, leading to missed (dropped) messages. Increasing the timeout only delays the inevitable, it doesn't fix the underlying problem.

This includes any misimagined slights you have over your suggestions on PA’s performance.

Your words just now, and my benchmarks (with Dongo from the PA team), confirm that:

  • PA needs absurd timeouts because it can't handle the load (up to 10 seconds [connect + read timeout] on localhost).
  • PA breaks down entirely once it hits its limit, crawling to a full stop.
  • PA compounds the problem because of the above. This puts all users' machines at risk because PA will endlessly consume resources, starving all other processes.
  • PA can't compete with the performance of alternatives (not even on the same order of magnitude) and PA isn't looking to change that either. You'd rather continue to ignore us, argue and blame RocketMap (or myself).

So PA has a critical problem and its team won't fix it. Thanks for finally confirming that publicly.

The other minor edits (e.g. integer to float division) are unrelated to all of this, but they're good edits and will be merged to stock RM.

And all of this isn't about the code, because the code speaks for itself. It's about the same thing it's been about since the first time it came up: you don't want to listen, you don't want to fix PA's response time. It's OK - but now we know.

All the back-and-forth just because you don't want to fix code is unnecessary, and tiring. And stop telling users that it's an "RM problem" when PA can't handle the load, causing them to ask us for help on fixing it. If you can't be truthful about your application and keep sending people our way (causing unnecessary work for our volunteers), we'll have to remove support in favor of alternatives.

sebastienvercammen avatar Apr 09 '18 15:04 sebastienvercammen

@kvangent , thanks for the PR, it will be reviewed by the dev team. @sebastienvercammen , thanks too for your feedback; I won't repeat what @tomballgithub said, but please heed it.

If the two of you have a problem with each other, please take it to private messages.

rlodev67 avatar Apr 09 '18 16:04 rlodev67

@sebastienvercammen - Again, I'm going to encourage you to open an issue regarding PA's performance if you have concrete suggestions for improving it. Making statements like 'it isn't fit for production use' and claiming that I'm confirming 'that you need up to 5 seconds to do less work' is misleading and incorrect. And again, you seem to be implying that the application's implementation is solely responsible for performance and response time. This is also incorrect - there are a variety of factors, including hardware and environment. Regardless, this is not the appropriate place for this conversation.

Regarding my suggested changes: even if I am incorrect in my assumption that resp.close() closes out the TCP connection and does not release it back to the pool, there are still two issues preventing RM from pooling connections correctly that I don't believe you have addressed. First, it's sending a "Connection": "close" header that closes the TCP connection once the request has been completed, and second, the frame interval is broken for the default settings, causing more connections than necessary to be created if the user hasn't increased it. These two issues still prevent connection pooling from working, regardless of whether streaming is implemented or not.

Finally, regarding timeouts: the reason I used only 5 seconds in my example is that if connection pooling is implemented correctly, the connection timeout only applies on creation - after that, the connection should be reused. Even if (in a worst-case scenario) an entire request takes 10 seconds to complete for some reason, the requests are still being handled asynchronously. If RM is only sending 2 requests per second over those 10 seconds, it needs approximately 20 open connections, which is still below the current default pool size of 25. Increasing it has no adverse effects (when the requests are handled correctly), but it does give a wider window to avoid issues if a transient delay occurs (however unlikely).
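The sizing argument above is essentially Little's law (concurrent connections ≈ arrival rate × time each request stays in flight); a throwaway sketch of the arithmetic, with the numbers from the example:

```python
# Little's law: connections needed ~= request rate * worst-case duration.
def pool_demand(requests_per_second, worst_case_duration_s):
    return requests_per_second * worst_case_duration_s

DEFAULT_POOL_SIZE = 25  # the current RM default mentioned above

demand = pool_demand(2, 10)  # 2 req/s, each taking up to 10s to complete
print(demand)                # 20
print(demand <= DEFAULT_POOL_SIZE)  # True: still fits in the default pool
```

So even the pessimistic 10-second case leaves headroom; a larger pool only widens the margin against transient delays.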

kvangent avatar Apr 09 '18 16:04 kvangent

@rlodev67 @tomballgithub Sorry for hijacking the repository. This has been going on for weeks, it has caused unnecessary work for our volunteers and it has exhausted our patience.

If you focus on reading only the technical paragraphs, you'll find everything you're looking for to understand why this PR is bad for your users.

sebastienvercammen avatar Apr 09 '18 17:04 sebastienvercammen

@sebastienvercammen You guys should move webhook processing out of the core code path, pushing it off to something like a STOMP or AMQP solution, and then just move the processing code there. Then you internally can just add crap to the queue and forget about it. You're happy - PA can use a consumer that consumes and sends the HTTP webhooks, and people who care about performance can just write an adapter to read directly from the queue.

Can also get cool things like pub/sub with multiple consumers, etc then.

Just a thought
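A minimal in-process sketch of the fan-out being suggested (the Broker class and event fields here are illustrative; a real deployment would use STOMP/AMQP/NATS instead of an in-memory class):

```python
import queue

class Broker:
    """Toy stand-in for a message broker with pub/sub fan-out."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, event):
        # Every subscriber gets its own copy of the event reference.
        for q in self._subscribers:
            q.put(event)

broker = Broker()
http_consumer = broker.subscribe()  # would POST events to PA's webhook URL
db_consumer = broker.subscribe()    # could consume events directly instead

broker.publish({'type': 'pokemon', 'encounter_id': 'abc123'})

http_event = http_consumer.get()
db_event = db_consumer.get()
print(http_event == db_event)  # True: both consumers received the event
```

The point is that RM would publish each event exactly once, and any number of consumers - an HTTP sender for backwards compatibility, a direct adapter for performance - attach independently.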

jmslagle avatar Apr 09 '18 17:04 jmslagle

@jmslagle We've thought about it (e.g. gNATSd handles several million messages per second, and AMQP has more features to work with), but the rework would break all webhook-based third-party projects (of which there are a lot, most of them private).

It's still on the table though (we'd give enough time for those projects to adapt, or implement a fall-back webhook option at first), but it's understandably not the highest priority.

RE: Degrading performance: With PA being the most widely used webhook-related third-party project, and a slow consumer, we knew we'd affect our users' setups significantly. Even if we include the details in the announcement, many of them wouldn't understand what it means, or would update without reading it. Increasing the timeout to 5s is not a sensible decision either way.

In its current implementation, there is not much to think about for webhooks on RM's side. We push events to the queue and let background workers pick up the messages and send them off (at which point they become background I/O handled by the OS and requests). All of this is just moving around some references. It was only when we didn't have timeouts that third-party projects (e.g. PA) could control RM's behavior by stalling it indefinitely if they gave no response.
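As a hypothetical illustration of that pattern (queue size, worker count, and the stubbed-out send are mine, not RM's actual code), the hot path only ever enqueues a reference, and background workers drain the queue and do the blocking I/O:

```python
import queue
import threading

wh_queue = queue.Queue(maxsize=1000)  # the hot path only ever enqueues here
sent = []
sent_lock = threading.Lock()

def send_webhook(event):
    # Stand-in for session.post(url, json=event, timeout=...); recording
    # the event keeps this sketch self-contained.
    with sent_lock:
        sent.append(event)

def wh_worker(q):
    while True:
        event = q.get()
        if event is None:  # shutdown sentinel
            q.task_done()
            return
        try:
            send_webhook(event)
        finally:
            q.task_done()

workers = [threading.Thread(target=wh_worker, args=(wh_queue,))
           for _ in range(4)]
for w in workers:
    w.start()

# The producer (e.g. a scan thread) just moves a reference around:
for i in range(100):
    wh_queue.put({'type': 'pokemon', 'id': i})

wh_queue.join()  # block until every event has been handled
for _ in workers:
    wh_queue.put(None)
for w in workers:
    w.join()

print(len(sent))  # 100
```

Without a send timeout inside the worker, a consumer that never responds would pin a worker forever and eventually back the queue up into the producer - which is the stalling scenario described above.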

Re-using TCP connections (as we originally did, as seen in my code comments) is a good move. It was primarily changed to ensure connections are closed, so a connection is never affected by a slow consumer (and because it reduces time spent on the response content) - though that won't fix PA being overloaded (increased throughput only speeds up PA being overloaded).

sebastienvercammen avatar Apr 09 '18 17:04 sebastienvercammen

I like how Novabot pulls from the map's db.

darkelement1987 avatar Apr 09 '18 18:04 darkelement1987

@sebastienvercammen Just bundle a consumer that sends the current webhooks - problem solved with instant backwards compatibility.

Basically just refactor existing webhook sender to consume from the queue, and make the existing codebase send to the queue.

GNATS is fine too; Choria uses NATS (nats.io).

jmslagle avatar Apr 09 '18 19:04 jmslagle

@jmslagle Yes, that would have the same result. :) We're just painfully aware of the negative response that will come with having to go through the extra setup even if they want to continue to use the old behavior (webhooks).

Just a user/community thing though, nothing technical, and nothing that would stop the implementation. Just venting a little bit.

sebastienvercammen avatar Apr 09 '18 20:04 sebastienvercammen

Ok, after reading everything on this PR:

  • I'm ok to increase the LFU cache size, it won't affect performance.
  • I'm also ok to convert wh_frame_interval to a float; it's a little bug fix.
  • About the connection pooling, I have to read the docs in more detail, but based on your comments, I think both are the same.
  • About setting the timeout to 5s: some tests need to be done. I agree with both of you on this point - 5s to send an HTTP request is too long, but as Deathly said, some people aren't running it on a good server with good bandwidth, etc.

As a result, I don't think this PR will break anything or decrease performance, but some tests need to be made to be sure, and they will be.

Thanks to both of you :)

rlodev67 avatar Apr 13 '18 17:04 rlodev67

@rlodev67 - For an example of the effect between reusing connections in the pool vs closing them, here are some example benchmarks with the effects to PA:

(These were done on a 2x ~2GHz, 2GB VM running Linux, from a crappy laptop over wifi. I used 'hey', a Go-based webserver load tester. "With keep-alive" means "Connection": "keep-alive" was used in the header, whereas "With close" means "Connection": "close". Each request contains 100 events.)

50 connections
  • With keep-alive: 340 r/s, avg response 148ms, total of 29.4s
  • With close: 320 r/s, avg response 158ms, total of 31.1s

100 connections
  • With keep-alive: 348 r/s, avg response 285ms, total of 28.6s
  • With close: 298 r/s, avg response 334ms, total of 33.4s

200 connections
  • With keep-alive: 303 r/s, avg response 602ms, total of 32.9s (1 dropped connection)
  • With close: 308 r/s, avg response 402ms, total of 32.3s (50 dropped connections)

There is a noticeable improvement in requests per second with keep-alive. Additionally, the number of connections causes a significant increase in response time, but not a significant decrease in throughput. PA's 'slow consuming' occurs when it is handling too many connections, which is a familiar problem for web servers. It's still processing them in roughly the same time overall, but it does take longer to respond to individual requests.

We put in a configurable 'hard cap' on incoming connections via the concurrency argument several months ago to prevent the response time from spiraling toward infinity. The default is currently 200, which is why PA starts to drop connections around that level. I'd like the default to be lower (to keep response times down), but with RM's current setup this increases the chances of events being dropped entirely.
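The hard-cap behavior described can be sketched with a non-blocking semaphore (the class name and numbers are illustrative, not PA's actual implementation): requests beyond the cap are rejected immediately instead of queueing and inflating everyone's response time.

```python
import threading

class ConcurrencyCap:
    """Accept at most `cap` in-flight requests; reject the rest outright."""
    def __init__(self, cap=200):
        self._slots = threading.BoundedSemaphore(cap)

    def try_acquire(self):
        # Non-blocking: returns False immediately once the cap is reached,
        # so the response time can't spiral toward infinity.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

cap = ConcurrencyCap(cap=2)
first = cap.try_acquire()   # accepted
second = cap.try_acquire()  # accepted
third = cap.try_acquire()   # over the cap: connection would be dropped
cap.release()               # one in-flight request finishes...
fourth = cap.try_acquire()  # ...so a new one is accepted again
print(first, second, third, fourth)
```

The trade-off is exactly the one stated above: a lower cap keeps latency bounded, but anything arriving while all slots are taken is lost unless the sender retries.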

kvangent avatar Apr 13 '18 22:04 kvangent