hyper
Stream stacking occurs when h2 processes HTTP/2 RST_STREAM frames. As a result, memory and CPU usage are high.
Version: hyper 0.13.7, h2 0.2.4
Platform: WSL
Description
Summary:
Stream stacking occurs when hyper processes HTTP/2 RST_STREAM frames. As a result, memory and CPU usage are high.
Steps:
- Send a HEADERS frame to open a stream, followed by an RST_STREAM frame to request cancellation of that stream:

```python
def reset_flood(h2_conn):
    # Open the odd-numbered streams with a HEADERS frame and immediately
    # cancel each one with an RST_STREAM frame.
    for i in range(1, 20000):
        if i % 2 == 0:
            continue
        headers(h2_conn, i)      # send HEADERS on stream i
        rst_stream(h2_conn, i)   # send RST_STREAM for stream i
```

- Create multiple threads for sending:

```python
import _thread

if __name__ == '__main__':
    # send() is the reporter's sending routine (not shown); presumably it
    # opens a connection to the target server and runs reset_flood() on it.
    for i in range(0, 400):
        try:
            _thread.start_new_thread(send, ("Thread-" + str(i),))
        except:
            print("Error: Can not start thread")
    while 1:
        pass
```
Result:
The CPU usage of the hyper server keeps increasing, and eventually the VM's memory is exhausted.

Vulnerability analysis:
When it receives a HEADERS frame, h2 stores the corresponding stream state in the slab and records the stream's slab index in the ids map. When it receives an RST_STREAM frame, h2 sets the stream to Closed and resets the related counters. However, the stream's memory is released only when stream.is_released() is true, and receiving the RST_STREAM frame does not trigger that release immediately. As a result, the size of the slab increases continuously.
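To make the lifecycle described above concrete, here is a minimal, self-contained sketch of the pattern. It is illustrative only, not h2's actual Store implementation: the names recv_reset, is_released, slab, ids, and Store follow the report, while recv_headers, release_closed, and the field layout are invented for the sketch.

```rust
// Illustrative sketch only (not h2's real code): a slab-backed store in which
// RST_STREAM merely marks a stream Closed, while the slab entry survives
// until the stream is explicitly released.
use std::collections::HashMap;

#[derive(PartialEq)]
enum State { Open, Closed }

struct Stream { state: State, released: bool }

#[derive(Default)]
struct Store {
    slab: Vec<Stream>,          // stand-in for the real slab: owns every stream
    ids: HashMap<u32, usize>,   // stream id -> slab index, active streams only
}

impl Store {
    fn recv_headers(&mut self, id: u32) {
        self.slab.push(Stream { state: State::Open, released: false });
        self.ids.insert(id, self.slab.len() - 1);
    }

    fn recv_reset(&mut self, id: u32) {
        // The reset closes the stream and removes it from the id map, but the
        // slab entry lingers until `released` is set (the is_released() check).
        if let Some(idx) = self.ids.remove(&id) {
            self.slab[idx].state = State::Closed;
        }
    }

    fn release_closed(&mut self) {
        // Entries are reclaimed only once a closed stream is also released.
        self.slab.retain(|s| !(s.state == State::Closed && s.released));
    }
}

fn main() {
    let mut store = Store::default();
    // HEADERS immediately followed by RST_STREAM, as in the reproduction script.
    for id in (1..20_000u32).step_by(2) {
        store.recv_headers(id);
        store.recv_reset(id);
    }
    // Nothing is reclaimed: the streams are Closed but never released.
    store.release_closed();
    println!("active ids: {}, slab entries: {}", store.ids.len(), store.slab.len());
}
```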
Test procedure:
Add slab_len() and ids_len() methods to Store that return the number of all streams and of active streams, respectively.

Log the two preceding lengths in the recv_reset() method.

During the test, when HEADERS frames are repeatedly sent to open streams and RST_STREAM frames are sent to cancel them, the length of ids never exceeds max_recv, but the slab keeps growing. Streams in the Closed state are released from the slab only after all frames on the connection have been sent.
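The report does not include the instrumentation patch itself, but on the toy Store sketched earlier the described counters would look roughly like this (in real h2 the methods would sit on the internal Store type, with the logging added inside recv_reset(); the wrapper function here is only a stand-in for that):

```rust
// Hypothetical instrumentation on the toy Store above, mirroring the
// slab_len()/ids_len() helpers described in the report.
impl Store {
    /// Total number of stream entries currently held, active or closed.
    fn slab_len(&self) -> usize {
        self.slab.len()
    }

    /// Number of streams still active, i.e. present in the id map.
    fn ids_len(&self) -> usize {
        self.ids.len()
    }
}

// Logging both values on every reset makes the divergence visible:
// ids_len() stays bounded by the concurrency limit, while slab_len() grows.
fn recv_reset_with_logging(store: &mut Store, id: u32) {
    store.recv_reset(id);
    println!("slab_len = {}, ids_len = {}", store.slab_len(), store.ids_len());
}
```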

The max_concurrent_streams configuration can limit max_recv_streams, but in this scenario the slab grows far beyond the max_concurrent_streams value and keeps stacking up.
To prevent such attacks, I think it is necessary to either limit the size of the slab or ensure that streams in the slab are released promptly after an RST_STREAM frame is received.
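For reference, the concurrency limit mentioned above is set on the connection builder. A hedged sketch against the h2 server builder API follows: max_concurrent_streams is a documented builder method, while the function name, the Bytes body type annotation (needed only to pin down inference), and the value 128 are illustrative. As the report argues, this setting bounds the active streams in ids but, at the time, not the closed-but-unreleased entries in the slab.

```rust
// Sketch of setting the stream-concurrency limit on an h2 server connection.
use bytes::Bytes;
use tokio::net::TcpStream;

async fn handshake_with_limit(
    socket: TcpStream,
) -> Result<h2::server::Connection<TcpStream, Bytes>, h2::Error> {
    h2::server::Builder::new()
        // Caps how many streams may be open concurrently. Per the report,
        // this did not cap the slab of streams that had already been reset.
        .max_concurrent_streams(128)
        .handshake(socket)
        .await
}
```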
@seanmonstar What do you think of this issue? I think hyper has some problems in dealing with this scenario.
Sounds like something we could try to fix. What is making the slab hold onto the memory? My guess is that it needs to hold the stream info while there is an Event that can be polled, so that a user accepting requests will see it was reset, is that it?
@silence-coding thanks for your code. @seanmonstar please check, thanks.
Please assign a CVE ID to this vulnerability.
CVE-2023-26964
GH Dependabot is now also raising a Security alert
https://github.com/advisories/GHSA-f8vr-r385-rh5r
Is there anything we could do as reqwest library users to mitigate the attack surface?
Um. Ok. I'm sure you just wanted to see this fixed. But this is NOT how to do things. Filing a drive-by CVE like this is like seeing a lighter in a school and pulling the fire alarm.
There is a SECURITY policy if you ever feel something is a vulnerability. Then we can better coordinate.
Be Kind. That includes considering we're all humans contributing, and respecting time is part of being kind.
I'm also extremely disappointed that @github facilitated this.
That said, fire alarm time.
If you're here from a dependabot warning, or similar: I'll be working on this today. I will update this issue when there is a version you can upgrade to. Some things you can do to determine if you can just safely ignore the dependabot warning:
- If you're not using HTTP/2, ignore.
- If you're using a client, with HTTP/2, but you're talking to your own servers, or you at least control the URLs to talk to servers you trust to not attack your client (most businesses), ignore.
The fix will likely be an h2 version, hyper will not need to be changed.
This CVE is going to cause a flurry of unnecessary emergency response work across the ecosystem - even for use cases that are totally unaffected by the bug. Every application that depends on Hyper is now going to have to ship out-of-cycle fixes to address the CVE, even if the application has nothing to do with HTTP/2.
Going through the documented SECURITY process isn't a feel good exercise: it allows stakeholders to coordinate about the severity and breadth of an issue's impact.
Not to pile on @github's security team, the actual CVE is "better" insofar as it says "awaiting analysis"... https://nvd.nist.gov/vuln/detail/CVE-2023-26964
The GH Security Advisory, in contrast, is unnecessarily alarming with its High Severity label... https://github.com/advisories/GHSA-f8vr-r385-rh5r
An update: from what I can tell, the report is that the slab store of streams grows because it's holding two events, [HEADERS, RESET] essentially, so that the user can notice them when they call conn.accept(). I have the skeleton of a unit test, it's not that great, but it lets me adjust the conditions a bit: https://github.com/hyperium/h2/compare/sean/reset-streams-growing.
It appears that as long as the server is accepting requests, the slab is actually cleared out. Looking in the original report, I can't find what the server is doing (it's target/release/server2?).
Also, here's the chat thread for this: https://discord.com/channels/500028886025895936/1095696565689122967
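For readers following along, the user-facing pattern being discussed is roughly the standard h2 server accept loop below. This is a sketch based on the documented h2 server API, not hyper's internals or the reporter's server; the address, spawning strategy, and empty response body are illustrative. As long as a loop like this keeps polling accept(), the queued (HEADERS, RESET) events are observed and the associated stream state can be dropped; the slab only piles up if the loop falls behind.

```rust
// Minimal h2 server accept loop (sketch; error handling trimmed). Each stream
// is handed off to its own task so the accept loop itself stays tight.
use http::Response;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;

    loop {
        let (socket, _peer) = listener.accept().await?;
        tokio::spawn(async move {
            // One HTTP/2 connection per TCP connection.
            let mut conn = h2::server::handshake(socket).await?;

            // Keep accepting streams; reset streams surface here too, which is
            // what lets their slab entries be cleared out.
            while let Some(result) = conn.accept().await {
                let (_request, mut respond) = result?;
                tokio::spawn(async move {
                    let _ = respond.send_response(Response::new(()), true);
                });
            }
            Ok::<_, h2::Error>(())
        });
    }
}
```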
This CVE is going to cause a flurry of unnecessary emergency response work across the ecosystem
What do you mean with "unnecessary"?
Finding out whether or not you are affected isn't unnecessary. Updating to an unaffected version is not possible, because there is no unaffected version as of now; every version of hyper is potentially vulnerable. The only thing people can do is to disable the "h2" feature if they have it enabled, which would prevent this attack.
What do you mean with "unnecessary"?
The whole point of a SECURITY policy is to respond in a reasonable way - reproduce, identify scope, write tests, prepare a fix, etc. Should we really report a CVE before confirming with the maintainers that it's 1. reproducible and 2. something that has an available fix ready to deploy? It would have been better to follow the requested SECURITY policy so that the response could be more coordinated. It's better to be sure first before declaring a vulnerability.
Exactly; additionally, there is no world in which this is a "high severity" CVE. You can use this tool to calculate the severity.
This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.
I agree that it would have been better to report the vulnerability in private, but since it's already been public for 11 months, it's too late for that. Complaining now that people didn't follow proper procedures is disingenuous at best, because there was enough time to fix this vulnerability before the CVE was filed, and it was clear that the maintainers didn't see it as a priority.
The reason why vulnerabilities should be kept secret until there's a fix is so that attackers don't learn about the vulnerability before the public had a chance to fix it. This does not apply to this vulnerability, however, because it was already public; an attacker could easily find the vulnerability in the GitHub issue tracker.
Mod here: please use more empathy. It's fine to disagree, but consider the difference between your comment and the others that "disagree", and why they weren't hidden.
This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.
It is still very unclear whether there is a vulnerability or an issue. The submitter claims there is one but has yet to provide a reproduction showing how they achieved that result. The maintainers have yet to reproduce it under normal conditions. The only way to reproduce seems to be either a) misusing h2 or b) an application that lags behind and is explicitly asking for the state to be held (though it could be that the implication of the API is unclear).
This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.
We don't even have an official repro case. That's the entire point of the SECURITY policy - ensure the problem exists, realize its severity, etc. It's not to avoid blame of a vulnerability - just to better coordinate the response.
Does the issue not show how to reproduce this? I'm sorry, but I don't exactly blame the person who made the CVE here. This was reported and was marked as moderate on the CVE website.
I can't imagine it's that rare of a situation for people to use http2 streams here, so I don't know why it's being minimized so much.
"Be Kind" should also apply to not scolding your community members. I'm sure this wasn't meant out of malice and was simply meant to bring awareness to the issue at hand.
That being said, I'm seeing people refer to it being marked as high severity, but both GitHub and NIST are reporting it as moderate, which I think is fair.
That being said, I'm seeing people refer to it being marked as high severity, but both GitHub and NIST are reporting it as moderate, which I think is fair.
GitHub initially had it marked as high. It was updated an hour ago:
https://github.com/github/advisory-database/commit/aa9e5d5386c5610944edf2b0ee0e4301aabaf1c5
Does the issue not show how to reproduce this? I'm sorry, but I don't exactly blame the person who made the CVE here. This was reported and was marked as moderate on the CVE website.
No, it looks like the maintainers/contributors were having a hard time figuring out exactly how to reproduce it/where the vulnerability is, or if it's just an API misuse. As I understand from the fire alarm Discord thread, the large number of requests needed to reproduce it in the OP's case also confuses things. In general, it's not very clear how/where the vulnerability lies, and this is emergency work that the maintainers now have to perform, without even a super concise repro case, instead of it being disclosed properly.
It's generally good practice to at least mention to the maintainer that you'll be filing a CVE, even if just to give advance warning, but also just to see if you are misunderstanding it. Even if it turns out to be severe, it's still best practice to communicate with maintainers about it for all of the reasons above.
Does the issue not show how to reproduce this?
It unfortunately does not. About 4 people have been working on this all day trying to figure out what the actual reproduction of this is. This makes it quite stressful to just guess at what a fix is, when we can't confirm that's what they meant in the first place.
I'm sure this wasn't meant out of malice and was simply meant to bring awareness to the issue at hand.
I acknowledged in my first comment this morning that I assume the person just wanted it fixed, not assuming they were being evil. But this is not the right way to get attention. There were several more steps that could have been taken.
Another update: this should be classified as low severity. The attack requires sufficient bandwidth to overload Hyper's accept loop. The original submitter used 400 threads over localhost against hyper's 1 thread. This would not be easy to trigger over a network (though possible). Additionally, systems return to normal once the attack completes.
In practice, I don't expect many to be vulnerable to such an attack due to the difficulty of triggering it over a remote network, and in terms of severity, a botnet attack triggering a DoS by overloading the pipe would be a greater risk.
That said, the Hyper maintainers are currently investigating a patch that could mitigate this specific attack.
Exactly; additionally, there is no world in which this is a "high severity" CVE. You can use this tool to calculate the severity.
So this might get a little off topic, but I actually took the time to compute a score and ended up at a whopping 7.1 in the 3.1 system, which is pretty high. Feel free to debate (hope the link works):
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H/E:U/RL:U/RC:C/CR:L/IR:L/AR:H/MAV:N/MAC:H/MPR:N/MUI:N/MS:U/MC:N/MI:N/MA:H&version=3.1
feel free to debate
I'd rather not debate scores from the CVSS calculator here. Different people trust it differently, and I'd like people subscribed to this issue to be able to just get updates of the actual problem, if it exists, what it is, and when there is a fix.
(So I'll be hiding yours and my comment.)
Another update: this should be classified as low severity. The attack requires sufficient bandwidth to overload Hyper's accept loop. The original submitter used 400 threads over localhost against hyper's 1 thread. This would not be easy to trigger over a network (though possible). Additionally, systems return to normal once the attack completes.
In practice, I don't expect many to be vulnerable to such an attack due to the difficulty of triggering it over a remote network, and in terms of severity, a botnet attack triggering a DoS by overloading the pipe would be a greater risk.
That said, the Hyper maintainers are currently investigating a patch that could mitigate this specific attack.
This is all very good to know, nice work.
Just one question: why is it not easy to trigger 400 client threads over the network against 1 server thread? Seems dead simple and is what every ddos framework can do, no? Am I misinterpreting your answer maybe?
Thanks for all the work from everyone involved again! It's appreciated!
It isn't a matter of threading, but of bandwidth. You can only send as many bytes as the pipe allows, and that has to be enough to saturate the hyper accept loop.
This is easy enough to do over localhost, where bandwidth is arbitrarily high, but not so much over the public internet.
why is it not easy to trigger over the network against 1 server thread?
This is our current theory (we can't confirm it's what the original reporter meant): if you can flood the network faster than the server is able to "accept" requests, the queue of pending-to-accept-but-reset requests can keep growing. This is because the accept method is basically read_from_socket(); yield_pending_request(). The task in hyper that is "accepting" requests is very tight: it's simply focused on accept(); spawn_task_to_handle_request(). If you could fill up the TCP read buffer again with a bunch more requests before we ran that simple loop, over and over and over, that could keep growing the internal queue.
So, if it is over the network, not local, then latency, backpressure, and TCP flow control would make it harder to fill up the server's buffer faster than it can spawn a task to handle the request.
But again, this is just our best guess.
If that is indeed the issue, this PR means to fix it: https://github.com/hyperium/h2/pull/668
The PR seems close. But, given that I've been at this for 14 hours, I'm going to go knock out. Comparing the severity with the possibility of breaking something just before being unresponsive, it doesn't seem worth publishing right at this moment.
Release PR is in CI right now: https://github.com/hyperium/h2/pull/670