workers-rs
[BUG] WebSocket sends the wrong data
Is there an existing issue for this?
- [X] I have searched the existing issues
What version of workers-rs are you using?
0.0.18
What version of wrangler are you using?
3.6.0
Describe the bug
`WebSocket::send_with_bytes` reproducibly corrupts outgoing binary data in some situations. I have included a repro where `[00 01 02]` becomes `[65 00 01]`.
Repro repo: https://github.com/paulgb/cloudflare-repro/tree/main
I've been able to reproduce this on Mac and Linux.
Steps To Reproduce
- Clone https://github.com/paulgb/cloudflare-repro.git
- `npm i && npm run dev`
- Open another terminal and `cd` into `client`
- `npm i && npx ts-node ./client.ts`

Observe that the output is `received: <Buffer 65 00 01>` rather than `received: <Buffer 01 02 03>`.
I hit the same issue. After some investigation it seems the bug is in `web-sys` itself, and may be related to how `wasm_bindgen` passes `&[u8]` to `WebSocket.send()` on the JS side. This bug doesn't occur if you manually create an array buffer first:
```rust
let uint8_array = Uint8Array::from(data.as_slice());
ws.as_ref().send_with_array_buffer(&uint8_array.buffer())?;
```
This bug is still present in the latest release (0.0.20); binary messages are always corrupted on send.
I applied the patch above to `Worker::Websocket::send_with_bytes` and can confirm it fixes the corruption, though it would be nice to track down the root cause.
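For anyone who wants to apply the same workaround without patching the crate, here is a self-contained sketch of it against `web_sys` directly (the helper name is mine, and it assumes the `web-sys` crate with its `WebSocket` feature enabled):

```rust
use js_sys::Uint8Array;
use wasm_bindgen::JsValue;
use web_sys::WebSocket;

/// Copy the bytes into a JS-owned buffer before sending, so the runtime never
/// reads Wasm linear memory after this call returns.
fn send_bytes_copied(ws: &WebSocket, data: &[u8]) -> Result<(), JsValue> {
    // Uint8Array::from(&[u8]) allocates a new JS-side typed array and copies
    // the slice into it; its backing ArrayBuffer outlives our Rust data.
    let copy = Uint8Array::from(data);
    ws.send_with_array_buffer(&copy.buffer())
}
```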
Is this still an issue? I created https://github.com/cloudflare/workers-rs/pull/523 to upstream this patch. Does something similar need to be done to `send_with_str`?
I last checked on 0.0.20, and unless something has changed in recent weeks I would expect it's still a problem. It seems to affect every (binary) message, so it's not hard to reproduce.
My main concern is that we don't understand this bug very well. Is this a problem with `web_sys`? I am skeptical that could be the case, since it would be broken in web browsers as well.
If `WebSocket::send_with_u8_array` is only broken in the Cloudflare environment, then why is that? Is there a lower-level bug in the runtime? Is the corrupted data a symptom of reading uninitialized or unrelated memory (which could have security implications)? It's a bit disconcerting to paper over the bug without understanding why it's happening.
If it's impractical to investigate further, then applying #523 would be an improvement.
In my experience `send_with_str` works fine (no corruption).
I'm actually unable to reproduce this locally using the repo shared above. I initially tried the latest version of everything, and then tried the original versions contained in the repo. In every case I get the correct byte sequence output. What OS are you using and could you confirm the {wrangler, miniflare, workerd} versions that are demonstrating this behavior? Thanks!
I made a repository that shows the corruption here: https://github.com/eric-seppanen/websocket_corruption
I initially tried it on wrangler 3.30.1 but I also tried upgrading to wrangler 3.43.0 and the same corruption occurs.
I'm running this on Ubuntu 22.04; I'm not sure how to view the version of miniflare/workerd.
The data corruption happens every time I run it, and the bytes I receive are fairly consistent when running `wrangler dev`, but they are different when I run `wrangler dev --remote`.
I sent a binary message containing the string `hello world, binary message`. The worker receives the bytes and prints them correctly:

```
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 97, 103, 101]
```
It then calls `send_with_bytes` on the same `Vec<u8>` that was just printed. The client receives:

```
[156, 105, 19, 0, 156, 105, 19, 0, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 32, 0, 0]
```
Note that the first 8 bytes are corrupted, the next 16 bytes are correct, and the last 3 bytes are corrupted.
If I run with `wrangler dev --remote` the output changes from one run to the next. For example, I saw these two outputs the last few times I ran it:

```
[244, 105, 19, 0, 244, 105, 19, 0, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 32, 0, 0]
[16, 0, 20, 0, 16, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 108, 104, 19, 0, 0, 0, 0, 0, 32, 0, 0]
```
They both show similar patterns but the second one looks like the data got overwritten by zeros before the corruption pattern appeared.
I am also seeing server messages that say `workerd/server/server.c++:3309: error: Uncaught exception: kj/compat/tls.c++:381: disconnected: worker_do_not_log; Request failed due to internal error`, but I'm not sure if that's relevant. It seems to happen on WebSocket close.
> I'm not sure how to view the version of miniflare/workerd.

You should be able to run `node_modules/workerd/bin/workerd --version`. Miniflare I guess you may not be using.
I have a few copies of my project lying around, and it looks like I've used a few different versions of workerd: `2023-12-18`, `2024-03-20`, and for the `websocket_corruption` project I posted above I used `2024-03-29`.
Ok, I'm able to reproduce your new POC.
What we believe is happening is that Rust assumes the runtime will fully clone the provided buffer when `send` is called. Our runtime does not clone, as this is generally wasteful, and instead asynchronously sends the original buffer. This behavior may differ in other runtimes, but this is the first time we have found it to be an issue. This would apply in any case where the ArrayBuffer is mutated right after calling `send`. In JavaScript this is generally not an issue unless the user explicitly mutates the buffer, as the memory will not be freed while the runtime holds a strong reference to it. In the case of Wasm, the memory is opaque to the runtime, and Rust appears to free the buffer right away after calling `send`.
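Concretely, the hazard described above looks something like this on the Rust side (a hypothetical minimal example against `web_sys`, not the actual workers-rs code):

```rust
use wasm_bindgen::JsValue;
use web_sys::WebSocket;

fn send_then_free(ws: &WebSocket, data: Vec<u8>) -> Result<(), JsValue> {
    // The generated bindings hand the runtime a view directly into Wasm
    // linear memory for the duration of the send.
    ws.send_with_u8_array(&data)?;
    // `data` is dropped when this function returns. If the runtime only reads
    // the buffer asynchronously, Rust may already have reused that region of
    // linear memory, so the bytes actually transmitted are whatever happens
    // to be there at that point.
    Ok(())
}
```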
I don't believe this is a security issue (only the Wasm linear memory is accessed), but it is probably unintended behavior for those calling `web_sys::WebSocket::send_with_u8_array` directly.
In the case of `workers-rs`, I believe the patch you suggested is a good fix because it explicitly forces this clone into a JavaScript object.
cc @kentonv
Thanks for investigating this. It's an interesting bug!
If I understand correctly, the runtime is reading that memory concurrently with the wasm code freeing and reusing it.
What protects us from this exact bug coming back? (I assume future maintainers may not know the history of this problem.) Should there be a big comment next to any code that transfers a buffer to a WebSocket? This should set off alarms for Rust engineers, who have become accustomed to not needing such load-bearing comments in safe code.
What if someone adds another send function in the future? How do they know what is allowed? Normally we could rely on the Rust type system to uphold ownership and lifetime guarantees, but if the runtime and `web_sys` can't agree on what those guarantees are, then the compiler can't protect us, which undermines a big reason to use Rust in the first place.
I doubt that the optimizer would learn how to elide unnecessary-looking memory copies in FFI interfaces, but if I'm wrong, could this fail again on a newer Rust toolchain?
Are there other places where the runtime makes this assumption, where lifetime/ownership assumptions are different from other runtimes (and from `web_sys`)?
If there are widespread disagreements over memory ownership or lifetimes between `web_sys` and the Workers runtime, then it seems like the robust fix would be for one or the other to change so that they agree on the lifetime of that memory. Maybe the `worker` crate should fork the affected `web_sys` modules?
I'm not sure that I'm going to be able to answer all of these questions here. I will audit the places where `&[u8]` is passed to JS and add comments.
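As an illustration of what such a comment could look like, one option would be to funnel every outgoing buffer through a single documented helper, roughly like this (the helper is hypothetical, not existing workers-rs API):

```rust
use js_sys::{ArrayBuffer, Uint8Array};

/// IMPORTANT: the Workers runtime may keep reading an outgoing buffer
/// asynchronously, after `send` has already returned. Never pass a view into
/// Wasm linear memory (e.g. a `&[u8]`) straight to a JS send method; copy it
/// into a JS-owned buffer with this helper first.
fn to_js_owned_buffer(data: &[u8]) -> ArrayBuffer {
    // Uint8Array::from copies the slice into a fresh JS-side allocation.
    Uint8Array::from(data).buffer()
}
```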
> I will audit places that `&[u8]` is passed to JS

What about methods in `web_sys` that aren't part of the `workers` crate but might be used directly by Wasm code?
For example, if I use methods on `SubtleCrypto`, do I need to worry about similar issues? Lots of those methods accept `&[u8]` parameters. (Not theoretical: I am doing this in my worker code today.)
I suppose `SubtleCrypto` is a bad example, because when it returns it has completed its task. WebSocket send is sort of unique in that it has ongoing work to do in the background after the call returns. There probably aren't that many places where that happens.
A possible (hacky) solution: we could change the runtime so that it automatically clones buffers any time the passed-in buffer is a view into a larger memory area (such as Wasm memory). That is a strong hint that the caller might plan to reuse the buffer for other things, so reading it asynchronously will lead to trouble.