workers-rs icon indicating copy to clipboard operation
workers-rs copied to clipboard

[BUG] WebSocket sends the wrong data

Open paulgb opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What version of workers-rs are you using?

0.0.18

What version of wrangler are you using?

3.6.0

Describe the bug

WebSocket::send_with_bytes reproducibly corrupts outgoing binary data in some situations. I have included a repro where [00 01 02] becomes [65 00 01].

Repro repo: https://github.com/paulgb/cloudflare-repro/tree/main

I been able to reproduce this on Mac and Linux.

Steps To Reproduce

  1. Clone https://github.com/paulgb/cloudflare-repro.git
  2. npm i && npm run dev
  3. open another terminal and cd into client
  4. npm i && npx ts-node ./client.ts

Observe that the output is received: <Buffer 65 00 01> rather than received: <Buffer 01 02 03>.

paulgb avatar Aug 26 '23 17:08 paulgb

I hit the same issue. After some investigation it seems the bug is in web-sys itself, and may be related to how wasm_bindgen passes &[u8] to the Websocket.send() on the js side. This bug doesn't occur if you manually create an array buffer first:

let uint8_array = Uint8Array::from(data.as_slice());
ws.as_ref().send_with_array_buffer(&uint8_array.buffer())?;

carlsverre avatar Sep 01 '23 21:09 carlsverre

This bug is still present in the latest release (0.0.20); binary messages are always corrupted on send.

I applied the patch above to Worker::Websocket::send_with_bytes and can confirm it fixes the corruption, though it would be nice to track down what the root cause is.

eric-seppanen avatar Mar 12 '24 21:03 eric-seppanen

Is this still an issue? I created https://github.com/cloudflare/workers-rs/pull/523 to upstream this patch. Does something similar need to be done to send_with_str?

kflansburg avatar Apr 01 '24 15:04 kflansburg

I last checked on 0.0.20 and unless something has changed in recent weeks then I would expect that it's still a problem. It seems to affect every (binary) message, so it's not hard to reproduce.

My main concern is that we don't understand this bug very well. Is this a problem with web_sys? I am skeptical that could be the case, since it would be broken on web browsers as well.

If WebSocket::send_with_u8_array is only broken in the CloudFlare environment, then why is that? Is there a lower-level bug in the runtime? Is the corrupted data a symptom of reading uninitialized memory or unrelated memory (which could have security implications)? It's a bit disconcerting to paper over the bug without an understanding why it's happening.

If it's impractical to investigate further, then applying #523 would be an improvement.

In my experience send_with_str works fine (no corruption).

eric-seppanen avatar Apr 01 '24 17:04 eric-seppanen

I last checked on 0.0.20 and unless something has changed in recent weeks then I would expect that it's still a problem. It seems to affect every (binary) message, so it's not hard to reproduce.

My main concern is that we don't understand this bug very well. Is this a problem with web_sys? I am skeptical that could be the case, since it would be broken on web browsers as well.

If WebSocket::send_with_u8_array is only broken in the CloudFlare environment, then why is that? Is there a lower-level bug in the runtime? Is the corrupted data a symptom of reading uninitialized memory or unrelated memory (which could have security implications)? It's a bit disconcerting to paper over the bug without an understanding why it's happening.

If it's impractical to investigate further, then applying #523 would be an improvement.

In my experience send_with_str works fine (no corruption).

I'm actually unable to reproduce this locally using the repo shared above. I initially tried the latest version of everything, and then tried the original versions contained in the repo. In every case I get the correct byte sequence output. What OS are you using and could you confirm the {wrangler, miniflare, workerd} versions that are demonstrating this behavior? Thanks!

kflansburg avatar Apr 01 '24 18:04 kflansburg

I made a repository that shows the corruption here: https://github.com/eric-seppanen/websocket_corruption

I initially tried it on wrangler 3.30.1 but I also tried upgrading to wrangler 3.43.0 and the same corruption occurs.

I'm running this on Ubuntu 22.04; I'm not sure how to view the version of miniflare/workerd.

The data corruption happens every time I run it, and the bytes I receive are fairly consistent when running wrangler dev but they are different when I run wrangler dev --remote.

eric-seppanen avatar Apr 02 '24 18:04 eric-seppanen

I sent a binary message containing the string hello world, binary message. The worker receives the bytes, and prints them correctly:

[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 97, 103, 101]

It then calls send_with_bytes on the same Vec<u8> that was just printed. The client receives:

[156, 105, 19, 0, 156, 105, 19, 0, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 32, 0, 0]

Note that the first 8 bytes are corrupted, the next 16 bytes are correct, and the last 3 bytes are corrupted.

If I run with wrangler dev --remote the output changes from one run to the next. For example, I saw these two outputs the last few times I ran it:

[244, 105, 19, 0, 244, 105, 19, 0, 114, 108, 100, 44, 32, 98, 105, 110, 97, 114, 121, 32, 109, 101, 115, 115, 32, 0, 0]
[16, 0, 20, 0, 16, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 108, 104, 19, 0, 0, 0, 0, 0, 32, 0, 0]

They both show similar patterns but the second one looks like the data got overwritten by zeros before the corruption pattern appeared.

I am also seeing server messages that say workerd/server/server.c++:3309: error: Uncaught exception: kj/compat/tls.c++:381: disconnected: worker_do_not_log; Request failed due to internal error but I'm not sure if that's relevant? It seems to happen on websocket close.

eric-seppanen avatar Apr 02 '24 18:04 eric-seppanen

I'm not sure how to view the version of miniflare/workerd.

You should be able to look in node_modules/workerd/bin/workerd --version. Miniflare I guess you may not be using.

kflansburg avatar Apr 03 '24 02:04 kflansburg

I have a few copies of my project laying around, and it looks like I've used a few different versions of workerd: 2023-12-18, 2024-03-20, and the websocket_corruption project I posted above I used 2024-03-29.

eric-seppanen avatar Apr 03 '24 02:04 eric-seppanen

Ok, I'm able to reproduce your new POC.

What we believe is happening is that Rust is assuming that the runtime will fully clone the provided buffer when send is called. Our runtime does not clone as this is generally wasteful, and instead asynchronously sends the original buffer. This behavior may differ in other runtimes, but this is the first time we have found it to be an issue. This would apply in any case where the ArrayBuffer is mutated right after calling send. In JavaScript, this is generally not an issue unless the user explicitly mutates the buffer, as the memory will not be freed due to the runtime holding a strong reference to the buffer. In the case of Wasm, the memory is opaque to the runtime and Rust appears to free the buffer right away after calling send.

I don't believe this is a security issue (only the Wasm linear memory is accessed), but it is probably unintended behavior for those calling web_sys::Websocket.send_with_u8_array directly.

In the case of workers-rs, I believe the patch that you suggested is a good fix because it explicitly forces this clone into a JavaScript object.

cc @kentonv

kflansburg avatar Apr 03 '24 15:04 kflansburg

Thanks for investigating this. It's an interesting bug!

If I understand correctly, the runtime is reading that memory concurrently with the wasm code freeing and reusing it.

What protects us from this exact bug coming back? (I assume future maintainers may not know the history of this problem.) Should there be a big comment next to any code that transfers a buffer to a websocket? This should set off alarms to rust engineers, who have become accustomed to not needing such load-bearing comments in safe code.

What if someone adds another send function in the future? How do they know what is allowed? Normally we could rely on the rust type system to uphold ownership and lifetime guarantees, but if the runtime and web_sys can't agree on what those guarantees are, then the compiler can't protect us, which is a big reason to use Rust in the first place.

I doubt that the optimizer would learn how to elide unnecessary-looking memory copies in FFI interfaces, but if I'm wrong, could this fail again on a newer rust toolchain?

Are there other places where the runtime makes this assumption, where lifetime/ownership assumptions are different from other runtimes (and from web_sys)?

If there are widespread disagreements over memory ownership or lifetimes between web_sys and the workers runtime, then it seems like the robust fix would be for one or the other to change so that they agree on the lifetime of that memory. Maybe the worker crate should fork the affected web_sys modules?

eric-seppanen avatar Apr 03 '24 19:04 eric-seppanen

Im not sure that I’m going to be able to answer all of these questions here. I will audit places that ‘&[u8]’ is passed to JS and add comments.

kflansburg avatar Apr 03 '24 19:04 kflansburg

I will audit places that ‘&[u8]’ is passed to JS

What about methods in web_sys that aren't part of the workers crate but might be used directly by wasm code?

For example, If I use methods on SubtleCrypto, do I need to worry about similar issues? Lots of those methods accept &[u8] parameters. (Not theoretical: I am doing this in my worker code today.)

eric-seppanen avatar Apr 03 '24 19:04 eric-seppanen

I suppose SubtleCrypto is a bad example, because when it returns it has completed its task. Websocket send is sort of unique in that it has ongoing work to do in the background, after the call returns. There probably aren't that many places where that happens.

eric-seppanen avatar Apr 03 '24 19:04 eric-seppanen

A possible (hacky) solution: We could change the runtime so that it automatically clones buffers any time that the passed-in buffer is a view into a larger memory area (such as Wasm memory). This is a strong hint that the caller might plan to reuse the buffer for other things and so reading it asynchronously will lead to trouble.

kentonv avatar Apr 15 '24 19:04 kentonv