
🐛 wrangler dev error: Received signal #11: Segmentation fault: 11

ds300 opened this issue 2 years ago • 26 comments

Hello!

Recently we've been seeing this issue with one of our worker scripts (we have three; the other two seem fine) where, after some time running in dev mode, it will segfault and become unresponsive. This seems to happen anywhere from 10 seconds to 10 minutes after booting. I have not been able to find a way to reproduce it on demand, but it will happen every time, at some point.

This is on macOS, using M1 chips.

The full error printed when this happens is (with --log-level debug):

*** Received signal #11: Segmentation fault: 11
stack: 

There are always other errors logged above, but they change every time and may be unrelated, since they appear constantly regardless of how long it takes to hit the segfault. Here's an example (again with --log-level debug):

workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 10529a004 10506a108 10506a517 104f40d78
workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 10529a004 10506a108 10506a517 104f40d78
workerd/io/worker-entrypoint.c++:218: info: exception = kj/compat/http.c++:2673: disconnected: WebSocket disconnected between frames without sending `Close`.
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 105824d97 10689dd88 105842d8c 10529a004 10506a108 10506a517 104f40d78
workerd/server/server.c++:2838: error: Uncaught exception: kj/compat/http.c++:2673: disconnected: worker_do_not_log; Request failed due to internal error
stack: 105822c40 10689dd88 105824bd8 105824d97 10689dd88 105842d8c 105824bd8 105824d97 10689dd88 105842d8c 10529a004 10506a108 10506a517 104f40d78 104a85c38 10584ebdc 10584fb84 10

workerd/io/worker-entrypoint.c++:218: info: exception = kj/async.c++:220: disconnected: other end of WebSocketPipe was destroyed

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = P��-p��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = `�-(��@��-$(���|�.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = �

stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104

workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component

stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = 0�- -.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = (o� L2
P�-�0��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:306: failed: expected part.findFirst('/') == kj::none [(can't stringify) == (can't stringify)]; '/' character in path component; did you mean to use Path::parse()?; part = http://fake-host/w/27a79965-642c-44eb-8bd4-0e857bb0554e/presence.sqlite
stack: 1068e91a7 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = (o� L2
P�-�0��-��-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = P��-`�-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [0 == 71]; NUL character in path component; part = �ٙ-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [5 == 71]; NUL character in path component; part = @\�-�ٙ-.sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104f51d07 104f5189f 104a6c59b 104a694eb 104a7143f 104a70f77 104a841c8 10529d3ec 10529e930 104d2f034 104d2f38c
workerd/server/alarm-scheduler.c++:203: warning: exception = kj/filesystem.c++:304: failed: expected strlen(part.begin()) == part.size() [0 == 71]; NUL character in path component; part = .sqlite
stack: 1068e913b 1068e9357 104a6d39f 104a6d26f 104f72ddb 104f729a7 104
*** Received signal #11: Segmentation fault: 11
stack:

Let me know if any other info I can provide would be helpful.

For now I'm going to try to wrap the wrangler dev process in a retry loop.

ds300 avatar Nov 21 '23 14:11 ds300

Also seeing this (M1), but I don't use alarms at all, so I don't get any of the additional logs you provided.

Maybe it's just something with Durable Objects in general? I'm also using Hibernatable WebSockets, if that helps narrow it down.

I thought I was going crazy 😅

thecatanon avatar Nov 23 '23 05:11 thecatanon

We are also using the WebSocket Hibernation API.

ds300 avatar Nov 23 '23 10:11 ds300

In case anybody else needs it, here is the script I wrote to wrap wrangler and restart it after a segfault:

dev.ts

import { ChildProcessWithoutNullStreams, spawn } from 'child_process'
import stripAnsi from 'strip-ansi'

class WranglerMonitor {
	private process: ChildProcessWithoutNullStreams | null = null

	public start(): void {
		this.stop() // Ensure any existing process is stopped
		console.log(`Starting wrangler...`)
		this.process = spawn('wrangler', ['dev', '--env', 'dev'], {
			env: {
				NODE_ENV: 'development',
				...process.env,
			},
		})

		this.process.stdout.on('data', (data: Buffer) => {
			this.handleOutput(stripAnsi(data.toString().replace('\r', '').trim()))
		})

		this.process.stderr.on('data', (data: Buffer) => {
			this.handleOutput(stripAnsi(data.toString().replace('\r', '').trim()), true)
		})
	}

	private handleOutput(output: string, err = false): void {
		if (!output) return
		if (output.includes('Segmentation fault')) {
			console.error('Segfault detected. Restarting Wrangler...')
			this.restart()
		} else if (!err) {
			console.log(output.replace('[mf:inf]', ''))
		}
	}

	private restart(): void {
		console.log('Restarting wrangler...')
		this.stop()
		setTimeout(() => this.start(), 100) // Restart after a short delay
	}

	private stop(): void {
		if (this.process) {
			this.process.kill()
			this.process = null
		}
	}
}

new WranglerMonitor().start()

Run it with npx tsx <filename>. Bun might work too.

There's probably a bash one-liner that can do the same thing, but I am not bashfully gifted.

ds300 avatar Nov 23 '23 10:11 ds300

This looks like the same bug as #1386.

kentonv avatar Nov 27 '23 19:11 kentonv

This looks like the same bug as #1386.

Unfortunately I'm still seeing this on v1.20231206.0, which closes that issue, but I'm also not using alarms. Once I have some spare time I'll put together a minimal reproduction and more detailed system stats, as it happens extremely often for me.

thecatanon avatar Dec 09 '23 04:12 thecatanon

I've spent way too much time trying to get a reproduction that is consistent - there isn't one.

Here's a log from one of the crashes, hopefully I left enough unredacted: crash.log

This appears to only be affecting M1 chips, and one of the contributing factors is this.state.acceptWebSocket in a Durable Object. Removing it from our Durable Objects seems to stop the segfaults (but with this, who knows, honestly).

However, we have other projects that do not segfault. This leads us to believe it could be because of:

  • Script size
  • The number of bindings we use (available in the crash log)
  • The code of the Durable Object that powers the WebSocket (we call D1 before accepting the WebSocket, for example)

Here's the code of the Durable Object I'm primarily testing with:

export class MyDurableObject {
	env: CloudflareEnv;
	state: DurableObjectState;

	constructor(state: DurableObjectState, env: CloudflareEnv) {
		const e = envSchema.safeParse(env);
		if (!e.success) throw new Error(...);
		const parsedEnv = {
			...env,
			...e.data
		};
		this.env = parsedEnv;
		this.state = state;
	}

	async fetch(request: Request) {
		const ctx = getContext(request) // D1 calls, etc
		if (isWebSocket(request)) {
			console.log('GOT WEBSOCKET REQUEST');
			if (!ctx.user) return new Response('NOUSER', { status: 404 });

			if (this.state.id.toString() !== ctx.live.idFromName(ctx.user.id).toString()) {
				return new Response('', { status: 404 });
			}

			const webSocketPair = new WebSocketPair();
			const [client, server] = Object.values(webSocketPair);
			this.state.acceptWebSocket(server);

			return new Response(null, {
				status: 101,
				webSocket: client
			});
		}

		console.log('SENDING USER MESSAGE');

		try {
			const data = await request.json();
			for (const socket of this.state.getWebSockets()) {
				socket.send(JSON.stringify(data));
			}
			return new Response('Hello World');
		} catch (e) {
			console.error(e);
			return new Response('Unknown Error');
		}
	}

	async webSocketClose(ws: WebSocket, code: number, reason: string, wasClean: boolean) {
		console.log('CLOSED', { ws, code, reason, wasClean });
	}

	async webSocketError(ws: WebSocket, error: unknown) {
		console.error('ERROR', error);
	}
}

This has become such a frustrating issue for my team that we've sunk way too many hours into diagnosing it, and we're desperately looking for a solution. We hope this issue gets the attention it deserves.

thecatanon avatar Jan 05 '24 06:01 thecatanon

Hey @thecatanon! 👋 Thanks for trying to put together a minimal reproduction. Unfortunately, the stacks from the release builds aren't very useful for debugging these kinds of issues. I've compiled an M1 debug build with ASAN enabled from 8c47f13 for you to try: https://drive.google.com/file/d/1OVMhhzU9HltFLyde9T1sg74UNaNn9w1S/view?usp=sharing. To use this build, unzip the file, then run wrangler dev with the MINIFLARE_WORKERD_PATH environment variable set to the absolute path of the unzipped workerd-asan-8c47f13 binary. This build will be significantly slower, but hopefully if the segfault occurs, it will log exactly where the problem is. 🤞
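
For example, to point the restart script posted earlier at this build, the spawn call can set the variable directly. A minimal sketch (the download path here is a placeholder; adjust it to wherever the binary was unzipped):

import { spawn } from 'child_process'
import { resolve } from 'path'
import { homedir } from 'os'

// Hypothetical location of the unzipped debug binary -- replace with your own absolute path.
const workerdPath = resolve(homedir(), 'Downloads', 'workerd-asan-8c47f13')

const child = spawn('wrangler', ['dev', '--env', 'dev'], {
	env: {
		...process.env,
		// Miniflare reads this variable and launches the given workerd binary instead of its bundled one.
		MINIFLARE_WORKERD_PATH: workerdPath,
	},
	stdio: 'inherit', // stream wrangler's output straight to this terminal
})

child.on('exit', (code, signal) => {
	console.log(`wrangler exited with code ${code}, signal ${signal}`)
})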

mrbbot avatar Jan 05 '24 10:01 mrbbot

Thanks @mrbbot!

Unfortunately I have just a few (unhelpful) findings:

  • With that build, I receive Received signal #10: Bus error: 10 (with nothing after the "stack:" line) when writing an HTML response.

  • If I write an HTML response from the regular build (without the custom MINIFLARE_WORKERD_PATH), load the page, exit wrangler, and then restart with the debug build so the WebSocket connects to the debug build, it seems to never segfault.

Maybe that bus error is another manifestation of the same issue?

buserror.log

(Screenshot attached: CleanShot 2024-01-05 at 13 09 48@2x)

It actually will load the HTML, but crash immediately after. Could it be a caching thing?

EDIT: Just created my own ASAN build and received the same bus error.

thecatanon avatar Jan 05 '24 17:01 thecatanon

I seem to be able to reliably trigger the error when using the WebSocket Hibernation API: when a socket disconnects and the Durable Object is then evicted, the error is triggered. I'm only assuming it's the Durable Object eviction, though, since it occurs 10 seconds after the last WebSocket message, which I believe matches the Durable Object eviction timer.
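
To make that sequence concrete, here is a rough sketch of the client side of the trigger, as a plain Node script using the ws package. The port, path, and payload are placeholders, not the actual project setup:

import WebSocket from 'ws'

// Assumes wrangler dev is listening on the default port 8787 and that this route
// upgrades the request into a hibernatable WebSocket inside the Durable Object.
const socket = new WebSocket('ws://127.0.0.1:8787/connect')

socket.on('open', () => {
	socket.send('hello') // the last message the hibernatable socket will see
	setTimeout(() => socket.close(), 500) // then disconnect shortly afterwards
})

socket.on('close', () => {
	// Per the description above, the segfault shows up roughly 10 seconds later,
	// around the time the now-idle Durable Object would be evicted.
	console.log('client disconnected; watch the wrangler dev output for the next ~10 seconds')
})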

DylanRJohnston avatar Jan 13 '24 07:01 DylanRJohnston

@DylanRJohnston

Can you provide a repro? When do you see the segfault? Also are you on M1 mac as well?

MellowYarker avatar Jan 13 '24 15:01 MellowYarker

Yeah, that might be a bit tricky given I'm using Nix, Rust, and wasm-bindgen to build the project. I'm on an M2 MacBook Pro, but still Apple Silicon.

DylanRJohnston avatar Jan 16 '24 09:01 DylanRJohnston

Could this be related to #1535 / #1299?

thecatanon avatar Jan 17 '24 03:01 thecatanon

@thecatanon I thought it might be, which is why I tried closing out that bug, but I suspect it's unrelated. I don't see how that would have caused a segfault.

MellowYarker avatar Jan 17 '24 16:01 MellowYarker

@MellowYarker Yeah still seeing it in the latest release

Will attempt creating an ASAN build again (got a bus error whenever HTML responses were delivered last time)

thecatanon avatar Jan 17 '24 21:01 thecatanon

still seeing it in the latest release

@thecatanon are you referring to the latest release of workerd or Miniflare (which I think is still pointing to a previous version of workerd)? I'm not familiar with how Miniflare works, but it seems like earlier in this thread you were manually changing the dep to point to a specific version, so I figure you may have already changed it to point to the latest workerd?

MellowYarker avatar Jan 18 '24 00:01 MellowYarker

@MellowYarker Yeah, using the MINIFLARE_WORKERD_PATH environment variable (I assume wrangler uses Miniflare somehow under the hood, which in turn looks at that variable to figure out how to launch workerd).

thecatanon avatar Jan 18 '24 00:01 thecatanon

Additionally, I might be seeing this in production now (saw someone get a disconnected screen, went to reproduce with Real-Time logs enabled, saw it happen, and found "outcome": "canceled" as well as "outcome":"exception" on a different one)

Could be unrelated, but thought it was worth adding.

"Cancelled" CF-Ray: 8472bc7d8a5236c0 "Exception" CF-Ray: 8472d052a8b736fd

(Hope that helps - happy to share any other details privately on Discord, I'm in the CF Developers server as the.feline)

thecatanon avatar Jan 18 '24 00:01 thecatanon

For what it's worth, we're seeing this a lot in reflect.net too. I have not been able to get a reproduction but will keep trying.

aboodman avatar Jan 22 '24 20:01 aboodman

I started to see this only after switching to Hibernatable WebSockets. Pretty much nothing else changed, so this seems very much related. FWIW, we're using both regular and hibernatable sockets in the same DO:

workerd/io/worker-entrypoint.c++:340: info: exception = kj/compat/http.c++:4001: disconnected: other end of WebSocketPipe was destroyed
stack: 1028788cc 1016432d0 1010a8577 100e6d658 100e6da67 10287433f 100d3fee4
✘ [ERROR] *** Received signal #11: Segmentation fault: 11

  stack:


[InspectorProxyWorker] RUNTIME WEBSOCKET CLOSED 1006 WebSocket disconnected without sending Close frame.
workerd/util/symbolizer.c++:96: warning: Not symbolizing stack traces because $LLVM_SYMBOLIZER is not set. To symbolize stack traces, set $LLVM_SYMBOLIZER to the location of the llvm-symbolizer binary. When running tests under bazel, use `--test_env=LLVM_SYMBOLIZER=<path>`.

Every subsequent WebSocket connection is rejected with

workerd/jsg/util.c++:283: info: exception = kj/async-io-unix.c++:1524: disconnected: connect(): Connection refused
stack: 1021c5cfb 1021c60d7 10218833f 1021bebc4 1021bf4d0 1021b2bb8 1021b3000 100f7ba9b 100f7c38b 100f7da87 100f5c520 100f62490 100f784f0 10014f8cc 1007d0b7c

and for GET requests I see

workerd/io/worker.c++:1764: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
workerd/io/io-context.c++:359: info: uncaught exception; exception = workerd/jsg/_virtual_includes/jsg/workerd/jsg/value.h:1334: failed: jsg.Error: Network connection lost.
stack: 10078145c 10078223b 10218833f 100651f1c 1006524dc
workerd/io/worker.c++:1764: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
workerd/io/io-context.c++:359: info: uncaught exception; exception = workerd/jsg/_virtual_includes/jsg/workerd/jsg/value.h:1334: failed: jsg.Error: Network connection lost.
stack: 10078145c 10078223b 10218833f 100651f1c 1006524dc
✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

   {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

I have to restart wrangler dev to get it to work again.

codefrau avatar Feb 13 '24 01:02 codefrau

@codefrau do you have a repro we could take a look at? Are you also on mac?

MellowYarker avatar Feb 13 '24 17:02 MellowYarker

Just for tracking purposes (in case Cloudflare can't reproduce this on their end), still seeing this on v1.20240208.0.

Additionally, with the ASAN builds I create myself (I could absolutely be doing this wrong, I'm no C++ dev), when HTML is delivered I receive Error connecting. restart in the logs (as well as a crash). This is similar to the bus error described here: https://github.com/cloudflare/workerd/issues/1422#issuecomment-1878985744

thecatanon avatar Feb 13 '24 17:02 thecatanon

@codefrau do you have a repro we could take a look at? Are you also on mac?

The repo is not public unfortunately.

But yes, I'm on a Mac (2021 MacBook Pro with M1 Max, running Sonoma 14.2.1)

wrangler 3.28.1, node v20.11.0

codefrau avatar Feb 13 '24 18:02 codefrau

So this is curious: when it happens, the segfault always comes about 5 minutes after the last regular WebSocket disconnected and we sent the last message through the single remaining hibernatable socket. I looked through more logs than this, but it's always ~5 minutes, never less than 4, never more than 6. There is no 5-minute timeout in our code, and there was no other request. We only have an auto-response; the client pings every 30s.

Feb 13 17:52:14 last non-hibernateable websocket closed / last message sent on hibernatable
          +5:02
Feb 13 17:57:16 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 18:28:17 last non-hibernateable websocket closed / last message sent on hibernatable
          +5:13   
Feb 13 18:33:30 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 19:28:06 last non-hibernateable websocket closed / last message sent on hibernatable
          +4:56
Feb 13 19:33:02 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
...
Feb 13 20:41:13 last non-hibernateable websocket closed / last message sent on hibernatable
          +4:12
Feb 13 20:45:25 ✘ [ERROR] *** Received signal #11: Segmentation fault: 11
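
For reference, a minimal sketch of what an auto-response setup like the one mentioned above typically looks like in a hibernation-API Durable Object (the class name and the "ping"/"pong" payloads are placeholders, not the actual code from this project):

export class LiveObject {
	state: DurableObjectState

	constructor(state: DurableObjectState) {
		this.state = state
		// Ask the runtime to answer "ping" frames with "pong" automatically,
		// without waking the hibernating object.
		this.state.setWebSocketAutoResponse(new WebSocketRequestResponsePair('ping', 'pong'))
	}

	async fetch(request: Request) {
		const [client, server] = Object.values(new WebSocketPair())
		this.state.acceptWebSocket(server) // hibernatable accept
		return new Response(null, { status: 101, webSocket: client })
	}

	async webSocketMessage(ws: WebSocket, message: string | ArrayBuffer) {
		// Anything other than the auto-response request still wakes the object and lands here.
	}
}

With a pair like that registered, the 30-second client pings are answered without waking the object, which is consistent with the crash tracking an idle/eviction timer rather than any incoming request.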

codefrau avatar Feb 14 '24 05:02 codefrau

Upvoted, +1! This one is particularly annoying, as there doesn't seem to be a workaround, and it makes it impossible for us to unit-test waking up from hibernation.

I've been able to put together a minimal, reliable reproduction of this crash: https://github.com/nvie/cloudflare-repros/tree/segfault#readme

It uses a Durable Object with the WebSocket Hibernation API, and pretty much nothing else.

I'm on an M3 Max, on Sonoma 14.4.1.

Update from April 17, 2024: Still an issue, after updating to the latest versions of all Cloudflare packages.

nvie avatar Apr 05 '24 15:04 nvie

+1 on this issue; it's extremely irritating and makes development with wrangler on hibernatable sockets difficult.

matt384 avatar Apr 10 '24 22:04 matt384

@nvie thanks for the repro! I see a segfault on Linux too (very surprising, tbh).

It looks like it's segfaulting in the ActorContainerRef destructor. Not sure why and will need to look into it further. I'll give this a proper look over the next week or two.

MellowYarker avatar Apr 11 '24 00:04 MellowYarker

Let me know if this is still a problem after the next release goes out. It looks like we aren't putting one out this week, so I suspect it should go out next week.

Edit: Looks like a release has gone out now. Just need to wait for Miniflare to catch up.

MellowYarker avatar Apr 19 '24 14:04 MellowYarker

@nvie I'm running your repro with [email protected] and [email protected] and I'm not seeing a crash or segfault, can you confirm it's fixed for you too?

MellowYarker avatar Apr 24 '24 15:04 MellowYarker

Yes! This is amazing — thanks so much, @MellowYarker! 🙌

nvie avatar Apr 24 '24 19:04 nvie

Closing for now as this seems to have been resolved. Feel free to reopen if you're still experiencing a segfault after [email protected] and [email protected]. Thanks all for the helpful comments + @nvie for the reproduction!

MellowYarker avatar Apr 24 '24 19:04 MellowYarker