"httpz: quick shutdown" test fails
Zig version: 0.14.0 / 0.14.1
OS: Linux (Fedora Silverblue 42, x86_64)
Log:
================================================================================
panic running "httpz: quick shutdown"
================================================================================
thread 94307 panic: reached unreachable code
/var/home/chris/.local/share/zigup/0.14.1/files/lib/std/posix.zig:4063:18: 0x128492b in epoll_ctl (test)
.BADF => unreachable, // always a race condition if this happens
^
/var/home/chris/Projects/http.zig/src/worker.zig:1223:39: 0x1284a25 in monitorAccept (test)
return std.posix.epoll_ctl(self.fd, linux.EPOLL.CTL_ADD, fd, &event);
^
/var/home/chris/Projects/http.zig/src/worker.zig:529:36: 0x125de4d in run (test)
self.loop.monitorAccept(listener) catch |err| {
^
/var/home/chris/.local/share/zigup/0.14.1/files/lib/std/Thread.zig:488:13: 0x1239550 in callFn__anon_39030 (test)
@call(.auto, f, args);
^
/var/home/chris/.local/share/zigup/0.14.1/files/lib/std/Thread.zig:757:30: 0x1204d14 in entryFn (test)
return callFn(f, args_ptr.*);
^
???:?:?: 0x7f5d49c4a1d3 in ??? (libc.so.6)
Unwind information for `libc.so.6:0x7f5d49c4a1d3` was not available, trace may be incomplete
???:?:?: 0x7f5d49cccceb in ??? (libc.so.6)
test
└─ run test failure
error: the following command terminated unexpectedly:
/var/home/chris/Projects/http.zig/.zig-cache/o/24cbbc542d024027461df00f6f10fa09/test
Build Summary: 3/5 steps succeeded; 1 failed
test transitive failure
└─ run test failure
error: the following build command failed with exit code 1:
/var/home/chris/Projects/http.zig/.zig-cache/o/826d4dc5e59b1fd2058e052db7dc10a7/build /var/home/chris/.local/share/zigup/0.14.1/files/zig /var/home/chris/.local/share/zigup/0.14.1/files/lib /var/home/chris/Projects/http.zig /var/home/chris/Projects/http.zig/.zig-cache /var/home/chris/.cache/zig --seed 0x2fdfdb14 -Z6e76cf2baf12a802 test
Shutdown is tricky :( But I think it's pretty clear what's happening from that stack trace. I've tried to synchronize things, but yes, the listening socket can get closed before the monitor is set up on it, which would cause this.
I can't reproduce it though, so would you mind trying out this branch?
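For illustration, here is a minimal, self-contained Zig sketch of the failure mode (not httpz code): registering a file descriptor with epoll after it has already been closed triggers the same panic, because std.posix.epoll_ctl treats BADF as unreachable rather than returning an error. Note that running this will deliberately panic.

```zig
const std = @import("std");
const linux = std.os.linux;

pub fn main() !void {
    const epfd = try std.posix.epoll_create1(0);
    defer std.posix.close(epfd);

    const fds = try std.posix.pipe();
    defer std.posix.close(fds[1]);

    // Simulate the race: the fd is closed before the worker gets to
    // register it with epoll (in httpz, the test's shutdown path can
    // close the listener before monitorAccept runs).
    std.posix.close(fds[0]);

    var event = linux.epoll_event{
        .events = linux.EPOLL.IN,
        .data = .{ .fd = fds[0] },
    };
    // std.posix.epoll_ctl maps EBADF to `unreachable`, so this does not
    // return an error: it panics with "reached unreachable code", exactly
    // as in the trace above.
    try std.posix.epoll_ctl(epfd, linux.EPOLL.CTL_ADD, fds[0], &event);
}
```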
That test passes on that branch; I do get a new failure though:
debug(websocket): (127.0.0.1:50654) received text message
warning(websocket): (127.0.0.1:50654) error reading from connection: error.WouldBlock
================================================================================
"websocket: upgrade" - ConnectionResetByPeer
================================================================================
General protection exception (no address available)
???:?:?: 0x141ab30 in ??? ()
aborting due to recursive panic
test
└─ run test failure
error: the following command terminated unexpectedly:
/var/home/chris/Projects/http.zig/.zig-cache/o/11f0d4af1af6e403d475aefe8e0063d5/test
Build Summary: 3/5 steps succeeded; 1 failed
test transitive failure
└─ run test failure
error: the following build command failed with exit code 1:
/var/home/chris/Projects/http.zig/.zig-cache/o/826d4dc5e59b1fd2058e052db7dc10a7/build /var/home/chris/.local/share/zigup/0.14.1/files/zig /var/home/chris/.local/share/zigup/0.14.1/files/lib /var/home/chris/Projects/http.zig /var/home/chris/Projects/http.zig/.zig-cache /var/home/chris/.cache/zig --seed 0x616ecdfd -Z46c016b04d7385a4 test
Full log
❯ zig build test
info(websocket): received shutdown signal
httpz: quick shutdown (11.80ms)
httpz: invalid request (0.21ms)
httpz: invalid request path (0.09ms)
httpz: invalid header name (0.07ms)
httpz: invalid content length value (1) (0.08ms)
httpz: invalid content length value (2) (0.07ms)
httpz: body too big (0.12ms)
httpz: overflow content length (0.08ms)
httpz: no route (43.01ms)
httpz: no route with custom notFound handler (43.98ms)
warning: httpz: unhandled exception for request: /fail
Err: error.Failure
httpz: unhandled exception (42.96ms)
httpz: unhandled exception with custom error handler (43.01ms)
httpz: custom methods (216.00ms)
httpz: route params (42.96ms)
httpz: request and response headers (43.99ms)
httpz: content-length body (42.98ms)
httpz: json response (42.99ms)
httpz: query (42.98ms)
httpz: chunked (42.99ms)
httpz: route-specific dispatcher (42.98ms)
httpz: middlewares (44.48ms)
httpz: CORS (176.64ms)
httpz: router groups (262.34ms)
httpz: event stream (43.60ms)
httpz: event stream sync (43.02ms)
httpz: keepalive (86.88ms)
httpz: route data (43.51ms)
httpz: keepalive with explicit write (85.42ms)
httpz: request in chunks (52.98ms)
httpz: writer re-use (436.81ms)
httpz: custom dispatch without action context (44.15ms)
httpz: custom dispatch with action context (43.98ms)
httpz: custom handle (44.99ms)
httpz: request body reader (736.39ms)
websocket: invalid request (44.68ms)
debug(websocket): (127.0.0.1:50654) received text message
warning(websocket): (127.0.0.1:50654) error reading from connection: error.WouldBlock
================================================================================
"websocket: upgrade" - ConnectionResetByPeer
================================================================================
General protection exception (no address available)
???:?:?: 0x141ab30 in ??? ()
aborting due to recursive panic
test
└─ run test failure
error: the following command terminated unexpectedly:
/var/home/chris/Projects/http.zig/.zig-cache/o/11f0d4af1af6e403d475aefe8e0063d5/test
Build Summary: 3/5 steps succeeded; 1 failed
test transitive failure
└─ run test failure
error: the following build command failed with exit code 1:
/var/home/chris/Projects/http.zig/.zig-cache/o/826d4dc5e59b1fd2058e052db7dc10a7/build /var/home/chris/.local/share/zigup/0.14.1/files/zig /var/home/chris/.local/share/zigup/0.14.1/files/lib /var/home/chris/Projects/http.zig /var/home/chris/Projects/http.zig/.zig-cache /var/home/chris/.cache/zig --seed 0x616ecdfd -Z46c016b04d7385a4 test
"httpz: quick shutdown" test fails for me ~25% of the time, "websocket: upgrade" is infrequent but fails on the main branch as well. I'm using linux on x86_64 for reference.
Building with TSan shows a lot of data races (it should probably be enabled by default):
--- a/build.zig
+++ b/build.zig
@@ -24,6 +24,7 @@ pub fn build(b: *std.Build) !void {
{
const tests = b.addTest(.{
.root_source_file = b.path("src/httpz.zig"),
+ .sanitize_thread = true,
.target = target,
.optimize = optimize,
.test_runner = .{ .path = b.path("test_runner.zig"), .mode = .simple },
Here is one of them, for example:
WARNING: ThreadSanitizer: data race (pid=1244955)
Write of size 8 at 0x7f6f1c900098 by thread T210:
#0 __tsan_memset /home/a/.local/lib/zig/tsan/tsan_interceptors_memintrinsics.cpp:35 (test+0x4a79c9)
#1 heap.memory_pool.MemoryPoolExtra(server.server.HandlerConn(httpz.TestWebsocketHandler.WebsocketHandler),.{ .alignment = null, .growable = true }).destroy /home/a/.local/lib/zig/std/heap/memory_pool.zig:137 (test+0x2820fb)
#2 server.server.ConnManager(httpz.TestWebsocketHandler.WebsocketHandler,false).cleanup /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1217 (test+0x2461b2)
#3 server.server.NonBlockingBase(httpz.TestWebsocketHandler.WebsocketHandler,false).cleanupConn /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:772 (test+0x1ff307)
#4 server.server.Worker(httpz.TestWebsocketHandler.WebsocketHandler).cleanupConn /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:994 (test+0x1eb9ba)
#5 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).processWebsocketData /home/a/Projects/zig/http.zig/src/worker.zig:824 (test+0x33d178)
#6 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).processData /home/a/Projects/zig/http.zig/src/worker.zig:779 (test+0x31fd3b)
#7 thread_pool.Worker((function 'processData')).run.6 /home/a/Projects/zig/http.zig/src/thread_pool.zig:244 (test+0x2e4427)
#8 Thread.callFn__anon_40830 /home/a/.local/lib/zig/std/Thread.zig:488 (test+0x2b4adc)
#9 Thread.PosixThreadImpl.spawn__anon_37126.Instance.entryFn /home/a/.local/lib/zig/std/Thread.zig:757 (test+0x285ec4)
Previous atomic read of size 1 at 0x7f6f1c900098 by thread T209:
#0 server.server.Conn.isClosed /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1310 (test+0x32d363)
#1 server.server._handleClientData__anon_53040 /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1713 (test+0x374354)
#2 server.server.handleClientData__anon_52135 /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1582 (test+0x36270e)
#3 server.server.NonBlockingBase(httpz.TestWebsocketHandler.WebsocketHandler,false)._dataAvailable /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:752 (test+0x362502)
#4 server.server.NonBlockingBase(httpz.TestWebsocketHandler.WebsocketHandler,false).dataAvailable /home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:738 (test+0x3539af)
#5 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).processWebsocketData /home/a/Projects/zig/http.zig/src/worker.zig:821 (test+0x33d0ce)
#6 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).processData /home/a/Projects/zig/http.zig/src/worker.zig:779 (test+0x31fd3b)
#7 thread_pool.Worker((function 'processData')).run.6 /home/a/Projects/zig/http.zig/src/thread_pool.zig:244 (test+0x2e4427)
#8 Thread.callFn__anon_40830 /home/a/.local/lib/zig/std/Thread.zig:488 (test+0x2b4adc)
#9 Thread.PosixThreadImpl.spawn__anon_37126.Instance.entryFn /home/a/.local/lib/zig/std/Thread.zig:757 (test+0x285ec4)
Thread T210 (tid=1245166, running) created by thread T205 at:
#0 pthread_create /home/a/.local/lib/zig/tsan/tsan_interceptors_posix.cpp:1023 (test+0x4aa75d)
#1 Thread.PosixThreadImpl.spawn__anon_37126 /home/a/.local/lib/zig/std/Thread.zig:775 (test+0x2859b6)
#2 Thread.spawn__anon_33629 /home/a/.local/lib/zig/std/Thread.zig:421 (test+0x24c7e9)
#3 thread_pool.ThreadPool((function 'processData')).init.6 /home/a/Projects/zig/http.zig/src/thread_pool.zig:55 (test+0x24c1da)
#4 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).init /home/a/Projects/zig/http.zig/src/worker.zig:461 (test+0x24cec0)
#5 httpz.Server(httpz.TestWebsocketHandler).listen /home/a/Projects/zig/http.zig/src/httpz.zig:422 (test+0x24e650)
#6 Thread.callFn__anon_28973 /home/a/.local/lib/zig/std/Thread.zig:507 (test+0x203b8d)
#7 Thread.PosixThreadImpl.spawn__anon_21378.Instance.entryFn /home/a/.local/lib/zig/std/Thread.zig:757 (test+0x1ee8f4)
Thread T209 (tid=1245165, running) created by thread T205 at:
#0 pthread_create /home/a/.local/lib/zig/tsan/tsan_interceptors_posix.cpp:1023 (test+0x4aa75d)
#1 Thread.PosixThreadImpl.spawn__anon_37126 /home/a/.local/lib/zig/std/Thread.zig:775 (test+0x2859b6)
#2 Thread.spawn__anon_33629 /home/a/.local/lib/zig/std/Thread.zig:421 (test+0x24c7e9)
#3 thread_pool.ThreadPool((function 'processData')).init.6 /home/a/Projects/zig/http.zig/src/thread_pool.zig:55 (test+0x24c1da)
#4 worker.NonBlocking(*httpz.Server(httpz.TestWebsocketHandler),httpz.TestWebsocketHandler.WebsocketHandler).init /home/a/Projects/zig/http.zig/src/worker.zig:461 (test+0x24cec0)
#5 httpz.Server(httpz.TestWebsocketHandler).listen /home/a/Projects/zig/http.zig/src/httpz.zig:422 (test+0x24e650)
#6 Thread.callFn__anon_28973 /home/a/.local/lib/zig/std/Thread.zig:507 (test+0x203b8d)
#7 Thread.PosixThreadImpl.spawn__anon_21378.Instance.entryFn /home/a/.local/lib/zig/std/Thread.zig:757 (test+0x1ee8f4)
SUMMARY: ThreadSanitizer: data race /home/a/.local/lib/zig/tsan/tsan_interceptors_memintrinsics.cpp:35 in __tsan_memset
I'm also seeing failures; I'm willing to put together a pull request to address this.
See #136, which seemingly resolves this race:
$ zig build test -Dtest-filter="quick shutdown" -Dtsan=true
info(websocket): received shutdown signal
httpz: quick shutdown (54.20ms)
1 of 1 test passed
Slowest 1 test:
54.20ms httpz: quick shutdown
I'm a little hesitant here since I don't fully understand why it's OK to signal the semaphore only after self.loop.monitorAccept: the listener socket (which I believe is the resource being raced for) is used two other times in the run function.
Either way, I'm fairly confident in the fix, given that it resolves the tsan errors.
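To make my understanding concrete, here's a rough sketch of the ordering I believe the fix establishes (names like self.ready are illustrative, not httpz's actual fields):

```zig
// Hypothetical sketch, not httpz's actual code.
fn run(self: *Self, listener: std.posix.socket_t) void {
    self.loop.monitorAccept(listener) catch |err| {
        log.err("failed to monitor listening socket: {}", .{err});
        return;
    };
    // Signal "started" only once the listener is registered with epoll.
    // If this is posted any earlier, the test's shutdown path can close
    // the listener while epoll_ctl(CTL_ADD) is still in flight, and the
    // .BADF branch in std.posix.epoll_ctl is unreachable -> panic.
    self.ready.post();

    // The rest of run() keeps using `listener`, but by now the shutdown
    // path knows the worker is live and won't close the socket until
    // shutdown is actually requested.
}
```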
I got another "websocket: upgrade" error (after pulling the latest commit); the stack trace is a little different from the same error above:
websocket: invalid request (44.70ms)
debug(websocket): (127.0.0.1:59402) received text message
warning(websocket): (127.0.0.1:59402) error reading from connection: error.WouldBlock
General protection exception (no address available)
================================================================================
"websocket: upgrade" - ConnectionResetByPeer
================================================================================
/home/a/.local/lib/zig/compiler_rt/memset.zig:19:14: 0x1415630 in memset (compiler_rt)
d[0] = c;
^
/home/a/.local/lib/zig/std/mem/Allocator.zig:417:26: 0x1170048 in free__anon_4360 (test)
@memset(non_const_ptr[0..bytes_len], undefined);
^
/home/a/.local/lib/zig/std/array_list.zig:73:36: 0x1220f97 in deinit (test)
self.allocator.free(self.allocatedSlice());
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/proto.zig:498:24: 0x1243a03 in deinit (test)
self.buf.deinit();
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/proto.zig:375:25: 0x12e6c62 in done (test)
f.deinit();
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1617:26: 0x12ee73c in _handleClientData__anon_53119 (test)
defer reader.done(message_type);
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:1582:29: 0x12df056 in handleClientData__anon_52214 (test)
return _handleClientData(H, hc, allocator, fba) catch |err| {
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:752:40: 0x12deed7 in _dataAvailable (test)
const ok = handleClientData(H, hc, self.allocator, &fba);
^
/home/a/.cache/zig/p/websocket-0.1.0-ZPISdXNIAwCXG7oHBj4zc1CfmZcDeyR6hfTEOo8_YI4r/src/server/server.zig:738:39: 0x12d3490 in dataAvailable (test)
return self._dataAvailable(hc, thread_buf) catch |err| {
^
/home/a/Projects/zig/http.zig/src/worker.zig:825:64: 0x12c0e9b in processWebsocketData (test)
const success = self.websocket.worker.dataAvailable(hc, thread_buf);
^
/home/a/Projects/zig/http.zig/src/worker.zig:783:61: 0x12a9c7a in processData (test)
.websocket => |hc| self.processWebsocketData(conn, thread_buf, hc),
^
/home/a/Projects/zig/http.zig/src/thread_pool.zig:244:17: 0x12779ac in run (test)
@call(.auto, F, full_args);
^
/home/a/.local/lib/zig/std/Thread.zig:488:13: 0x124dc5a in callFn__anon_40899 (test)
@call(.auto, f, args);
^
/home/a/.local/lib/zig/std/Thread.zig:757:30: 0x1224684 in entryFn (test)
return callFn(f, args_ptr.*);
^
???:?:?: 0x7f4ae28c27ea in ??? (libc.so.6)
Unwind information for `libc.so.6:0x7f4ae28c27ea` was not available, trace may be incomplete
???:?:?: 0x7f4ae294618b in ??? (libc.so.6)
thread 123266 panic: reached unreachable code
/home/a/.local/lib/zig/std/posix.zig:880:27: 0x11a81d0 in read (test)
.CONNRESET => return error.ConnectionResetByPeer,
^
/home/a/.local/lib/zig/std/net.zig:1864:9: 0x1287d00 in read (test)
return posix.read(self.handle, buffer);
^
/home/a/Projects/zig/http.zig/src/httpz.zig:1477:21: 0x136517b in test.websocket: upgrade (test)
else => return err,
^
/home/a/.local/lib/zig/std/debug.zig:550:14: 0x118f37d in assert (test)
if (!ok) unreachable; // assertion failure
^
/home/a/.local/lib/zig/std/hash_map.zig:872:19: 0x123ee71 in putAssumeCapacityNoClobberContext (test)
assert(!self.containsContext(key, ctx));
^
/home/a/.local/lib/zig/std/hash_map.zig:848:51: 0x1207b8d in putNoClobberContext (test)
self.putAssumeCapacityNoClobberContext(key, value, ctx);
^
/home/a/.local/lib/zig/std/hash_map.zig:328:54: 0x11d185e in putNoClobber (test)
return self.unmanaged.putNoClobberContext(self.allocator, key, value, self.ctx);
^
/home/a/.local/lib/zig/std/debug/SelfInfo.zig:486:38: 0x11d141c in lookupModuleDl (test)
try self.address_map.putNoClobber(ctx.base_address, obj_di);
^
/home/a/.local/lib/zig/std/debug/SelfInfo.zig:133:35: 0x11d18fd in getModuleForAddress (test)
return self.lookupModuleDl(address);
^
/home/a/.local/lib/zig/std/debug.zig:1092:50: 0x11dafe2 in printSourceAtAddress__anon_31593 (test)
const module = debug_info.getModuleForAddress(address) catch |err| switch (err) {
^
/home/a/.local/lib/zig/std/debug.zig:736:33: 0x11f7e0d in writeStackTrace__anon_33965 (test)
try printSourceAtAddress(debug_info, out_stream, return_address - 1, tty_config);
^
/home/a/.local/lib/zig/std/debug.zig:529:24: 0x11b5372 in dumpStackTrace (test)
writeStackTrace(stack_trace, stderr, debug_info, io.tty.detectConfig(io.getStdErr())) catch |err| {
^
/home/a/Projects/zig/http.zig/test_runner.zig:99:45: 0x137fe86 in main (test)
std.debug.dumpStackTrace(trace.*);
^
/home/a/.local/lib/zig/std/start.zig:660:37: 0x1382fa7 in main (test)
const result = root.main() catch |err| {
^
???:?:?: 0x7f4ae28546b4 in ??? (libc.so.6)
Unwind information for `libc.so.6:0x7f4ae28546b4` was not available, trace may be incomplete
???:?:?: 0x7f4ae2854768 in ??? (libc.so.6)
???:?:?: 0x116f284 in ??? (???)
test
└─ run test failure
error: the following command terminated unexpectedly:
/home/a/Projects/zig/http.zig/.zig-cache/o/6a33db4328f6f9a75dcbf75c43a58922/test
Build Summary: 3/5 steps succeeded; 1 failed
test transitive failure
└─ run test failure
error: the following build command failed with exit code 1:
/home/a/Projects/zig/http.zig/.zig-cache/o/bb266f633544e75bc14a570e34777270/build /home/a/.local/bin/zig /home/a/.local/lib/zig /home/a/Projects/zig/http.zig /home/a/Projects/zig/http.zig/.zig-cache /home/a/.cache/zig --seed 0xced1cfc3 -Z4cd718503626a8ad test
Edit: I am still seeing data races when running tests with -Dtsan
@ItsMeSamey I think this bug is specifically tracking the failures in the quick shutdown test: there's also #125 to track other race conditions, so perhaps we should move the conversation there?
I see you edited your comment to say you're still getting tsan failures in the quick shutdown test: can you post the full test command and output?
I tested my fix with -Dtest-filter="quick shutdown" to ensure I was only running that one test, and I was unable to reproduce the race after the fix.
I don't have permission to, but feel free to close this issue in favour of #125.
I can duplicate this stack trace.
Zig 0.15 latest (889), SSE connections. Dies on stream.close(). Linux only (epoll); Mac and BSD are fine.
I suspect it's a double close on the fd: during disown() it calls posix.close(), then later, when epoll tries CTL_DEL during remove(), it hits BADF.
If I comment out the epoll call in remove(), it works perfectly, and I can see that the fds are correctly recycled on subsequent requests. This is not a fix; it's proof of a double close via a dirty hack.
Still investigating.
To duplicate: build this (on Linux) with the latest Zig, run example_1, and hit any action to see the crash: https://github.com/zigster64/datastar.http.zig
Note that there is no problem on Mac.
That's an issue, but I think it'll be specific to SSE connections (or anything else that "disowns" the connection from httpz).
I don't think the issue is that it calls close on disown (it doesn't?). I think the issue is that your defer stream.close(); (which you should be doing) is happening before httpz gets around to removing the socket from epoll. It'll have to be removed more aggressively, likely from the worker thread itself (thankfully, epoll_ctl is thread safe), before startEventStreamSync returns.
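Roughly, the ordering would have to look something like this (an illustrative sketch, not httpz's actual code; names like disownForSse and self.epoll_fd are made up):

```zig
// Hypothetical sketch: remove the fd from epoll *before* ownership of
// the stream transfers to the application, so the app's
// `defer stream.close()` can never race a later EPOLL_CTL_DEL on a
// closed (and possibly already-recycled) fd.
fn disownForSse(self: *Worker, conn: *Conn) !void {
    // epoll_ctl is thread safe, so this can be called from the worker
    // thread directly rather than deferred to the event loop.
    try std.posix.epoll_ctl(self.epoll_fd, linux.EPOLL.CTL_DEL, conn.fd, null);
    // From here on the application owns the fd and is responsible for
    // closing it; httpz must not touch it again.
}
```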
Nice analysis... yep, I'll study the flow around shutting down that SSE connection.
Keep in mind that kqueue works just fine with the exact same flow.
@zigster64 I tried your project with the new sync_disown branch and it seems to work.
Works perfectly, and still works on Mac + BSD.
Great stuff! That was a tricky one.