
Proposal: Event loop redesign

Open kprotty opened this issue 3 years ago • 44 comments

The current event loop is not ready yet (relatively slow, Windows support unfinished, bugs/races) and many wish it were. Judging from the Discord communities at least, there seems to be enough interest to warrant addressing this, with some people interested in helping out. My question then is: what do people really want out of the standard library's event loop? Toying with some designs locally produced incomplete implementations for lack of a direction. In addition to what people want from an event loop, how do they also want it implemented? Here are some example properties:

  • Multi-threaded by default?
    • Threading is customizable (NUMA, affinities, priorities)?
    • Work-stealing or work-requesting scheduler? configurable?
  • Globally configured?
    • Other standard library stuff subscribing to the event loop automatically/case-by-case?
    • Multiple event loops in the same program?
  • Performance goals?
    • Prioritize simple implementation over absolute efficiency?
    • What sort of sacrifices should be made here to accommodate other metrics?
    • Is there a definite goal it should strive to maintain (e.g. faster than X in Y)?
  • What features should it provide API-wise and what would those look like?
    • Ability to lock execution to a specific OS thread?
    • Influence scheduling (hard/soft priorities, QoS classes, fairness)?
    • Scheduling patterns (batch, switchto, scheduling cancellation)?
  • What should sync primitives (i.e. Mutex, Channel) and event sources (i.e. I/O, timers) look like?
    • Support cancellation?
    • Are they allowed to dynamically allocate (i.e. heap alloc) internally?
    • Require user-provided, intrusive memory to operate?
  • How async should it be?
    • Usable as a C api?
    • Usable without zig features (i.e. comptime, threadlocal, async) for easier bootstrap?
  • Should we even have an event loop at all?
    • Only provide thread pool / io-poller / timer datastructures separately?
    • GCD/libdispatch approach? (user creates distinct/specific work-queues and pushes work into them per app)
    • Global thread-pool/executor approach? (think windows ThreadPool or windows UMS)

I'm open to adding more considerations if others have some. Just want to understand the requirements before making any sort of PR.

kprotty avatar Mar 12 '21 16:03 kprotty

Assuming we continue with the idea of having an "event loop" abstraction, I would like to see from it some form of consistency across different OSs, at least when it comes to the big 3.

More specifically, I believe one important goal should be to ensure, as much as possible, that if an application doesn't deadlock on OS Y it also doesn't deadlock on OS X (hah pun kinda intended). I am not sure what the current state of things is, but one example that comes to mind is file I/O, which only simulates evented I/O by using a thread pool.

Somewhat related to the above goal, there should be an API (or a clear pattern) for dealing with blocking I/O (and CPU-intensive operations). As a concrete example, in bork I used to steal one thread from the event loop when using BearSSL, and, depending on the CPU configuration of the host (and other implementation details of the event loop that fortunately didn't pose a problem in this case), it could simply deadlock.

If instead the event loop gets broken down into discrete components, then this question IMO still stands and would need to be re-adapted to the new context.

kristoff-it avatar Mar 12 '21 17:03 kristoff-it

I want one that prints "bruh" asynchronously

haze avatar Mar 15 '21 02:03 haze

I'll put my high-level thoughts down in points here since my thoughts are a little all over the place when it comes to Zig's async story. The perspective I'm coming from with my thoughts here is after spending the last few months implementing a few resource-constrained high-performance distributed systems and async-driven sensor applications in Zig, and spending the last month researching how async works in other programming languages.


  1. It's really, really hard to design for a general-purpose, all-in-one event loop that everyone can be happy with. I don't think we should aim towards such an amalgamation, and that we should instead aim towards creating production-ready modules (I/O reactors, I/O proactors, thread pool executors, synchronization primitives) that a developer may then combine into an event loop design that they are happy with.

I personally have a lot of frustrations over some of the design decisions in event loop amalgamations featured in the libraries and runtimes for Go/Rust/Nim/Pony/etc. At the very least, aiming towards modularizing the event loop is work that can be done with Zig today that would undoubtedly be reusable should the time come that we eventually come to a consensus on an all-in-one event loop design.

  2. There should be a focus in the standard library on supporting the lowest-common-denominator subset of cross-platform features that can be driven asynchronously by an I/O poller. To paint a picture, there should be a single io module in the standard library that presents first-class async support for sockets, timers, pipes, events (i.e. eventfd in Linux), etc.

Given that async file I/O has only recently been made available in Linux via io_uring, the io module should only contain support for blocking file I/O. An experimental module could optionally be included in the standard library which provides Linux support for async file I/O.

There should also be cross-platform support for async Ctrl+C handling in io, with a separate, optional experimental module for async signal handling support.

  3. I feel that the cross-platform abstraction for I/O pollers in the Zig standard library should follow the proactor pattern, and that all async I/O modules should have completion-based (callback-based) APIs. This would allow for easy C interoperability (pass a callback), and also only requires minimal boilerplate for being driven by async/await in Zig by wrapping the callback into an async frame. A number of battle-tested cross-platform async I/O libraries like libuv, libevent, and asio follow the proactor pattern as well.

This comes from my experience that abstracting a proactor-based I/O poller (e.g. IOCP) to follow a reactor-based design requires quite a bit of hackery and workarounds vs. abstracting a reactor-based I/O poller to follow a proactor-based design.

Having such a design where either a callback may be provided, or an async frame may be created to wrap under a callback, allows for users to have flexibility in how they want to allocate the memory necessary to asynchronously drive an I/O primitive (i.e. on the heap, or on the stack).
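To make the callback-plus-frame idea concrete, here is a minimal sketch of what such an API could look like. All of the names (Loop, Completion, submitRead, ReadOp) are hypothetical rather than existing std APIs, and the sketch assumes the loop invokes completion callbacks asynchronously, after the submitter has suspended:

```zig
const std = @import("std");

// Intrusive, caller-owned completion: the loop never heap-allocates to track an operation.
const Completion = struct {
    next: ?*Completion = null, // reserved for the loop's internal queue
    callback: fn (completion: *Completion, result: Result) void,

    const Result = anyerror!usize;
};

// Hypothetical proactor-style loop; submitRead stands in for an SQE/IOCP submission.
const Loop = struct {
    pub fn submitRead(self: *Loop, completion: *Completion, fd: std.os.fd_t, buf: []u8) void {
        _ = self;
        _ = completion;
        _ = fd;
        _ = buf;
        // Stub: a real loop would enqueue the operation here and invoke
        // completion.callback once the result is available.
    }
};

// Adapting the callback to async/await (pre-0.11 async syntax): the frame wraps the callback.
const ReadOp = struct {
    completion: Completion,
    frame: anyframe = undefined,
    result: Completion.Result = undefined,

    fn onComplete(completion: *Completion, result: Completion.Result) void {
        const op = @fieldParentPtr(ReadOp, "completion", completion);
        op.result = result;
        resume op.frame;
    }
};

fn read(loop: *Loop, fd: std.os.fd_t, buf: []u8) Completion.Result {
    var op = ReadOp{ .completion = .{ .callback = ReadOp.onComplete } };
    suspend {
        op.frame = @frame();
        loop.submitRead(&op.completion, fd, buf);
    }
    return op.result;
}
```

Callers that want plain callbacks use Completion directly (also usable from C), while callers that prefer async/await call read and choose where its frame lives (stack, heap, or some other user-provided memory).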

  4. All async I/O and synchronization primitives should support cancellation and composition by design. The memory and CPU overhead of supporting cancellation and composition is negligible for most applications.

Either way, with a production-ready I/O poller and thread pool abstraction, it is most likely the case that you wouldn't settle for easy-to-use, general-purpose implementations of async I/O or synchronization primitives if you are in a situation where the small amount of overhead from supporting cancellation and compositional operations matters to you.

  5. Much like the rest of the standard library, I feel that allocations should always be left up to the user. Therefore, where possible, memory should be intrusive.

  6. The thread pool should support being configured with callbacks that are to be run in certain places (e.g. when a thread is spawned). A common use case for this is e.g. initializing thread-local state (tracy encourages users to have all user-spawned threads marked with names via its C API), and logging events.

  7. I personally dislike global flags like the io_mode flag, and would rather have individual modules be configured to operate in a blocking/non-blocking manner. Hence, unlike status-quo Zig today, my suggestion is that the instantiation of an I/O poller, thread pool, and more should be made explicit in a user's program.

  8. Apart from the overhead of supporting cancellation and async chaining operations (i.e. waiting for all ops to finish, racing for either one of many ops to finish, etc.), all modules should be optimized for performance. For something as low-level as asynchronous I/O and thread pooling, performance and a good user experience in coordinating async operations should be the only two things people really care about when it comes to a standard library.


Regarding a few of the points above, I didn't dig that deep into the specifics as I'd love to see other people's thoughts and feedback first, given my one-sided point of view (e.g. what module-level configuration should look like, how async operations facilitated by the standard library are to be composed together, how I/O poller-related state should be separated and contained, etc.).

That being said, let me know if you spotted any glaring holes in my thoughts, or if you'd like me to explain more of my rationale on any one of the points above.

lithdew avatar Mar 16 '21 19:03 lithdew

Any further opinions on this?

kprotty avatar May 20 '21 22:05 kprotty

My question then is: what do people really want out of the standard library's event loop?

Thanks @kprotty for setting this up.

Multi-threaded by default?

No, and yes, depending on the plane: control plane or data plane. Having a clear distinction between these helps to inform the design decisions for each separately because the safety and performance profiles are different.

For example, given Zig's approach to safety and performance, I would really want a single-threaded control plane, with this single-threaded control plane outsourcing to the kernel thread pool for I/O through io_uring (or kqueue or IOCP on other platforms) to reduce the performance cost of syscalls, context switches and locks, and to side-step, or at least shrink, the scope for data races for the most part. Backwards compatibility for older Linux platforms can be shimmed with an I/O thread pool, but io_uring should drive the design because it reflects where hardware performance is at today.

Influence scheduling (hard/soft priorities, QoS classes, fairness)?

QoS is important. In addition to the single-threaded control plane (with the kernel threadpool for I/O) I really want additional multi-threaded data plane(s) for CPU-intensive work [xor] long-running work, with clear QoS, i.e. one data plane for user-defined CPU-intensive tasks (e.g. bulk crypto like checksumming 256 MiB chunks, or erasure coding, or deduplicating them), and another separate data plane for long-running background tasks (stuff like DNS requests that are not CPU-intensive, that could take minutes and that could be context-switched without a performance hit).

So the data plane for CPU-intensive tasks should be sized according to the number of cores to reduce context switching. On the other hand, the data plane for long-running background tasks would be okay with context switching (being on a different order compared to the task time), and would be sized larger than the number of cores to support concurrency.
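A minimal sketch of that sizing rule, assuming std.Thread.getCpuCount() for the core count; the 4x multiplier for the background plane is just an illustrative placeholder:

```zig
const std = @import("std");

// One thread per core for the CPU-bound data plane, oversubscription for the
// long-running, blocking-tolerant background plane.
pub fn planeSizes() !struct { cpu_plane: usize, background_plane: usize } {
    const cores = try std.Thread.getCpuCount();
    return .{
        .cpu_plane = cores,
        .background_plane = cores * 4,
    };
}
```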

Support cancellation?

This could solve a lot of pain for people. Cancellation can also be racy and difficult to get right. Then again, it's hard at every layer and solving this lower down can help higher up. My gut feel is it's hard to do well. On the other hand, there could be cancellation for everything, but with a completion event for when the cancellation succeeds, so that this could shim over underlying abstractions that don't support cancellation while not holding back the abstractions that do.

Are they allowed to dynamically allocate (i.e. heap alloc) internally?

No. Although there might be a good argument to be made for something like registered buffers, with a user-provided allocator of course. The focus overall, though, should be on zero-copy into user-provided buffers. For TigerBeetle, everything is static allocation (even disk) and zero-copy. We would never use something that allocated dynamically, unless this was at initialization, where we have some kind of control over the parameters, and again for something like registered buffers. Allocation for stuff like internal queues etc. is also okay, but it needs to be subject to the user's restrictions and happen at initialization only.

How async should it be?

Zig's async is awesome of course. However, here's something surprising from our experience. We tried out Zig's async plus io_uring for TigerBeetle (https://github.com/coilhq/tigerbeetle/blob/main/src/io_async.zig) and then actually went back to explicit callbacks plus io_uring in the end (https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig). The reason is that we were doing this for a distributed consensus protocol where we wanted to make sure our message handlers run to completion with the same system state, whereas coroutines increase dimensionality while a function is still running. We wanted clear jumping off points to I/O just because getting the consensus protocol right was hard enough. This is specific to our use-case for TigerBeetle only, and it might not be relevant here, but I wanted to share the anecdote in case it helps.

Should we even have an event loop at all?

The money question. We could start with the primitives. For example, @ifreund and I worked hard to get an event loop primitive going for io_uring that could mimic epoll's "poll for some amount of time": https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig#L52-L83

io_uring

io_uring from the 5.6 kernel onward has almost everything we need for the data-plane-intensive aspects of storage and network I/O, and the 5.8 kernel gains even more support. io_uring is a beast and it's only accelerating from here. It's something that's hard to ignore.

While we could plan according to current support for the 5.8 Linux kernel right now, perhaps a better approach would be to plot a course taking Zig's 1.0.0 trajectory into account.

With that in mind, I really want to see the IO component of the event loop taking full advantage of io_uring as a first-class citizen, because that's where the puck is already at, just not evenly distributed.

NVME devices have become so fast that the cost of a context switch is on the same order as an I/O operation, so io_uring makes sense to support the next generation of lockless thread-per-core designs, which is what we're doing for TigerBeetle to support a million financial transactions a second.

It's also a perfect time for Zig to take advantage of this shift in API design. I would really want to start with io_uring and work backwards rather than the other way round. A worker threadpool or traditional approaches to event loops should be last-resort, not something driving our design.

The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it.

jorangreef avatar May 25 '21 08:05 jorangreef

Thanks @kprotty for setting this up.

Thanks @jorangreef for the detailed response! This is an area where I strongly believe Zig should make some good decisions after having interacted with other async/IO environments, so any input/ideas are welcome.

control plane or data plane

Could you elaborate on the differences between the two?

In regards to performance, particularly throughput, it sounds difficult to reach its optimum if there's only one thread which handles control flow (if this is what you're implying). Having multiple threads performing application logic without locking sounds ideal for throughput in the absence of effective static work distribution.

I do agree that io_uring should somewhat drive the design for the I/O subsection of the Event Loop. I say "somewhat" as SQEs are allocated internally on flush (at least currently) and it may not be what is desired for io_uring compatibility layers.

My gut feel is it's hard to do well

This has been my experience as well

Allocation for stuff like internal queues

This was my primary intent when posing the question. While I have working thread pools which only utilize intrusive memory, I have been interested in whether using dynamically growing but contiguous task queues would help with latency. This sort of memory allocation sounds contrary to your idea though.

We wanted clear jumping off points to I/O

By "clear", would this be a reference to a function call doing I/O without the callers awareness? or some other scenario?

an event loop primitive going for io_uring

This is pretty neat! Like the idea of intrusive SQE/CQE structs and the flushing/polling strategy. This would solve the "io_uring SQE allocation" issue at the abstraction level.

io_uring is a beast and it's only accelerating from here.

This was my initial experience too. Although, having micro-benchmarked it against epoll, the decrease in syscalls and the use of the kernel thread pool, which sound more efficient on the surface, ended up being slower. Do you (or anyone else really) have any ideas why this might be the case?

I really want to see the IO component of the event loop taking full advantage of io_uring as a first-class citizen,

Definitely. Although it's a bit rough trying to envision how this would manifest itself higher-level API-wise. Have personally been looking into libdispatch for scheduling and queuing insight/inspirations recently.

kprotty avatar May 25 '21 22:05 kprotty

Could you elaborate on the differences between the two?

Sure! In the abstract, the control plane is just anything that's not in the critical data path, it's safety critical but not performance critical, whereas the data plane has the inverse profile. For example, it would be fine to have (and one would want) plenty of assertions in the control plane for safety, whereas the data plane is in the critical path with huge volumes flowing through it. Here in the data plane one would want to optimize for performance: cache misses, context switches, branch mispredicts etc. The data plane would be like a water mains or oil pipeline, there's nothing inside to obstruct flow. The control plane would be all the safety checks you do outside the pipeline as an operator, the little control box that adjusts pressure and controls the pipeline. Having this clean split in the design provides both safety and performance without compromising either. It also obviates in large part the need for a borrow checker, although data races are still possible.

As a concrete example of this technique, in TigerBeetle, we have around 10,000 financial transactions in a batch. Our single-threaded control plane is responsible for switching each of these batches through the consensus protocol, and we amortize all runtime bounds checks, assertions, syscalls and I/O across the batch, so that these become almost free, yet we have literally hundreds of assertions, and we're doing O_DSYNC on every write etc.

In regards to performance, particularly throughput, it sounds difficult to reach its optimum if there's only one thread which handles control flow (if this is what you're implying). Having multiple threads performing application logic without locking sounds ideal for throughput in the absence of effective static work distribution.

I think there's also a third option. For example, in TigerBeetle, if we want to do crypto for a batch (we have a use-case for this for Interledger), we would then drop that into a multi-threaded data plane with QoS for CPU-intensive work, by dividing the batch up across threads. We wouldn't want to solve that kind of performance problem with a multi-threaded control plane, because that would introduce too much complexity into a safety-critical domain, and also be conflating the control plane and data plane. In terms of this approach, the performance-critical work should happen in the multi-threaded data plane because that's been optimized for it. If we do find any scalability limit, then we would simply move to multiple event loop abstractions per domain, in our case across networking, storage, and state machine, all connected by ring buffers, and all the control planes still single-threaded.

This is also not new to TigerBeetle, we based this design on the work done by LMAX, a high-performance trading platform, which has single-threaded control planes for all event loops, and which has multiple event loops connected by ring buffers. They found the single-threaded control plane is also faster because it lets the CPU run unimpeded by context switches, like a sprinter doing the 100m in a straight line without zigzags. Martin Thompson calls this "mechanical sympathy" or "the single writer principle" (great blog post). He's done some awesome talks on this that we've documented here: https://github.com/coilhq/tigerbeetle/blob/main/docs/DESIGN.md#references

Redpanda is another new database, from people who worked on ScyllaDB, also doing thread-per-core.

Allocation for stuff like internal queues

This was my primary intent when posing the question

Just speaking from our point of view, it would be totally fine to allocate this at initialization, which coincidentally is also how io_uring does it for the most part. Contiguous task queues would definitely be great for latency, to reduce cache misses and avoid pointer-chasing. We would really like fixed-size bounded queues with an error for overflow and back pressure (in our IO code we still need to set our upper bounds for SQ overflow; we're yet to profile some of our upper bounds). Explicit limiting is a good thing, especially for any kind of distributed system. This is probably also an area in event loops where developers have been used to "hidden allocation" and unbounded IO depth, but I think we could be more explicit about this.
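As a small illustration of the "fixed-size, bounded, error-on-overflow" shape described above (a sketch, not an existing std type):

```zig
fn BoundedQueue(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();

        items: [capacity]T = undefined,
        head: usize = 0,
        len: usize = 0,

        // Back pressure is explicit: the caller decides what to do on overflow.
        pub fn push(self: *Self, item: T) error{Overflow}!void {
            if (self.len == capacity) return error.Overflow;
            self.items[(self.head + self.len) % capacity] = item;
            self.len += 1;
        }

        pub fn pop(self: *Self) ?T {
            if (self.len == 0) return null;
            const item = self.items[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            return item;
        }
    };
}
```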

By "clear", would this be a reference to a function call doing I/O without the callers awareness? or some other scenario?

Yes, this would be where the function call ends up doing I/O that suspends so that when the caller is resumed, the rest of the system has moved on to a new state. For example, in the Viewstamped Replication consensus protocol that we use, it's critical that distributed messages are only processed when a replica process' status is .normal and where the replica process' view number (or term or epoch as with other consensus protocols) is exactly the same. We wouldn't want state like this changing during the lifetime of a function. Again, these examples are specific to TigerBeetle and Zig's async/await still has plenty of good use cases. Our team just decided on the explicit pain of colored function call sites for safety. Here's an example of a footgun, just for us, where the use of async I/O made this kind of bug much more likely (we're still weaning this code off async and onto callbacks): https://github.com/coilhq/tigerbeetle/blob/main/src/vr/journal.zig#L522-L536

This is pretty neat! Like the idea of intrusive SQE/CQE structs and the flushing/polling strategy. This would solve the "io_uring SQE allocation" issue at the abstraction level.

Thanks! We learned a lot from the event loop gists you did!

Although, having micro-benchmarked it against epoll, the decrease in syscalls and the use of the kernel thread pool, which sound more efficient on the surface, ended up being slower. Do you (or anyone else really) have any ideas why this might be the case?

Yes, there was a known networking regression in 5.7.16 that has since been patched and @axboe has also been optimizing against epoll significantly since then so that io_uring should be ahead on most benchmarks. In terms of storage, io_uring performance on NVME can be more than double compared to blocking syscalls and more than quadruple compared to an I/O threadpool like libuv. Overall, what excites me is that io_uring offers a single unified interface for networking and storage and potentially more and more syscalls down the line. There are some rough edges, like what we had to do to implement IO.run_for_ns() but there is already better support for that in io_uring_enter() in newer kernels. It's also early days and there are incremental improvements all the time.

Have personally been looking into libdispatch for scheduling and queuing insight/inspirations recently.

Thanks for the link! I'm definitely looking up to you when it comes to anything like this, and look forward to seeing your work on this.

jorangreef avatar May 26 '21 08:05 jorangreef

Just an observation from studying the Zig language: when I saw that all memory-allocating functions actually take an allocator as a parameter, I expected that IO functions would take an event loop as a parameter, or at least that one could switch seamlessly between different implementations or instances of the main loop. As with allocators, users might have different needs, so I'd expect the standard library to play in concert with many possible implementations. Also, given how minimal Zig tries to be, I was really surprised by the fact that making a program async suddenly starts a number of threads by itself, which clashes with the "no hidden control flow" slogan on the homepage.

vlada-dudr avatar Jun 17 '21 10:06 vlada-dudr

IO functions would take an event loop

@vlada-dudr The goal of the discussion is to shed more light on the "hows" and "whys" of asynchronous computing. This also leads to the position that an "event loop" may not even be the right abstraction. Not all IO can be done asynchronously (or at least efficiently with async), and IO done with and without an "event loop" could have very different semantics. For example, async wouldn't work with the virtual dispatch abstraction seen in std.mem.Allocator, and "async" IO may not be able to just operate on raw, non-owned file descriptors/handles the way the blocking APIs do.

I was really surprised by the fact that making a program async suddenly starts a number of threads by itself

Zig is minimal, but the stdlib is there to do certain things for you. This includes setting up custom handles, looking up vdso functions, discovering tls, etc. Spawning threads could be considered another thing it does to help do other stuff (simplify IO async frame scheduling) for you.

which clashes with the "no hidden control flow"

I don't think so because the creation of threads doesn't disrupt the control flow of your program. If you use a stdlib function which schedules to said threadpool, it could switch threads between suspends. I wouldn't necessarily classify this as "hidden" given that suspend, or at least a function call, is the explicit control flow jump.

kprotty avatar Jun 17 '21 12:06 kprotty

@kprotty:

The goal of the discussion is to shed more light on the "hows" and "whys" of asynchronous computing.

My post intentionally didn't touch implementation. I think one "why" might be "because we want the higher-level API to look like this". Actually, the high-level API is something which newcomers (like me) will be confronted with. Its readability and coherence with other parts of the standard library will be major arguments for why to choose Zig or not, at least for most people. That is why I shared my impressions with you, so you know how the current state made a random person think about it.

Spawning threads could be considered another thing it does to help do other stuff (simplify IO async frame scheduling) for you.

Well, most other languages would do it. But my impression from Zig was that the programmer should be way more in charge and explicit than usual.

I don't think so because the creation of threads doesn't disrupt the control flow of your program.

I guess it depends on the program type. But maybe in case it is a problem for someone, the person would bring his own "event loop" or "async library" for the special usage. I don't know.

Thanks for the reaction. Maybe I should have posted it on irc, not here, so this can be free of random gibberish and kept highly technical. Cheers!

vlada-dudr avatar Jun 17 '21 12:06 vlada-dudr

@vlada-dudr For the "hows" and "whys" part, I meant it as context to transition into why I don't think passing an "event loop" around could be equated to the Allocator approach, rather than as a critique of a missing impl. I believe your post did touch on "how" (an explicit loop ref) and "why" (similar to allocator).

Its readability and coherence with other parts of the standard library will be major arguments for why to choose Zig or not, at least for most people.

I think this is where I'm at a fork. On one hand, I would like to have a super-customizable and near-optimal async building-blocks library, since I enjoy polishing concepts. But on the other hand, there seems to be enough prioritizing of simplicity/easy understanding (even with discussions in other Zig communities (come join #async!)), which comes at a price to the former. Practically, both have optimization opportunities in mind, but I'm having trouble deciding if Zig (the stdlib rather) should target "efficient hardware utilization" as a high priority or "more efficient hw util than other environments" (or something else?). Have any thoughts on this meta issue?

my impression from Zig was that the programmer should be way more in charge and explicit than usual.

I'm a bit confused about how slippery this slope should be for the stdlib. In regards to the other useful features I noted that happen in start.zig before the traditional main() is called, should those be disabled by default and explicitly enabled? The reason this is slippery is that disabling them by default would make Zig no safer/easier than C for newcomers and could arguably subtract from their first experience, which tends to value ease-of-use from my understanding.

But maybe in case it is a problem for someone, the person would bring his own "event loop" or "async library" for the special usage.

Would be helpful to have an example to go over. I've recently played around with a bring-your-own-blocking impl for synchronization primitives which allows the same algorithms that implement things like Channels and Mutexes to work generically in blocking and asynchronous environments. Would it be something similar to this?
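For reference, here is a rough sketch of the shape I have in mind. The Event interface (default-initializable fields plus wait/set) is an assumption made up for illustration; a thread-blocking Event would park/unpark a thread, while an async Event would suspend and resume a frame:

```zig
fn Oneshot(comptime Event: type, comptime T: type) type {
    return struct {
        const Self = @This();

        // Event is assumed to provide the happens-before ordering between set() and wait().
        event: Event = .{},
        value: T = undefined,

        pub fn put(self: *Self, value: T) void {
            self.value = value;
            // Blocking Event: unpark the waiting thread. Async Event: resume the suspended frame.
            self.event.set();
        }

        pub fn get(self: *Self) T {
            // Blocking Event: park the thread. Async Event: suspend the frame until set().
            self.event.wait();
            return self.value;
        }
    };
}
```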

Maybe I should have posted it on irc, not here, so this can be free of random gibberish

I'm of the opinion that gibberish is fine so long as it's on topic to the discussion. An ok-ish but not so great example lol. I could try hopping on IRC later today to continue the discussion if needed.

kprotty avatar Jun 17 '21 13:06 kprotty

My input requires a bit of context, so I'll start with that. Firstly, I'd like to somewhat echo an earlier comment that said:

Assuming we continue with the idea of having an "event loop" abstraction, I would like to see from it some form of consistency across different OSs, at least when it comes to the big 3. More specifically, I believe one important goal should be to ensure, as much as possible, that if an application doesn't deadlock on OS Y it also doesn't deadlock on OS X (hah pun kinda intended).

Later on, the thread has moved more in the direction of discussing state-of-the-art approaches, e.g.:

It's also a perfect time for Zig to take advantage of this shift in API design. I would really want to start with io_uring and work backwards rather than the other way round. A worker threadpool or traditional approaches to event loops should be last-resort, not something driving our design. The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it.

With those in mind, this is what I'd like to say:

I think that there's a difference between a good/optimal/pick-your-word event loop[1] design, and one that is appropriate for a standard library. My experience so far has been that Zig works for a large number of targets, meaning both architectures and OSes (or lack thereof, but that's probably out of scope for stdlib discussions). Given that, opting for a linux-centric event loop design could end up being at least somewhat dangerous/counterproductive:

  • an advanced event loop is presumably harder to design, implement and (perhaps most importantly) maintain. Plus, there may be a shortage of stdlib authors/contributors that have the requisite background
  • if we consider "traditional" approaches to be "last-resort", we'd essentially be committing to a sub-par experience for non-Linux crowd. This might not be a big issue given the demographic of current userbase, but who knows what might happen in the future

I should say that I agree with the earlier quote The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it., but I don't think that this should come at the expense of general usability/approachability. I personally don't see a big issue with a lowest common denominator approach for the standard library event loop, and I don't think that goes against the quote - we can still have a modern, well made event loop for Linux[2], it just perhaps shouldn't be the canonical implementation throughout stdlib.

[1] or any other component for that matter
[2] maybe even in the standard library somewhere

kvelicka avatar Jul 06 '21 13:07 kvelicka

@kvelicka

opting for a linux-centric event loop design

FWIW io_uring isn't necessarily linux-centric, but was made popular by Linux since it's more flexible. The design of "batching" and "asynchronous" syscalls has been seen elsewhere, notably for high-performance I/O in the form of Windows RIO. The idea is basically just a queue of I/O operations that you submit to and reap the results from, with the ability to wait for results. Given that description, it makes the API a lot simpler - and it can be trivially modeled with existing APIs: libuv could be seen as a very close, but higher-level, relative.

an advanced event loop is presumably harder to design, implement and [...] maintain

I agree, but am of a different notion on which of the two approaches would be labeled "advanced". What I would like to argue is that while an efficient design of the "traditional" approach of a multi-threaded work-stealing event loop could be more familiar to set up and interact with for someone coming from, say, Go/Rust/Erlang, it would be much harder for a stdlib maintainer to change, understand, and modify. Designing an efficient scheduler for such an event loop is non-trivial, and when you look at advanced work-stealing schedulers like those of Go, tokio, and rayon, only a handful of contributors make feature/bug changes due to their large scope, multi-purpose algorithms, and sometimes complexity. This has been my experience as well, although I'm open to being convinced that the ramp-up time may not be as big of an issue in practice.

there may be a shortage of stdlib authors/contributors that have the requisite background

Once again, I think this rings true especially for efficient multi-threaded applications. Vetting them can be difficult and understanding atomics / memory ordering is known to be one of the "background specialization" areas IME.

we'd essentially be committing to a sub-par experience for non-Linux crowd

For some insight, we (@lithdew and I) recently created a prototype of an io_uring emulation layer on Linux (using the "traditional" epoll interface) and found it to be as fast as, and often faster than, using io_uring itself (probably due to the older 5.12 kernel @jorangreef mentioned). We also found that the traditional multi-threaded event loop, while providing marginally higher throughput for network requests than its competitors (Go/Tokio), was ultimately worse in throughput and average latency versus the "io_uring" (emulation or not) approach. This is all to say that it would be a close to universally sub-par experience (not only on Linux), given kqueue works similarly and Windows can be made to work the same way with the use of AFD, although I'm yet to test this perf claim for Windows.

I don't think that this should come at the expense of general usability/approachability

I agree as well. The thing is that the higher-level API of socket.read() being done asynchronously doesn't have to change between the two approaches. What changes is mostly how users decide to multi-thread the IO layer if they please, without having multi-threaded overhead by default:

  • The multi-threaded event loop abstraction moves that decision into the event loop implementation and makes selectively going single-threaded for IO need more thought & configuration. The upside is that you don't have to worry about dynamic work distribution across threads. The downside is that you have to synchronize common operations and leave scheduling decisions up to the event loop implementation.
  • The io_uring abstraction moves that decision to the user and makes going dynamically multi-threaded need more thought & configuration. The upside is that IO scales linearly based on ring instances without having to do much synchronization. The downside is that it favors static work distribution, and making it dynamic is more challenging.

I know I've been shilling the io_uring abstraction recently, but I have reasons for doing so:

  • Static work distribution for I/O-bound workloads seems to scale better than dynamic work distribution using the traditional approach, something which invalidated my expectations after having tried to optimize work stealing as much as possible.
  • Single-threaded I/O is much easier to reason about when it comes to cancellation and memory reclamation. A major issue @lithdew and I faced with the "traditional" approach was figuring out how to soundly cleanup epoll/kqueue intrusively registered memory in a multi-threaded setting.
  • Work-stealing is ideal when it comes to throughput. But with the I/O benchmarks and @jorangreef's idea of data/control planes, I realized that work-stealing is more so ideal for CPU-bound throughput. I've seen success in using the io_uring approach for I/O and offloading the blocking/CPU-intensive parts to a work-stealing thread pool.
  • Finally, we get to take advantage of a theoretically more efficient I/O abstraction API, even if it's not much better today. There's also been hints of other platforms like Windows getting an io_uring API in the future so it seems like a promising direction.

Even though I've changed my stance on the "traditional" approach recently, I'm still very much open to going back if the supporting arguments are technically sound or if such a design is just not worth it for the standard library. I just know that either abstraction has a ton of room for optimization opportunities.

kprotty avatar Jul 06 '21 15:07 kprotty

Firstly, I really appreciate the extensive and thoughtful response! And to be clear, I am not opposed to an io_uring[-like] interface - I'm primarily a Linux user myself and I don't currently have plans to develop anything for non-Linux targets, so I'm certainly not worried about incompatibility for my own sake. My worries are captured by what you said in the last paragraph: "...if such a design is just not worth it for the standard library". I'm happy to hear that you have thought about this topic a lot, and I only have a couple of things to add:

  • I don't think you mentioned OS X in your response; not sure what the situation with an io_uring-like approach is there
  • you mentioned performance/efficiency a few times, so I'd like to dig a bit deeper on that:

Does the Zig stdlib event loop need to be particularly high performance? I think there's value in an implementation that tries moderately hard to be correct (thus understandable) and consistent across platforms (to a feasible extent), while putting performance as a second/third priority. One of the things I find appealing about Zig is that a lot of things that'd often be implemented as language features in other langs are just code which can be plugged in and reused like any other. So, in principle, it shouldn't be difficult to plug in a high performance event loop instead of the std one[1]. And if that is difficult, perhaps looking at how to make that easier is a worthwhile avenue to explore?

[1] Having infra like this outside of the stdlib of course comes with its own set of issues and advantages, and I think that in the medium-long term those might be just as (if not more) important than the technical merits! C++'s asio might be an interesting case study - it's an async I/O library/framework that has been used in production for years, and is now becoming the basis for C++ standard library networking - I think having significant battle experience outside of the stdlib can be rather liberating as you're not as constrained by backwards compatibility, release schedule etc. Rust's async history/story is likely also relevant here but I've only watched that from afar so I don't have any useful insights there.

kvelicka avatar Jul 06 '21 17:07 kvelicka

I don't think you mentioned OS X in your response; not sure what the situation with an io_uring-like approach is there

Darwin (macOS, iOS, etc.) uses kqueue, an event notification scheme similar to Linux's epoll. The primary differences are that it supports batched "epoll_ctl"s via kevent(), it doesn't coalesce read & write events on the same fd, and it supports a wider variety of notifications without requiring an fd (timerfd, eventfd, inotify, etc.). Its integration into an io_uring emulation layer would be similar to epoll's in a lot of aspects.

Does the Zig stdlib event loop need to be particularly high performance?

It's one of those things that are implicitly in the Zig zen, particularly Avoid local maximums. The standard library tends to focus on being memory efficient as opposed to necessarily highly performant. You can see this in code by the heavy use of explicit allocators and intrusive memory. You can see this in the community with many zig libraries avoiding heap allocations when they can.

With this, however, somewhat performant code is also implied. This focus on memory/operational efficiency in many aspects of the stdlib (unicode, MultiArrayList, HashMap.Entry, Allocators) helps design APIs that are performant but also not too difficult to use. I think that is the direction the event loop should head towards as well, regardless of the abstraction class chosen. This means avoiding interface restrictions like having to dynamically allocate memory to schedule a callback/anyframe, avoiding contention where possible, and reducing required syscalls.

there's value in an implementation that tries moderately hard to be correct (thus understandable) and consistent across platforms

There's a difference between "correct" and "understandable": Goroutines and channels are "correct" and arguably "understandable", while Zig async is still "correct" but, from my experience, doesn't seem to be immediately "understandable". The good thing is that once Zig async is understood, it tends to be one of those "it's obvious in hindsight" things, like std.mem.Allocator, which is designed for efficiency (thus performance) without being too complicated (hopefully there's a pattern here; something I'm not sure I'm explaining correctly).

As for platform consistency, this looks like one of the criteria that I considered essential by default for a standard library abstraction so I didn't give it much attention in my response: I also think it should be of higher priority than "performance".

it shouldn't be difficult to plug in a high performance event loop instead of the std one

This is the hard question that needs to be answered: what should such an API look like that allows a high performance implementation to be plugged in? I'm of the opinion that the API choice (in this case "traditional multi-threaded" vs "io_uring single-threaded") is what affects the performance ceiling of the implementation[1] and that a poor API affects the efficiency of the implementation[2].

C++'s asio might be an interesting case study - it's an async I/O library/framework that has been used in production for years, and is now becoming the basis for C++ standard library networking

Boost ASIO appears to be effectively the "traditional multi-threaded event loop" approach, coupled with an MPSC channel called a "strand" and no cross-platform support for file I/O (unless I'm missing something here). I would categorize it similarly to other "traditional" approaches (i.e. Go, Rust's tokio).

[1] The POSIX synchronous file I/O interface limits SSD/M.2 drive performance.
[2] Rust async's Waker and poll() APIs force Futures (coroutines) and I/O buffers used in completion-based APIs like io_uring, POSIX AIO, and Windows IOCP to be dynamically tracked (usually via ref counting) and heap allocated (usually via a global allocator).

kprotty avatar Jul 06 '21 19:07 kprotty

I see, thank you again for the detailed responses - I've learned tons and my mind is at ease, you have certainly thought about this a lot!

kvelicka avatar Jul 06 '21 20:07 kvelicka

I support message passing as the only communication method between tasks, like Erlang/OTP. Functions should take an event loop argument if they want to use the event loop.

iacore avatar Nov 20 '21 16:11 iacore

@locriacyber Message passing is an easy-to-grok, unified communication scheme. But it's not mechanically efficient given the memory or serialization overhead it implies. Many times, shared memory is either more obvious or more resource efficient (e.g. Mutex), and I believe Zig should be explicit about those costs, like it is with memory allocation, and not provide only one high-level abstraction.

As for event loop by function argument (instead of global), it sounds interesting. Would things like File/Socket/Thread take an event loop?

kprotty avatar Nov 23 '21 14:11 kprotty

There is no cost if you only pass pointers around. The language user needs to make sure not to use a pointer after it has been sent. This calls for an event loop scoped memory allocator.

OS threads are irrelevant to the event loop. File and Socket operations generally don't need an event loop since they don't spawn new parallel tasks.

The current async/suspend feature in Zig is a language-level effect. A more elegant approach is to add an effect tracking system like the one in Koka allowing user-space defined suspend/resume. (That's a huge language feature.)

iacore avatar Nov 24 '21 17:11 iacore

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

iacore avatar Nov 24 '21 17:11 iacore

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

@locriacyber that's not how it works in Zig. The implementation details matter just as much as the high level user facing API. If you want to contribute to this discussion you need to familiarize yourself not only with the usage semantics of concurrent operations, but also with the tradeoffs involved with specific OS APIs and implementation designs. On top of that, we can't just trivially copy NodeJS because Zig is not garbage collected.

kristoff-it avatar Nov 24 '21 17:11 kristoff-it

@kprotty quoted @locriacyber:

As for event loop by function argument (instead of global), it sounds interesting. Would things like File/Socket/Thread take an event loop?

I really like the API/pattern for memory allocation. Makes everything nice and explicit. That said, what would such an API look like for event loops?

Would you have a special event loop type for when you want blocking calls? Or would you have two versions of each call, one for blocking and one for using an event loop?

kyle-github avatar Nov 24 '21 19:11 kyle-github

There is no cost if you only pass pointers around.

@locriacyber The cost is that the pointers have to be escaped as they no longer have a static/linear lifetime, which generally means heap allocation. This often isn't necessary when using other shared memory primitives like (async) Mutexes. There's also synchronization/serialization overhead in the channel implementations over the shared memory mutexes, depending on whether they're multi-consumer or multi-producer.

This calls for an event loop scoped memory allocator.

This would be the referenced overhead from earlier

OS threads are irrelevant to the event loop.

Erlang/BEAM enables SMP by default, similar to Golang. Is the event loop being referred to here single threaded?

File and Socket operations generally don't need an event loop since they don't spawn new parallel tasks.

Files and Socket APIs are connected to and depend on the event loop implementation, as the loop needs to be aware of them in order to poll IO asynchronously.

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

There's some discussion of removing the current multi-threaded event loop and replacing it with 3 things:

  • a cross platform io_uring abstraction
  • a single threaded event loop powered by it (sounds close to what you're describing)
  • a separate thread pool for running concurrent ops in ways to take advantage of parallelism

kprotty avatar Nov 26 '21 18:11 kprotty

On top of that, we can't just trivially copy NodeJS because Zig is not garbage collected.

@kristoff-it We want an event loop mostly because we want to spawn tasks that will clean themselves up after completion (non-linear, non-deterministic (when many tasks are queued), and delayed execution). The event loop will only manage the memory necessary to store the task's frame.

@kprotty Single-threaded event loop is good enough for I/O tasks. Even for video games, a single-threaded event loop is good enough. I need to read about io_uring.

iacore avatar Dec 15 '21 16:12 iacore

It's been a while, but wanted to check in on this again. Would the following be a good consensus?

  • ~~Remove the concept of io_mode~~ keep io_mode.
  • Single-Threaded event loop for IO-bound tasks (replaces std.event)
  • Multi-Threaded thread pool for CPU-bound tasks (std.Thread.Pool)
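As a rough usage sketch of how that split could be composed (the Loop half is hypothetical and shown as comments; the std.Thread.Pool calls follow the API as it exists today and may differ from whatever ends up shipping):

```zig
const std = @import("std");

fn checksumChunk(chunk: []const u8) void {
    _ = chunk; // CPU-bound work runs on the pool
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // CPU-bound tasks: multi-threaded std.Thread.Pool.
    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = gpa.allocator() });
    defer pool.deinit();

    const chunk: []const u8 = "some bytes to checksum";
    try pool.spawn(checksumChunk, .{chunk});

    // IO-bound tasks: single-threaded event loop (hypothetical API).
    // var loop = try Loop.init();
    // defer loop.deinit();
    // loop.run();
}
```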

kprotty avatar Feb 18 '22 15:02 kprotty

On Windows-based OSes, loop.accept could use ws32 AcceptEx, which supports overlapped I/O results and works with the current 0.10 loop design. Is someone working on this already? The current build only throws an "OS not supported" compile error.

gajanak avatar Mar 20 '22 10:03 gajanak

@gajanak I tried to add it but there are a few limitations, mainly with the way the current event loop tries to handle errors and the Windows worker calling finishOneEvent after pulling, making the task code harder to read and maintain. For more details see #10115

Sashiri avatar Mar 20 '22 15:03 Sashiri

I don't really like io_mode as it changes the language from underneath, along with many of the abstractions within the stdlib, without a way to use File and similar outside of the event loop. It would be nice if IO could be passed as a parameter (similar to Reader maybe?), as then the choice of which to use remains in the hands of the user of the event loop, without them having to implement their own variant of File and similar.

tauoverpi avatar Mar 21 '22 14:03 tauoverpi

@tauoverpi To better understand, would this make things like File, Stream, StreamServer type generic or would their methods take in the IO type?

Other food for thought: should this pattern also extend to other things that could possibly block like synchronization primitives and timed sleeps/waits? Or should it be strictly for IO, or just File IO?

kprotty avatar Mar 21 '22 16:03 kprotty

@kprotty Methods taking the IO type would possibly be cleaner, as the underlying type (e.g. File) doesn't change despite a different IO mode, and it allows its inclusion within containers without comptime generics at the container type level.
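A small sketch of the "methods take the IO type" shape; the names and the io interface here are illustrative only, not a proposed std API:

```zig
const std = @import("std");

const File = struct {
    fd: std.os.fd_t,

    // `io` decides how the read happens: blocking syscall, event loop, or a test double.
    pub fn read(self: File, io: anytype, buf: []u8) !usize {
        return io.read(self.fd, buf);
    }
};

// A trivial blocking implementation of the assumed interface.
const BlockingIo = struct {
    pub fn read(self: BlockingIo, fd: std.os.fd_t, buf: []u8) !usize {
        _ = self;
        return std.os.read(fd, buf);
    }
};
```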

I've used the type-generic approach with timers and serial IO to ease testing timeouts and various read failures, which has worked well (with hindsight, taking the IO on methods would have been easier to work with), so going beyond IO to sync primitives and more could work well for testing and alternate loop implementations.

As a side note, this does start to feel like a capability/effects system.

tauoverpi avatar Mar 22 '22 08:03 tauoverpi