Slow performance on Linux?
Hi, I have written a wrapper util on top of ipc_channel that handles the handshake, swaps channels between the host/child, and adds a request/response API.
The performance on my M1 MBP was great, but I was surprised to find that the performance on Linux was significantly slower!
So I wrote a benchmark to test it out. The benchmark sends n requests, blocking on their responses (100k requests means 200k messages over the channel).
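For reference, this is roughly the shape of the blocking round-trip loop: a minimal sketch assuming ipc_channel's `channel()`/`send()`/`recv()` API and serde-derived message types. The real benchmark in the repo runs the responder in a separate child process reached via the handshake; a thread stands in for it here to keep the sketch short.

```rust
use std::thread;
use std::time::Instant;

use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Request(u32);

#[derive(Serialize, Deserialize)]
struct Response(u32);

fn main() {
    let n = 100_000u32;

    let (req_tx, req_rx) = ipc::channel::<Request>().unwrap();
    let (res_tx, res_rx) = ipc::channel::<Response>().unwrap();

    // Stand-in for the child process: answer every request.
    thread::spawn(move || {
        while let Ok(Request(id)) = req_rx.recv() {
            let _ = res_tx.send(Response(id));
        }
    });

    let start = Instant::now();
    for i in 0..n {
        // One request, then block on its response: two messages per iteration.
        req_tx.send(Request(i)).unwrap();
        let _ = res_rx.recv().unwrap();
    }
    println!("{n} round trips in {:?}", start.elapsed());
}
```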
I'm not sure if it's my configuration (perhaps something else is interfering), but here are my results:
Hardware
- Windows: AMD 5950x - Windows 10
- Linux: AMD 5950x - Fedora 39
- MacOS: M1 Macbook Pro
Results
| Platform | Message count | Duration |
|---|---|---|
| macOS | 10k | 0.487s |
| Windows | 10k | 0.356s |
| Linux | 10k | 2.301s |
| macOS | 100k | 1.550s |
| Windows | 100k | 3.497s |
| Linux | 100k | 13.608s |
| macOS | 1m | 14.404s |
| Windows | 1m | 34.769s |
| Linux | 1m | 150.514s |
Time taken for n round trip messages - Lower is better
I have tried with/without the memfd option enabled, and I have tried making this async (using tokio channels/threads), with the same outcome.
This is my wrapper (benchmarks are under examples)
https://github.com/alshdavid/ipc-channel-adapter
To run the benchmark, run `just bench {number_of_requests}`, e.g. `just bench 100000`.
I'm investigating whether another dependency is interfering and will update with my findings - but on the surface, any idea why this might be?
When running the benchmark using tokio, sending all the requests at once and waiting for them to return concurrently, the results are a lot better.
Tested with `just bench-async` (a sketch of the concurrent pattern follows the results table).
| Platform | Message count | Duration |
|---|---|---|
| macOS | 100k | 1.176s |
| Windows | 100k | 0.368s |
| Linux | 100k | 4.026s |
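This is not the actual bench-async code, but a self-contained sketch of the concurrent pattern, assuming the tokio and futures crates. A local task stands in for the IPC wrapper; the point is only the "fire everything, then await all responses together" shape.

```rust
use futures::future::join_all;
use tokio::sync::{mpsc, oneshot};

#[tokio::main]
async fn main() {
    let n = 100_000u32;

    // Stand-in responder: answers each request over a oneshot channel.
    let (req_tx, mut req_rx) = mpsc::unbounded_channel::<(u32, oneshot::Sender<u32>)>();
    tokio::spawn(async move {
        while let Some((id, reply)) = req_rx.recv().await {
            let _ = reply.send(id);
        }
    });

    // Issue all requests up front...
    let pending: Vec<_> = (0..n)
        .map(|i| {
            let (tx, rx) = oneshot::channel();
            let _ = req_tx.send((i, tx));
            rx
        })
        .collect();

    // ...then wait for every response concurrently.
    let responses = join_all(pending).await;
    assert_eq!(responses.len(), n as usize);
}
```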
I was able to replicate this on Ubuntu. Wonder where the performance loss is occurring.
Hello, we've been testing this benchmark on our own systems. When plugged in, we see benchmark results in line with the ones you have posted for non-Linux platforms, @alshdavid. That said, we've noticed that power-saving mode, or throttling due to being unplugged, has a massive effect on the results. For instance, when I switch my machine to "Power Saver" in Gnome the results I get are:
Ryzen 7 7840U / Ubuntu
| Power Save | Count | Duration |
|---|---|---|
| Off | 100k | 3.072s |
| Off | 1m | 34.654s |
| On | 100k | 7.392s |
| On | 1m | 71.326s |
Macbook M3 Max
| Energy Mode | Count | Duration |
|---|---|---|
| High | 100k | 2.389s |
| High | 1m | 22.772s |
| Low | 100k | 2.720s |
| Low | 1m | 26.808s |
Perhaps what's happening here is that the Linux implementation is very sensitive to power saving mode.
I can confirm the same (i.e. worse performance in power-saving mode, and numbers on par with OP's Windows and macOS results in performance mode) on NixOS 24.05, 24 × 12th Gen Intel® Core™ i7-12800HX, 64GB RAM
| Message count | Power saving | Performance mode |
|---|---|---|
| 100K | 21.441s | 3.821s |
| 1M | 67.033s | 24.460s |
Although the above measurements show Linux to be ten times slower than macOS and five times slower than Windows, it's not clear to me why this is unexpected. The platform layer has distinct code for Linux, macOS, and Windows based on completely different OS primitives, so some performance differences would not be surprising. In particular, I wonder if the macOS support uses Mach ports, rather than BSD features, for better performance.
I'm also curious whether a factor of ten in these benchmarks represents a measurable performance problem for Servo (or for other projects consuming IPC channel, if there are any).
(I found one Servo issue specifically about layout of real world web pages being up to two times slower on Linux, when using "parallel" rather than "sequential" layout, but I have no idea if that could be caused by IPC channel performance differences.)
We were evaluating using IPC channels at Atlassian for a project that has a Rust core which calls out to external processes (Nodejs and other runtimes) to execute "plugin" code.
However, the messaging overhead on Linux machines made it impractical, so we started looking at alternative options. IPC is certainly still preferred, as it's far simpler and a much nicer mental model than the alternatives.
Thanks @alshdavid. Although Servo is probably the main consumer of IPC channel, I would be grateful for more information about your use case:
- How much faster would IPC channel have had to be to make its use practical for you?
- Did you find an alternative on Linux with acceptable performance?
- If so, was the alternative IPC-based, did it avoid IPC completely, or what?
We are writing web build tooling, specifically the Atlaspack bundler, in Rust to help improve the feedback loop for developers working on internal projects.
At the moment Atlaspack is a fork of Parcel that is being incrementally rewritten to Rust.
The Rust core needs to call out to plugins written in JavaScript (essentially middleware for phases of the build). We intend to expand support for other languages.
Nodejs has the capability to consume Rust code in the form of a dynamic C library, where we use Node's bindings to convert the Rust API to JavaScript (Go, Python, etc. also share this capability).
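For illustration, this is roughly what such a binding looks like: a minimal sketch assuming the napi and napi_derive crates (napi-rs). The crate is built as a cdylib, Node.js loads the resulting .node addon, and the function appears as a plain JavaScript function. `transform_asset` is a made-up name, not part of our actual API.

```rust
use napi_derive::napi;

// Exposed to JavaScript as `transformAsset(source)` once the cdylib is
// loaded as a .node addon.
#[napi]
pub fn transform_asset(source: String) -> String {
    // Real plugin work would happen here; echoing keeps the sketch short.
    source
}
```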
The initial thinking was that we could create a separate Nodejs package that acted as a client for the IPC server provided by the core. That way, to add language support, we just need to create a new language specific client package that consumes the IPC API we design.
The problem is that this is very chatty (millions of requests over IPC) and the overhead quickly adds up to something substantial.
Alternatives
Embed the runtime
One option we looked at is embedding the runtime within the core, either statically or by loading it as a dynamic C library.
The downside is that this increases the binary size, locks the version of Nodejs to the one supplied by the library (which can cause incompatibilities), complicates the story for statically compiled libraries, and increases the complexity/build time for CI/CD.
Wasm / Wasi
Maybe? Need to explore this option further
Embed the bundler within Nodejs
This is what we are currently going with until we can think of a better solution. The Rust port will take some time so we are hoping we will find a better solution eventually
This involves building the entire bundler as a Nodejs NAPI module (compiling the bundler as a dynamic C library consumed by a Nodejs entry point) and running it from within a Nodejs host process.
This limits the ability to use different languages, increases the complexity, and is harder to reason about, as the entry point is a JavaScript file that jumps into Rust, which in turn jumps back and forth into Nodejs and Nodejs worker threads.
Compared to this approach, 1 million IPC messages add an overhead of +30s to +60s, which matters because we are aiming for an overall complete build time of ~60s.
That's helpful - thank you. So it seems we don't yet have evidence that any multi-process implementation could perform sufficiently well for very chatty use cases such as yours on Linux.
Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?
I am toying around with the idea of using shared memory between the processes to store Rust channels which act as a bridge - though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.
Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes. Though I don't know if the receiving process can access the referenced value or if the OS prevents this (virtualized memory?).
Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross process shared heap
> Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?
I believe IPC channel is predicated on (de)serialising values sent across the channel. So I suspect "direct" transmission of values is beyond the scope of IPC channel.
> I am toying around with the idea of using shared memory between the processes to store Rust channels which act as a bridge - though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.
Shared memory or memory mapped files are likely part of any performant solution. Indeed the current implementation already uses shared memory.
These resources may be useful:
- https://users.rust-lang.org/t/shared-memory-for-interprocess-communication/92408
- https://stackoverflow.com/questions/14225010/fastest-technique-to-pass-messages-between-processes-on-linux
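To make that concrete, here is a minimal sketch of memory that two processes can both see, via a memory-mapped file, assuming the memmap2 crate. The path under /dev/shm is Linux-specific and purely illustrative; synchronisation, signalling, and message framing are all omitted, and a real solution would need them.

```rust
use std::fs::OpenOptions;

use memmap2::MmapMut;

fn main() -> std::io::Result<()> {
    // Both processes open and map the same file-backed region.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/dev/shm/ipc-demo-buffer")?;
    file.set_len(4096)?;

    // Safety: nothing else truncates or remaps the file while we hold the map.
    let mut map = unsafe { MmapMut::map_mut(&file)? };

    // Bytes written here are visible to any other process mapping the file.
    map[..5].copy_from_slice(b"hello");
    map.flush()?;
    Ok(())
}
```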
> Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes. Though I don't know if the receiving process can access the referenced value or if the OS prevents this (virtualized memory?).
> Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross process shared heap
I personally think sharing (part of) the Rust heap between processes is a non-starter. It might be possible to build a library for managing shared memory or memory-mapped files as a way of passing values between processes, but that's likely to be a large piece of work.
That said, it feels to me that this discussion is going beyond an issue against the current IPC channel implementation and is getting into the realm of speculating about better alternatives. Would you be comfortable closing the issue?
True, I am happy to close this issue. Thanks for helping out 🙏
@alshdavid Thanks and I wish you good progress with https://github.com/atlassian-labs/atlaspack.