
rt: optimize spawn functions

Open yiyuanliu opened this issue 3 years ago • 2 comments

Motivation

Tokio doesn't perform well when spawning large futures. There are many function calls on the spawn code path, and futures are passed by value between these functions. The compiler cannot reliably optimize away the resulting data copies. For larger futures (several KB to tens of KB), the impact on performance cannot be ignored. Similarly, on devices with limited stack space, the wasted stack space also needs consideration.

Under Rust 1.58, using the default release-mode configuration, spawning a 3MB future without awaiting the JoinHandle causes a stack overflow. The stack size is 8MB. This suggests there are three copies on the stack before the future is placed in the task cell on the heap.

In addition, there is extra overhead when awaiting async tasks. In the implementation of take_output, the entire CoreStage is copied even when the output type is the unit type (). It is odd that LLVM does not optimize this code.

pub(super) fn take_output(&self) -> super::Result<T::Output> {
    use std::mem;

    self.stage.with_mut(|ptr| {
        // Safety: the caller ensures mutual exclusion to the field.
        match mem::replace(unsafe { &mut *ptr }, Stage::Consumed) {
            Stage::Finished(output) => output,
            _ => panic!("JoinHandle polled after completion"),
        }
    })
}
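The cost can be reproduced outside Tokio with a small std-only sketch (the Stage enum below is a simplified stand-in, not Tokio's actual type): mem::replace moves the whole enum value, whose size is dominated by its largest variant, even when the payload we actually want is tiny.

```rust
use std::mem;

// Simplified stand-in for Tokio's CoreStage (illustrative only).
#[allow(dead_code)]
enum Stage {
    Running([u8; 4096]), // the future's state dominates the enum's size
    Finished(u8),        // a tiny output
    Consumed,
}

fn take_output(slot: &mut Stage) -> u8 {
    // mem::replace moves the entire enum value (including the 4 KiB of
    // variant space) out of `slot`, even though only the one-byte
    // payload is wanted.
    match mem::replace(slot, Stage::Consumed) {
        Stage::Finished(out) => out,
        _ => panic!("output not ready"),
    }
}

fn main() {
    let mut slot = Stage::Finished(7);
    assert_eq!(take_output(&mut slot), 7);
    assert!(matches!(slot, Stage::Consumed));
    println!("ok");
}
```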

It would benefit both performance and stack usage if we could eliminate or reduce data copying when spawning futures.

Solution

Add a benchmark to test spawn performance for futures of different sizes. The spawned futures contain only an uninitialized memory region; they do nothing and become ready immediately.

Add a new type, UninitTask. UninitTask creates an uninitialized (or, to be precise, partially initialized) task cell on the heap that contains only a valid future. An UninitTask can later be bound to a scheduler to complete initialization (task header, vtable, trailer, etc.) and be converted into a RawTask.

In this way the spawn function can move the future to the heap immediately. Since the code path is shorter, it is easier for the compiler to optimize away data copies with inlining hints.

When spawning a future on a runtime, we do not know the actual scheduler type (basic or thread pool) until we check the runtime handle. The solution is to make the task cell independent of the scheduler used by the runtime. This is achieved by wrapping the basic-scheduler and thread-pool-scheduler handles in a union.
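A minimal sketch of the union idea (the types and names below are invented for illustration; Tokio's real handles are reference-counted): the task cell stores a union of the two possible handles so its layout does not depend on which runtime flavor ends up owning the task, with a separate tag recording which variant is live.

```rust
use std::mem::ManuallyDrop;

// Illustrative stand-ins for the two scheduler handle types.
#[derive(Clone, Copy)]
struct BasicHandle(u32);
#[derive(Clone, Copy)]
struct ThreadPoolHandle(u32);

// One layout regardless of runtime flavor.
union SchedulerHandle {
    basic: ManuallyDrop<BasicHandle>,
    pool: ManuallyDrop<ThreadPoolHandle>,
}

// Which union field is live would be recorded in the task header.
enum Kind {
    Basic,
    Pool,
}

fn handle_id(kind: &Kind, h: &SchedulerHandle) -> u32 {
    // Safety: `kind` tracks which union field was initialized.
    unsafe {
        match kind {
            Kind::Basic => h.basic.0,
            Kind::Pool => h.pool.0,
        }
    }
}

fn main() {
    let h = SchedulerHandle { basic: ManuallyDrop::new(BasicHandle(1)) };
    assert_eq!(handle_id(&Kind::Basic, &h), 1);
    let h = SchedulerHandle { pool: ManuallyDrop::new(ThreadPoolHandle(2)) };
    assert_eq!(handle_id(&Kind::Pool, &h), 2);
    println!("ok");
}
```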

The take_output function is rewritten so that it no longer copies the entire CoreStage.

Benchmark

Rust: 1.58.1-x86_64-unknown-linux-gnu
CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

Before optimization:

+--------+-------------------------------------------+-------------------------------------------+
|        |              current_thread               |               multi_thread                |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
|  size  | spawn 1  | spawn 10 | spawn 50 | block_on | spawn 1  | spawn 10 | spawn 50 | block_on |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
|     16 |   218 ns |  1451 ns |  6509 ns |   104 ns |  5731 ns |  7951 ns | 17890 ns |    72 ns |
|     32 |   225 ns |  1557 ns |  7191 ns |   103 ns |  5673 ns |  8719 ns | 17121 ns |    71 ns |
|     64 |   221 ns |  1603 ns |  7817 ns |   107 ns |  5603 ns |  8199 ns | 17898 ns |    81 ns |
|    128 |   225 ns |  1551 ns |  7938 ns |   104 ns |  5687 ns |  8232 ns | 19541 ns |    74 ns |
|    256 |   231 ns |  1602 ns |  8523 ns |   106 ns |  5758 ns |  8898 ns | 20555 ns |    74 ns |
|    512 |   234 ns |  1666 ns |  9353 ns |   112 ns |  5724 ns |  8664 ns | 19269 ns |    75 ns |
|   1024 |   289 ns |  2183 ns | 11227 ns |   123 ns |  5821 ns |  8722 ns | 21506 ns |    79 ns |
|   2048 |   313 ns |  2726 ns | 12730 ns |   134 ns |  5810 ns |  9623 ns | 25184 ns |    91 ns |
|   4096 |   386 ns |  3674 ns | 34464 ns |   171 ns |  6240 ns |  9325 ns | 48225 ns |   103 ns |
|   8192 |   593 ns |  5796 ns |105036 ns |   275 ns |  6303 ns |  9997 ns |127638 ns |   135 ns |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

After:

+--------+-------------------------------------------+-------------------------------------------+
|        |              current_thread               |               multi_thread                |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
|  size  | spawn 1  | spawn 10 | spawn 50 | block_on | spawn 1  | spawn 10 | spawn 50 | block_on |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
|     16 |   216 ns |  1407 ns |  6348 ns |   101 ns |  5741 ns |  7453 ns | 17668 ns |    73 ns |
|     32 |   216 ns |  1487 ns |  6808 ns |   103 ns |  5543 ns |  8428 ns | 20845 ns |    72 ns |
|     64 |   218 ns |  1537 ns |  7552 ns |   103 ns |  5813 ns |  8405 ns | 17758 ns |    71 ns |
|    128 |   212 ns |  1516 ns |  7569 ns |   104 ns |  5669 ns |  7596 ns | 17560 ns |    74 ns |
|    256 |   219 ns |  1447 ns |  7572 ns |   107 ns |  5678 ns |  8174 ns | 20528 ns |    76 ns |
|    512 |   218 ns |  1494 ns |  7633 ns |   114 ns |  5757 ns |  8087 ns | 20844 ns |    79 ns |
|   1024 |   244 ns |  1712 ns |  8234 ns |   121 ns |  5612 ns |  8874 ns | 20940 ns |    83 ns |
|   2048 |   258 ns |  1850 ns |  8001 ns |   132 ns |  5703 ns |  8919 ns | 21308 ns |    91 ns |
|   4096 |   247 ns |  1753 ns | 20062 ns |   167 ns |  5760 ns |  9056 ns | 32155 ns |   105 ns |
|   8192 |   255 ns |  1739 ns | 39311 ns |   268 ns |  5652 ns |  8721 ns | 51768 ns |   152 ns |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+

When spawning 50 futures of 512 bytes (a pretty common usage, I guess) on the current-thread runtime, the improvement is already noticeable (9353 ns vs. 7633 ns). For very large futures, the performance improvement is very significant.

yiyuanliu avatar Feb 10 '22 03:02 yiyuanliu

I'm not a big fan of the UninitTask strategy. How about the following alternative:

  1. Create an "owning" raw-pointer type which can own a value on the stack, but be pointer-sized itself.
  2. Instead of passing around the future, pass around the owning stack pointer.
  3. After allocating memory, move ownership of the future from the stack into the allocation.
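Under the stated assumptions (names like StackOwner are hypothetical, and a real implementation would need careful unsafe reasoning about ownership and panics), the three steps above could be sketched as:

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::marker::PhantomData;
use std::mem::ManuallyDrop;
use std::ptr;

// A pointer-sized "owner" of a value that still lives on the caller's
// stack. Inner functions pass this one-word value around instead of the
// large future itself.
struct StackOwner<T> {
    ptr: *mut T,
    _marker: PhantomData<T>,
}

impl<T> StackOwner<T> {
    // Takes logical ownership of `slot`'s contents; the caller must not
    // use or drop `slot` afterwards (hence ManuallyDrop at the call site).
    unsafe fn new(slot: &mut ManuallyDrop<T>) -> Self {
        StackOwner { ptr: &mut **slot as *mut T, _marker: PhantomData }
    }

    // The single stack-to-heap copy: allocate, then move the bytes over.
    fn into_box(self) -> Box<T> {
        unsafe {
            let layout = Layout::new::<T>();
            let dst = alloc(layout) as *mut T;
            if dst.is_null() {
                handle_alloc_error(layout);
            }
            ptr::copy_nonoverlapping(self.ptr, dst, 1);
            Box::from_raw(dst)
        }
    }
}

fn main() {
    // A large value on the stack, exempted from normal drop handling.
    let mut big = ManuallyDrop::new([42u8; 16 * 1024]);
    // Only a pointer travels through the (hypothetical) spawn call chain.
    let owner = unsafe { StackOwner::new(&mut big) };
    let boxed = owner.into_box();
    assert_eq!(boxed[0], 42);
    assert_eq!(boxed.len(), 16 * 1024);
    println!("ok");
}
```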

Darksonn avatar Feb 10 '22 09:02 Darksonn

> I'm not a big fan of the UninitTask strategy. How about the following alternative:
>
>   1. Create an "owning" raw-pointer type which can own a value on the stack, but be pointer-sized itself.
>   2. Instead of passing around the future, pass around the owning stack pointer.
>   3. After allocating memory, move ownership of the future from the stack into the allocation.

I considered this option before finally settling on UninitTask.

I think UninitTask gives the compiler more opportunity to optimize. With owning stack pointers, there are still many function calls before the future is moved to the heap, so the optimizations the compiler can perform are limited. I'm not sure, but I suspect that in this case the future would still be initialized on the stack and then copied to the heap. Although performance would improve compared to the existing code, there would still be one memory copy.

With UninitTask, the compiler can sometimes even initialize the future directly on the heap. This optimization does not always kick in: in the benchmark there still appears to be some data copying, since spawn performance drops as the future size grows. Still, UninitTask gives the compiler more opportunity to optimize.

yiyuanliu avatar Feb 11 '22 02:02 yiyuanliu

Any progress on this nice PR? :)

gustav3d avatar Oct 12 '22 04:10 gustav3d

This PR is probably not getting merged in its current state. It adds a lot of complexity, and I still believe in the alternative strategy I suggested earlier. Rust isn't even able to elide the init-on-stack-first copy with Box::new(LargeStruct { ... }), so I don't see how it would ever be able to do so in tokio::spawn.
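A small std-only illustration of that limitation and the manual allocate-then-initialize pattern that sidesteps it (a sketch for exposition, not a claim about what rustc does in every case):

```rust
use std::mem::MaybeUninit;

fn main() {
    // Box::new(value) evaluates `value` before the call, so a large value
    // is typically materialized on the stack and then copied into the
    // allocation; rustc does not reliably elide that copy.
    let a: Box<[u8; 1024]> = Box::new([1u8; 1024]);
    assert_eq!(a[0], 1);

    // The manual workaround: allocate first, then initialize the heap
    // memory in place through a raw pointer, so the large value never
    // needs to exist fully formed on the stack. (Newer Rust also offers
    // Box::new_uninit for the first step.)
    let mut b: Box<MaybeUninit<[u8; 1024]>> = Box::new(MaybeUninit::uninit());
    let p = b.as_mut_ptr() as *mut u8;
    for i in 0..1024 {
        unsafe { p.add(i).write(1) };
    }
    // Safety: every byte was just initialized.
    let first = unsafe { *(b.as_ptr() as *const u8) };
    assert_eq!(first, 1);
    println!("ok");
}
```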

Darksonn avatar Oct 12 '22 13:10 Darksonn

Regardless, thank you for taking the time to submit the PR. It is useful to have an implementation of this strategy with benchmarks, so that we may compare the performance if I ever get around to implementing my earlier suggestion.

Darksonn avatar Oct 12 '22 13:10 Darksonn