risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

perf(storage): reduce calling async method with poll_next in iterator

Open wenym1 opened this issue 2 years ago • 11 comments

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

In previous benchmark, we have observed that calling too many async method may cause performance regression even if all types are static. Besides, we observe that in most call on next, we don't need to call async method and await. In most cases it's just advancing a logical pointer without any necessity to await on anything. Therefore, in this PR, we split the frequently called method next into two parts, poll_next and await_next. The poll_next will do all the non-async stuffs, and therefore it's not an async method, and the remaining async stuffs in next will be done in await_next. The poll_next method will return a core::task::poll enum, where Poll::Ready means that the next can be finished without calling await_next, and Poll::Pending means that we should further call await_next to finish the remaining async stuff.

The next will become

fn next(&mut self) -> Self::NextFuture<'_> {
        async {
            match self.poll_next() {
                Poll::Ready(result) => result,
                Poll::Pending => self.await_next().await,
            }
        }
    }

We run the benchmark bench_compactor and observe great improvement. Main: image

Current branch: image

The merge iterator using the normal concat sstable iterator wrapped by union has 35% improvement. The merge iterator using the normal concat sstable iterator has 21% improvement. The merge iterator using the the sstable-prefetched iterator has 5% improvement.

Checklist

  • [x] I have written necessary rustdoc comments
  • [x] I have added necessary unit tests and integration tests
  • [x] All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

wenym1 avatar Aug 08 '22 12:08 wenym1

Also the overhead of compactor is not caused by async / await itself. It's caused by the async_trait macro, which converts all Future into Box dyn Future. If we are using GAT everywhere, there should be little performance overhead.

skyzh avatar Aug 08 '22 12:08 skyzh

The poll_next naming seems confusing to the real Future::poll_next. :(

By the way, if there's no dyn Future on the path, I think the compiler is able to realize that and the generated state machine should be the same as calling a sync function.

BugenZhao avatar Aug 08 '22 12:08 BugenZhao

Codecov Report

Merging #4501 (6cb113f) into main (cc4224d) will increase coverage by 0.00%. The diff coverage is 86.02%.

@@           Coverage Diff            @@
##             main    #4501    +/-   ##
========================================
  Coverage   74.56%   74.56%            
========================================
  Files         848      848            
  Lines      124164   124332   +168     
========================================
+ Hits        92578    92705   +127     
- Misses      31586    31627    +41     
Flag Coverage Δ
rust 74.56% <86.02%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/storage/src/lib.rs 100.00% <ø> (ø)
src/storage/src/hummock/iterator/mod.rs 56.66% <57.14%> (-2.03%) :arrow_down:
...e/src/hummock/shared_buffer/shared_buffer_batch.rs 92.95% <63.63%> (-0.92%) :arrow_down:
...ge/src/hummock/iterator/compact_concat_iterator.rs 67.70% <71.42%> (-3.18%) :arrow_down:
src/storage/src/hummock/iterator/concat_inner.rs 90.51% <88.23%> (-1.44%) :arrow_down:
src/storage/src/hummock/iterator/merge_inner.rs 86.22% <94.28%> (-2.05%) :arrow_down:
...e/src/hummock/sstable/backward_sstable_iterator.rs 96.77% <95.83%> (-0.85%) :arrow_down:
...ge/src/hummock/sstable/forward_sstable_iterator.rs 93.54% <97.56%> (+0.48%) :arrow_up:
... and 4 more

:mega: Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

codecov[bot] avatar Aug 08 '22 12:08 codecov[bot]

The poll_next naming seems confusing to the real Future::poll_next. :(

By the way, if there's no dyn Future on the path, I think the compiler is able to realize that and the generated state machine should be the same as calling a sync function.

Yes, I just realized that. This PR is not implementing Future. It just splits the non-async thing out of the original function. I would prefer not naming it as poll_next, and don't use std::task::Poll here to avoid confusion.

skyzh avatar Aug 08 '22 12:08 skyzh

By the way, if there's no dyn Future on the path, I think the compiler is able to realize that and the generated state machine should be the same as calling a sync function.

I used to think so. But some benchmark result is really confusing, and it seems that we can get great improvement if we avoid using await. I suspect that async method can cause some regression for compiler code optimization.

wenym1 avatar Aug 08 '22 12:08 wenym1

I would prefer not naming it as poll_next, and don't use std::task::Poll here to avoid confusion.

Do you have any suggestion on the naming? I have been thinking about it the first day I wrote this PR 🤣

wenym1 avatar Aug 08 '22 12:08 wenym1

Initially I guess the compiler somehow cannot do LTO without this PR, but I'm wrong. I removed these two lines from Cargo.toml.

codegen-units = 1
lto = 'thin'

After this PR, with LTO

bench_union_merge_iterator                                                                            
                        time:   [90.442 ms 90.546 ms 90.663 ms]

Before this PR, with LTO

bench_union_merge_iterator                                                                            
                        time:   [111.53 ms 111.67 ms 111.84 ms]

Before this PR, w/o LTO

bench_union_merge_iterator                                                                            
                        time:   [149.62 ms 149.88 ms 150.15 ms]

After this PR, w/o LTO

bench_union_merge_iterator                                                                            
                        time:   [127.23 ms 127.52 ms 127.81 ms]

So even w/o LTO, we can get performance improvement with this PR.

One thing to notice is that we currently don't enable LTO in release builds. With LTO we can get 149.88 ms -> 111.67 ms = 25% improvement. So LTO can already do a lot, we can enable LTO in release builds to have better performance. For this PR, I would prefer making the code clear to read, to the 15% further improvement.

skyzh avatar Aug 08 '22 12:08 skyzh

The poll_next naming seems confusing to the real Future::poll_next. :( By the way, if there's no dyn Future on the path, I think the compiler is able to realize that and the generated state machine should be the same as calling a sync function.

Yes, I just realized that. This PR is not implementing Future. It just splits the non-async thing out of the original function. I would prefer not naming it as poll_next, and don't use std::task::Poll here to avoid confusion.

+1. 🥵

BugenZhao avatar Aug 09 '22 06:08 BugenZhao

How about use poll_next in compact ? For example:

async fn compact(...) {
    while iter.is_valid() {
        iter.poll_next();
        if iter.is_pending() {
            iter.await_next().await;
        }
    }
}

Little-Wallace avatar Aug 10 '22 06:08 Little-Wallace

How about use poll_next in compact ? For example:

async fn compact(...) {
    while iter.is_valid() {
        iter.poll_next();
        if iter.is_pending() {
            iter.await_next().await;
        }
    }
}

The next of iter in this PR is already similar to

async fn next() {
    iter.poll_next();
    if iter.is_pending() {
        iter.await_next().await;
    }
}

So the mentioned

async fn compact(...) {
    while iter.is_valid() {
        iter.poll_next();
        if iter.is_pending() {
            iter.await_next().await;
        }
    }
}

will be almost the same as

async fn next() {
    while iter.is_valie() {
        iter.next().await
    }
}

except for one more layer of await called in compact.

wenym1 avatar Aug 10 '22 09:08 wenym1

But the previous method would not generate a structure NextFuture.... I think this structure NextFuture will be much heavy than Poll

Little-Wallace avatar Aug 10 '22 11:08 Little-Wallace