coil icon indicating copy to clipboard operation
coil copied to clipboard

Improve RealImageLoader.execute to properly call async request

Open mlykotom opened this issue 1 year ago • 9 comments

I was investigating the performance of rememberAsyncImagePainter and found out the AsyncImagePainter.onRemembered is the longest call inside ~0.25ms for each image. image

When I was investigating more, I think the reason of this is that the RealImageLoader.execute request uses async call with dispatcher Main.immediate, but the coroutine is already called with Main.immediate. If I understand things correclty, this can't be called in parallel and therefore is called sequentially, hence the duration.

This should improve #1866 .

before

ScrollBenchmark_fullCompilation
coil.compose.rememberAsyncImagePainter%AverageMs   min     0,5,   median     0,6,   max     0,7
coil.compose.rememberAsyncImagePainter%Count       min   156,0,   median   173,0,   max   190,0
coil.compose.rememberAsyncImagePainter%Ms          min    81,0,   median    99,9,   max   128,2
timeToInitialDisplayMs                             min   525,2,   median   547,1,   max   579,6
frameDurationCpuMs                                 P50     10,5,   P90     31,0,   P95     42,1,   P99    315,3
frameOverrunMs                                     P50     -7,5,   P90     17,1,   P95     63,4,   P99  1 170,9
Traces: Iteration 0 1 2 3 4 5 6 7 8 9

after

ScrollBenchmark_fullCompilation
coil.compose.rememberAsyncImagePainter%AverageMs   min     0,3,   median     0,4,   max     0,5
coil.compose.rememberAsyncImagePainter%Count       min   154,0,   median   174,0,   max   184,0
coil.compose.rememberAsyncImagePainter%Ms          min    51,4,   median    70,6,   max    83,1
timeToInitialDisplayMs                             min   521,4,   median   563,9,   max   605,9
frameDurationCpuMs                                 P50      9,6,   P90     30,7,   P95     45,1,   P99    301,4
frameOverrunMs                                     P50     -8,1,   P90     22,8,   P95     52,3,   P99  1 172,8
Traces: Iteration 0 1 2 3 4 5 6 7 8 9

From the benchmarks, I can see that the frame duration was improved slightly for the scroll.

before : frameDurationCpuMs  P50     10,5,   P90     31,0,   P95     42,1,   P99    315,3
after    : frameDurationCpuMs  P50      9,6, (-9%)   P90     30,7, (-1%)   P95     45,1, (+6%)  P99    301,4 (-5%)

This solution is probably not sufficient as the dispatcher probably needs to be passed, or called somewhere else, or even this might be done differently all together.

I have experimented with this in 2.x branch, but I see a similar code is in 3.x.

mlykotom avatar Apr 16 '24 13:04 mlykotom

Thanks for taking a look at this! Though I'm a bit confused - won't async(Dispatchers.Main.immediate) always run sequentially (even if wrapped with withContext(Dispatchers.Default)) since Dispatchers.Main.immediate is a single threaded dispatcher?

Unfortunately I don't think we can use an non-main thread dispatcher here as it breaks returning images from the memory cache synchronously - this is the failing test.

Do you know if RealImageLoader.executeMain is taking up the bulk of the AsyncImagePainter.onRemembered time? There are some things that Coil needs to run on the main thread, but I'm curious if that's what's slowing it down at the moment or if it's something else.

colinrtwhite avatar Apr 17 '24 04:04 colinrtwhite

I sort of expected something to break, but I was a bit in a rush and wanted to start the conversation going 😇

I pushed my wip code with experiments (ScrollBenchmark) that can produce the system trace and a method trace. Here's the breakdown of the method trace image

You can run the benchmark as well and get the method trace for analysis. Given it's a method trace, we don't look at the timing information, because everything is skewed, but more on the ratio between sections.

mlykotom avatar Apr 17 '24 08:04 mlykotom

Okay, I tried a bit playing with this if something could be triggered earlier, but I'm not able to solve the memory cache blocking request. I thought there could be something running in parallel with the async call and await it at the apropriate time to basically have something like "if not ready, block". But I don't understand the details of when things are called.

mlykotom avatar Apr 17 '24 10:04 mlykotom

Interesting thanks for digging in! Looks like there are two main avenues to improve perf like you called out:

  • Speed up scope.launch in AsyncImagePainter.onRemembered: I'm not sure what we can do here given the bulk of the time looks to be coroutines library-related. The result of the code inside scope.launch should be pretty quick.
  • Don't block when ConstraintSizeResolver.size is called: I'm surprised that this is a large percentage of the runtime given SizeResolver.size should suspend instead of block while waiting for non-zero constraints. Do you know what could be the issue?

colinrtwhite avatar Apr 22 '24 21:04 colinrtwhite

  1. Part of the scope.launch in onRemembered is because of reading the state, which happens twice - request and imageLoader. I'm thinking whether the imageLoader has to be as a mutableStateOf. I don't see this used for compose directly, so this might be a small optimization.

  2. The other thing might also be to not use mutableStateOf for the request parameter. I don't see it being read anywhere in the project, is it a niche case when someone might want to listen to this outside? Could this be a MutableStateFlow, which is used for everyone and used as collectAsState() composable if anyone needs listening to this?

  3. Could the blocking nature of ConstraintSizeResolver.size be related to the fact it uses Dispatchers.Main.immediate.

mlykotom avatar Apr 23 '24 09:04 mlykotom

I experimented with removing the State variables, there's some minor change in overall duration, but the FPS impact might be too noisy.

There's very minor API change.

Before:

AsyncImagePainter.onRememberedAverageMs        min   0.1,   median   0.1,   max   0.1
AsyncImagePainter.onRememberedCount            min 130.0,   median 139.0,   max 146.0
AsyncImagePainter.onRememberedMs               min   9.3,   median  12.5,   max  16.3
frameDurationCpuMs                             P50    4.1,   P90    9.1,   P95   20.0,   P99   99.3

After:

AsyncImagePainter.onRememberedAverageMs        min   0.1,   median   0.1,   max   0.1
AsyncImagePainter.onRememberedCount            min 130.0,   median 141.0,   max 148.0
AsyncImagePainter.onRememberedMs               min   8.3,   median  10.5,   max  15.8
frameDurationCpuMs                             P50    4.4,   P90    8.6,   P95   12.0,   P99   95.1

mlykotom avatar Apr 23 '24 20:04 mlykotom

I played a bit more with what could be improved and the async call within RealImageLoader.execute is only needed for Views. So if we tweak the the code to use that only for Views, we can get rid of some coroutines overhead.

Load detail image

Before (any tweaks)

ScrollBenchmark_fullCompilation
AsyncImagePainter.onRememberedAverage_µs   min   518.6,   median   671.3,   max   733.1
AsyncImagePainter.onRememberedCount        min     2.0,   median     2.0,   max     2.0
AsyncImagePainter.onRemembered_µs          min 1,037.1,   median 1,342.7,   max 1,466.3
rememberAsyncImagePainterAverage_µs        min 1,147.3,   median 1,497.2,   max 1,638.2
rememberAsyncImagePainterCount             min     1.0,   median     1.0,   max     1.0
rememberAsyncImagePainter_µs               min 1,147.3,   median 1,497.2,   max 1,638.2
timeToInitialDisplayMs                     min   217.2,   median   230.0,   max   272.5
frameDurationCpuMs                         P50      2.9,   P90     54.0,   P95     58.5,   P99     66.0
frameOverrunMs                             P50    -12.3,   P90     42.7,   P95     62.2,   P99     81.4
Traces: Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

After

ScrollBenchmark_fullCompilation
AsyncImagePainter.onRememberedAverage_µs   min   346.3,   median   402.9,   max   714.2
AsyncImagePainter.onRememberedCount        min     2.0,   median     2.0,   max     2.0
AsyncImagePainter.onRemembered_µs          min   692.5,   median   805.9,   max 1,428.5
rememberAsyncImagePainterAverage_µs        min   784.9,   median   942.1,   max 1,641.6
rememberAsyncImagePainterCount             min     1.0,   median     1.0,   max     1.0
rememberAsyncImagePainter_µs               min   784.9,   median   942.1,   max 1,641.6
timeToInitialDisplayMs                     min   198.0,   median   225.1,   max   236.8
frameDurationCpuMs                         P50      2.9,   P90     53.2,   P95     57.2,   P99     68.1
frameOverrunMs                             P50    -12.3,   P90     41.0,   P95     68.3,   P99     79.2
Traces: Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

This shows ~58% improvement for one image.

The results would be better taken with Microbenchmark, so it needs to be taken with a grain of salt.

mlykotom avatar Apr 24 '24 09:04 mlykotom

@mlykotom This is awesome, thanks! Do you want to merge this in with the scroll perf benchmark?

To improve scroll perf even more, I think if we offer an opt-in option for Compose (or maybe the default in 3.x?) we should be able to make executeMain thread agnostic (though, I think ConstraintsSizeResolve.size will still run on the UI thread). Currently, it only requires it to interact with views, lifecycles, and invoke listener methods, which aren't necessary for AsyncImage/AsyncImagePainter.

colinrtwhite avatar Apr 25 '24 04:04 colinrtwhite

Hey, sorry I was super busy this week and couldn't get to it. I will clean up the PR to be in a submittable shape.

To improve scroll perf even more, I think if we offer an opt-in option for Compose (or maybe the default in 3.x?) we should be able to make executeMain thread agnostic (though, I think ConstraintsSizeResolve.size will still run on the UI thread). Currently, it only requires it to interact with views, lifecycles, and invoke listener methods, which aren't necessary for AsyncImage/AsyncImagePainter.

I think this sounds really great to think about the threads in 3.x, I would vote for having it default, but definitelly needs some thinking.

mlykotom avatar Apr 29 '24 19:04 mlykotom

@mlykotom Thanks for all the work on this! I'll take a look through the other optimizations this weekend. Unfortunately we can't make any binary incompatible changes to the 2.x branch, but we should consider incorporating some of them into 3.x (the main branch). I'll also upstream this change to main.

colinrtwhite avatar May 10 '24 05:05 colinrtwhite