coil
coil copied to clipboard
Improve RealImageLoader.execute to properly call async request
I was investigating the performance of rememberAsyncImagePainter and found out the AsyncImagePainter.onRemembered is the longest call inside ~0.25ms for each image.
When I was investigating more, I think the reason of this is that the RealImageLoader.execute request uses async call with dispatcher Main.immediate, but the coroutine is already called with Main.immediate. If I understand things correclty, this can't be called in parallel and therefore is called sequentially, hence the duration.
This should improve #1866 .
before
ScrollBenchmark_fullCompilation
coil.compose.rememberAsyncImagePainter%AverageMs min 0,5, median 0,6, max 0,7
coil.compose.rememberAsyncImagePainter%Count min 156,0, median 173,0, max 190,0
coil.compose.rememberAsyncImagePainter%Ms min 81,0, median 99,9, max 128,2
timeToInitialDisplayMs min 525,2, median 547,1, max 579,6
frameDurationCpuMs P50 10,5, P90 31,0, P95 42,1, P99 315,3
frameOverrunMs P50 -7,5, P90 17,1, P95 63,4, P99 1 170,9
Traces: Iteration 0 1 2 3 4 5 6 7 8 9
after
ScrollBenchmark_fullCompilation
coil.compose.rememberAsyncImagePainter%AverageMs min 0,3, median 0,4, max 0,5
coil.compose.rememberAsyncImagePainter%Count min 154,0, median 174,0, max 184,0
coil.compose.rememberAsyncImagePainter%Ms min 51,4, median 70,6, max 83,1
timeToInitialDisplayMs min 521,4, median 563,9, max 605,9
frameDurationCpuMs P50 9,6, P90 30,7, P95 45,1, P99 301,4
frameOverrunMs P50 -8,1, P90 22,8, P95 52,3, P99 1 172,8
Traces: Iteration 0 1 2 3 4 5 6 7 8 9
From the benchmarks, I can see that the frame duration was improved slightly for the scroll.
before : frameDurationCpuMs P50 10,5, P90 31,0, P95 42,1, P99 315,3
after : frameDurationCpuMs P50 9,6, (-9%) P90 30,7, (-1%) P95 45,1, (+6%) P99 301,4 (-5%)
This solution is probably not sufficient as the dispatcher probably needs to be passed, or called somewhere else, or even this might be done differently all together.
I have experimented with this in 2.x branch, but I see a similar code is in 3.x.
Thanks for taking a look at this! Though I'm a bit confused - won't async(Dispatchers.Main.immediate) always run sequentially (even if wrapped with withContext(Dispatchers.Default)) since Dispatchers.Main.immediate is a single threaded dispatcher?
Unfortunately I don't think we can use an non-main thread dispatcher here as it breaks returning images from the memory cache synchronously - this is the failing test.
Do you know if RealImageLoader.executeMain is taking up the bulk of the AsyncImagePainter.onRemembered time? There are some things that Coil needs to run on the main thread, but I'm curious if that's what's slowing it down at the moment or if it's something else.
I sort of expected something to break, but I was a bit in a rush and wanted to start the conversation going 😇
I pushed my wip code with experiments (ScrollBenchmark) that can produce the system trace and a method trace.
Here's the breakdown of the method trace
You can run the benchmark as well and get the method trace for analysis. Given it's a method trace, we don't look at the timing information, because everything is skewed, but more on the ratio between sections.
Okay, I tried a bit playing with this if something could be triggered earlier, but I'm not able to solve the memory cache blocking request. I thought there could be something running in parallel with the async call and await it at the apropriate time to basically have something like "if not ready, block". But I don't understand the details of when things are called.
Interesting thanks for digging in! Looks like there are two main avenues to improve perf like you called out:
- Speed up
scope.launchinAsyncImagePainter.onRemembered: I'm not sure what we can do here given the bulk of the time looks to be coroutines library-related. The result of the code insidescope.launchshould be pretty quick. - Don't block when
ConstraintSizeResolver.sizeis called: I'm surprised that this is a large percentage of the runtime givenSizeResolver.sizeshould suspend instead of block while waiting for non-zero constraints. Do you know what could be the issue?
-
Part of the
scope.launchinonRememberedis because of reading the state, which happens twice -requestandimageLoader. I'm thinking whether theimageLoaderhas to be as amutableStateOf. I don't see this used for compose directly, so this might be a small optimization. -
The other thing might also be to not use
mutableStateOffor therequestparameter. I don't see it being read anywhere in the project, is it a niche case when someone might want to listen to this outside? Could this be aMutableStateFlow, which is used for everyone and used ascollectAsState()composable if anyone needs listening to this? -
Could the blocking nature of
ConstraintSizeResolver.sizebe related to the fact it usesDispatchers.Main.immediate.
I experimented with removing the State variables, there's some minor change in overall duration, but the FPS impact might be too noisy.
There's very minor API change.
Before:
AsyncImagePainter.onRememberedAverageMs min 0.1, median 0.1, max 0.1
AsyncImagePainter.onRememberedCount min 130.0, median 139.0, max 146.0
AsyncImagePainter.onRememberedMs min 9.3, median 12.5, max 16.3
frameDurationCpuMs P50 4.1, P90 9.1, P95 20.0, P99 99.3
After:
AsyncImagePainter.onRememberedAverageMs min 0.1, median 0.1, max 0.1
AsyncImagePainter.onRememberedCount min 130.0, median 141.0, max 148.0
AsyncImagePainter.onRememberedMs min 8.3, median 10.5, max 15.8
frameDurationCpuMs P50 4.4, P90 8.6, P95 12.0, P99 95.1
I played a bit more with what could be improved and the async call within RealImageLoader.execute is only needed for Views. So if we tweak the the code to use that only for Views, we can get rid of some coroutines overhead.
Load detail image
Before (any tweaks)
ScrollBenchmark_fullCompilation
AsyncImagePainter.onRememberedAverage_µs min 518.6, median 671.3, max 733.1
AsyncImagePainter.onRememberedCount min 2.0, median 2.0, max 2.0
AsyncImagePainter.onRemembered_µs min 1,037.1, median 1,342.7, max 1,466.3
rememberAsyncImagePainterAverage_µs min 1,147.3, median 1,497.2, max 1,638.2
rememberAsyncImagePainterCount min 1.0, median 1.0, max 1.0
rememberAsyncImagePainter_µs min 1,147.3, median 1,497.2, max 1,638.2
timeToInitialDisplayMs min 217.2, median 230.0, max 272.5
frameDurationCpuMs P50 2.9, P90 54.0, P95 58.5, P99 66.0
frameOverrunMs P50 -12.3, P90 42.7, P95 62.2, P99 81.4
Traces: Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
After
ScrollBenchmark_fullCompilation
AsyncImagePainter.onRememberedAverage_µs min 346.3, median 402.9, max 714.2
AsyncImagePainter.onRememberedCount min 2.0, median 2.0, max 2.0
AsyncImagePainter.onRemembered_µs min 692.5, median 805.9, max 1,428.5
rememberAsyncImagePainterAverage_µs min 784.9, median 942.1, max 1,641.6
rememberAsyncImagePainterCount min 1.0, median 1.0, max 1.0
rememberAsyncImagePainter_µs min 784.9, median 942.1, max 1,641.6
timeToInitialDisplayMs min 198.0, median 225.1, max 236.8
frameDurationCpuMs P50 2.9, P90 53.2, P95 57.2, P99 68.1
frameOverrunMs P50 -12.3, P90 41.0, P95 68.3, P99 79.2
Traces: Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
This shows ~58% improvement for one image.
The results would be better taken with Microbenchmark, so it needs to be taken with a grain of salt.
@mlykotom This is awesome, thanks! Do you want to merge this in with the scroll perf benchmark?
To improve scroll perf even more, I think if we offer an opt-in option for Compose (or maybe the default in 3.x?) we should be able to make executeMain thread agnostic (though, I think ConstraintsSizeResolve.size will still run on the UI thread). Currently, it only requires it to interact with views, lifecycles, and invoke listener methods, which aren't necessary for AsyncImage/AsyncImagePainter.
Hey, sorry I was super busy this week and couldn't get to it. I will clean up the PR to be in a submittable shape.
To improve scroll perf even more, I think if we offer an opt-in option for Compose (or maybe the default in 3.x?) we should be able to make executeMain thread agnostic (though, I think ConstraintsSizeResolve.size will still run on the UI thread). Currently, it only requires it to interact with views, lifecycles, and invoke listener methods, which aren't necessary for AsyncImage/AsyncImagePainter.
I think this sounds really great to think about the threads in 3.x, I would vote for having it default, but definitelly needs some thinking.
@mlykotom Thanks for all the work on this! I'll take a look through the other optimizations this weekend. Unfortunately we can't make any binary incompatible changes to the 2.x branch, but we should consider incorporating some of them into 3.x (the main branch). I'll also upstream this change to main.