avif-native without rayon feature spawning a lot of threads
This happens on the latest nightly with the avif-native feature, default-features = false, and the rayon feature not enabled.
Expected
Images to be loaded without using additional threads.
Actual behaviour
Each time an image is loaded, 8 threads (on my system) are spawned and used to load the image.
Reproduction steps
cargo run --example decode --features=avif-native --no-default-features -- /path/to/file.avif
You can check with rust-gdb target/debug/examples/decode and then r /path/to/file.avif and you'll get a lot of threads created and destroyed messages.
For my use case, I have tons of small files (< 512 bytes), and the overhead of constant thread spawning is quite significant.
I would guess this is the default behaviour of libdav1d and thus propagated through dav1d-rs.
If you can figure out which crate specifically is spawning these threads, I'd suggest filing an issue there. I don't think image ever requests multithreaded operation for AVIF.
Breakpoint 1, 0x00007ffff7a96c18 in pthread_create () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7a96c18 in pthread_create () from /usr/lib/libc.so.6
#1 0x00007ffff7d9618e in dav1d_open () from /usr/lib/libdav1d.so.7
#2 0x000055555568a7b6 in dav1d::Decoder<dav1d::DefaultAllocator>::with_settings (settings=0x7fffffffb018) at src/lib.rs:520
#3 0x000055555568a8e9 in dav1d::Decoder<dav1d::DefaultAllocator>::new () at src/lib.rs:536
#4 0x00005555555ab870 in image::codecs::avif::decoder::AvifDecoder<std::io::buffered::bufreader::BufReader<std::fs::File>>::new<std::io::buffered::bufreader::BufReader<std::fs::File>> (r=...) at src/codecs/avif/decoder.rs:82
#5 0x00005555555ae49c in image::io::image_reader_type::ImageReader<std::io::buffered::bufreader::BufReader<std::fs::File>>::make_decoder<std::io::buffered::bufreader::BufReader<std::fs::File>> (format=..., reader=..., limits_for_png=<error reading variable: Cannot access memory at address 0x10>)
at src/io/image_reader_type.rs:181
#6 0x00005555555aea4f in image::io::image_reader_type::ImageReader<std::io::buffered::bufreader::BufReader<std::fs::File>>::decode<std::io::buffered::bufreader::BufReader<std::fs::File>> (self=...) at src/io/image_reader_type.rs:315
#7 0x00005555555a7fb2 in image::images::dynimage::open<&std::path::Path> (path=...) at src/images/dynimage.rs:1624
#8 0x000055555559c39f in decode::main () at examples/decode.rs:15
(gdb)
So image-rs is creating dav1d::Decoder with default settings. From libdav1d [reference]:
int [n_threads](https://videolan.videolan.me/dav1d/structDav1dSettings.html#a81f9045c63eebdc9bee5414b4001f92d)
number of threads (0 = number of logical cores in host system, default 0)
A small patch like:
diff --git a/src/codecs/avif/decoder.rs b/src/codecs/avif/decoder.rs
index fa7ee75c..a3c18ab0 100644
--- a/src/codecs/avif/decoder.rs
+++ b/src/codecs/avif/decoder.rs
@@ -79,7 +79,14 @@ impl<R: Read> AvifDecoder<R> {
         let ctx = read_avif(&mut r, ParseStrictness::Normal).map_err(error_map)?;
         let coded = ctx.primary_item_coded_data().unwrap_or_default();
+        #[cfg(feature = "rayon")]
         let mut primary_decoder = dav1d::Decoder::new().map_err(error_map)?;
+        #[cfg(not(feature = "rayon"))]
+        let mut primary_decoder = {
+            let mut settings = dav1d::Settings::new();
+            settings.set_n_threads(1);
+            dav1d::Decoder::with_settings( &settings )
+        }.map_err(error_map)?;
         primary_decoder
             .send_data(coded.to_vec(), None, None, None)
             .map_err(error_map)?;
This no longer triggers pthread_create, and I've confirmed it works. However, I'm not sure whether a patch like this makes sense philosophically. It might make more sense to expose the number of threads, like the AVIF encoder does, or something along those lines instead.
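To make the "expose the number of threads" alternative concrete, here is a minimal sketch of a runtime option that maps onto libdav1d's n_threads setting instead of a cfg. The names `DecodeThreads`, `AvifDecodeOptions`, `with_num_threads` and `dav1d_n_threads` are hypothetical and are not part of image or dav1d-rs; the only grounded part is that 0 means "one thread per logical core" in libdav1d, as quoted above.

```rust
use std::num::NonZeroUsize;

/// Hypothetical thread policy for the AVIF decoder (not an existing `image` API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodeThreads {
    /// Let libdav1d decide: n_threads = 0, i.e. one thread per logical core.
    Auto,
    /// Use exactly this many threads.
    Exactly(NonZeroUsize),
}

/// Hypothetical options struct a caller would pass when constructing the decoder.
#[derive(Debug, Clone, Copy)]
pub struct AvifDecodeOptions {
    threads: DecodeThreads,
}

impl Default for AvifDecodeOptions {
    fn default() -> Self {
        // Matches the current behaviour: defer to libdav1d's default.
        Self { threads: DecodeThreads::Auto }
    }
}

impl AvifDecodeOptions {
    /// Request a fixed thread count; 0 falls back to `Auto`.
    pub fn with_num_threads(mut self, n: usize) -> Self {
        self.threads = match NonZeroUsize::new(n) {
            Some(n) => DecodeThreads::Exactly(n),
            None => DecodeThreads::Auto,
        };
        self
    }

    /// Value that would be handed to `settings.set_n_threads(..)` as in the patch.
    pub fn dav1d_n_threads(&self) -> u32 {
        match self.threads {
            DecodeThreads::Auto => 0,
            // Saturate rather than wrap if the request does not fit in u32.
            DecodeThreads::Exactly(n) => u32::try_from(n.get()).unwrap_or(u32::MAX),
        }
    }
}
```

With something like this, a caller that already parallelises in its own code would ask for `with_num_threads(1)`, and everyone else keeps today's default without any feature flag being involved.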
Conflating threads and rayon doesn't really make sense. As noted in the linked issue, there is a mode of rayon that does not create threads, and using rayon also implies the existence of a thread-local or global threadpool. While we want to expose controls over maximum parallelism as a resource (both native and within a thread pool), I feel that a cfg is not the way to go about it.
We could conceivably have a feature in 1.0 (not-default-on) to enable any form of thread use, but I'm not entirely sure what the point would be. Threads are spawned, sure, but how does it affect you? 8 is not what I would call a lot, and the risk to semantics compared to allocation-oom is negligible (also, in contrast to allocators, the thread API has try semantics, so one should check whether it is mitigated better).
My issue is that I'm already loading images in parallel with rayon, and I've found that libdav1d has non-trivial overhead when decoding with multiple threads; if you get unlucky and many large images load in parallel, it can push the system into swap. Additionally, many of the images are really small, so it ends up spawning 8 or more threads to load 400 bytes, which feels like a lot of overhead. Being able to disable all image-rs threading and handle parallelism in the outer layer ends up being a simpler and cleaner solution. While not enabling the rayon feature works for almost all image formats, AVIF currently always creates as many threads as there are cores, and no way to control that behaviour is exposed. I did not have this issue when using webp/png.
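For illustration, the "outer layer owns the parallelism" setup looks roughly like the sketch below. It uses std::thread::scope rather than rayon so that it stays self-contained; `process_all` is a hypothetical helper, and the `work` closure stands in for a decode call that is assumed to be single-threaded internally.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `work` over `items` on a fixed number of worker threads and return
/// the results in input order. Hypothetical helper, not part of any crate.
pub fn process_all<T, R, F>(items: &[T], workers: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    let workers = workers.max(1);

    let mut indexed: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut out = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= items.len() {
                        break;
                    }
                    // Assumed to run entirely on this thread (e.g. a decoder
                    // configured with a single thread).
                    out.push((i, work(&items[i])));
                }
                out
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Workers finish in arbitrary order, so restore input order.
    indexed.sort_by_key(|(i, _)| *i);
    indexed.into_iter().map(|(_, r)| r).collect()
}
```

The total thread count here is bounded by `workers`. If each decode additionally spawned one thread per core, the count would grow to roughly `workers` times the number of cores, which is the situation described above.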
It sounds like a lot of the problem is that there's a bunch of threads being spawned even when the image is tiny. If we instead only spawned threads when the overhead would be minimal compared to the overall decode time, that might be sufficient?
Yes, having some better heuristic would be an improvement over the current implementation, even if it doesn't provide full control like allowing disabling threading.
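As a rough sketch of what such a heuristic could look like, the function below picks a thread count from the size of the coded AV1 payload, which is already available before the decoder is created. The function name and the 64 KiB-per-thread threshold are hypothetical placeholders rather than measured values; a real cutoff would need benchmarking.

```rust
/// Hypothetical threshold: only add another thread per this many coded bytes.
const MIN_BYTES_PER_THREAD: usize = 64 * 1024;

/// Pick a decoder thread count from the coded payload size.
/// Tiny payloads always get a single thread; large ones are capped at
/// `max_threads` (for example the number of logical cores).
pub fn threads_for_payload(coded_len: usize, max_threads: usize) -> usize {
    let wanted = (coded_len / MIN_BYTES_PER_THREAD).max(1);
    // Treat a cap of 0 as 1 so the result is always usable as a thread count.
    wanted.min(max_threads.max(1))
}
```

Under these assumptions a 400-byte file would decode on one thread, while a multi-megabyte image would still use every allowed thread.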