rav1e
Add high-bit-depth assembly to match existing 8-bit assembly
HBD is slow. This issue is a placeholder for implementing ASM to make it less slow.
See also the dav1d tracking issues for AVX2 SIMD, SSSE3 SIMD and NEON SIMD. Decoder kernels are best contributed to dav1d first, as there is rigorous testing and expert review there.
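To illustrate why the existing 8-bit assembly cannot simply be reused, here is a minimal scalar sketch of a sum-of-absolute-differences kernel that is generic over the sample type. It is not rav1e's actual code: the function name `sad` and the flat-slice block layout are assumptions made for illustration. The point is only that HBD samples occupy 16 bits each, so a SIMD kernel written for packed bytes needs a separate 16-bit variant.

```rust
/// Sum of absolute differences over two equally sized blocks of samples.
/// Generic over the sample type so the same scalar code serves 8-bit (u8)
/// and high-bit-depth (u16) input. Hypothetical helper, not rav1e's API.
fn sad<T: Copy + Into<i32>>(src: &[T], reference: &[T]) -> u32 {
    assert_eq!(src.len(), reference.len());
    src.iter()
        .zip(reference.iter())
        .map(|(&a, &b)| {
            // Widen to i32 so the subtraction cannot overflow for u8 or u16.
            let (a, b): (i32, i32) = (a.into(), b.into());
            (a - b).unsigned_abs()
        })
        .sum()
}

fn main() {
    // 8-bit samples: one byte each.
    let a8 = [16u8, 128, 235, 64];
    let b8 = [18u8, 120, 235, 60];
    println!("8-bit SAD:  {}", sad(&a8, &b8));

    // 10-bit samples: stored in u16, valid range 0..=1023.
    let a10 = [64u16, 512, 940, 256];
    let b10 = [72u16, 480, 940, 240];
    println!("10-bit SAD: {}", sad(&a10, &b10));
}
```

A vectorized version would process half as many 16-bit samples per register as 8-bit ones, which is one reason HBD kernels are written and tuned separately.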
To help prioritize the many steps required to advance HBD performance, it would be informative to have at least one perf trace of a short HBD encoding session.
Top 10 most time-consuming functions, from a 720p 4:2:0 10-bit test clip (10 frames):
Speed 6:
Function / Call Stack | CPU Time | Module | Function (Full) | Source File | Start Address |
---|---|---|---|---|---|
rav1e::mc::native::put_8tap::h8d763cc7981fb980 | 3.140s | rav1e | rav1e::mc::native::put_8tap::h8d763cc7981fb980(void) | mc.rs | 0x5d8c0 |
rav1e_sad_16x16_hbd_ssse3.loop | 1.770s | rav1e | rav1e_sad_16x16_hbd_ssse3.loop | [Unknown] | 0x39066a |
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d | 1.320s | rav1e | rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void) | inverse.rs | 0x285a00 |
rav1e::transform::inverse_transform_add::h0533325d37ee0796 | 1.280s | rav1e | rav1e::transform::inverse_transform_add::h0533325d37ee0796(void) | mod.rs | 0x175700 |
rav1e::transform::forward_transform::hf86b2ea599d15347 | 1.250s | rav1e | rav1e::transform::forward_transform::hf86b2ea599d15347(void) | mod.rs | 0x295130 |
rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0 | 1.240s | rav1e | rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0(void) | mc.rs | 0x5f200 |
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3 | 1.210s | rav1e | rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void) | rdo.rs | 0x90b40 |
rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4 | 1.190s | rav1e | rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4(void) | lrf.rs | 0x8dc20 |
rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc | 1.080s | rav1e | rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc(void) | rdo.rs | 0x962f0 |
rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a | 0.920s | rav1e | rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a(void) | dist.rs | 0xb4020 |
Speed 2:
Function / Call Stack | CPU Time | Module | Function (Full) | Source File | Start Address |
---|---|---|---|---|---|
rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85 | 8.130s | rav1e | rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85(void) | context.rs | 0x2a4f70 |
rav1e::transform::forward_transform::hf86b2ea599d15347 | 6.100s | rav1e | rav1e::transform::forward_transform::hf86b2ea599d15347(void) | mod.rs | 0x295130 |
rav1e::transform::inverse_transform_add::h0533325d37ee0796 | 4.800s | rav1e | rav1e::transform::inverse_transform_add::h0533325d37ee0796(void) | mod.rs | 0x175700 |
rav1e::mc::native::put_8tap::h8d763cc7981fb980 | 4.310s | rav1e | rav1e::mc::native::put_8tap::h8d763cc7981fb980(void) | mc.rs | 0x5d8c0 |
rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5 | 3.710s | rav1e | rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5(void) | quantize.rs | 0x1737c0 |
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3 | 3.490s | rav1e | rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void) | rdo.rs | 0x90b40 |
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d | 2.491s | rav1e | rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void) | inverse.rs | 0x285a00 |
_$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e | 2.460s | rav1e | _$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e(void) | ec.rs | 0x1a7ed0 |
rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5 | 2.240s | rav1e | rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5(void) | encoder.rs | 0xd1ce0 |
rav1e_sad_16x16_hbd_ssse3.loop | 1.790s | rav1e | rav1e_sad_16x16_hbd_ssse3.loop | [Unknown] | 0x39066a |
I've discovered that uftrace generally seems to be more accurate than Intel VTune. Here's a speed 6 trace of the top 10 functions by self time from current master:
Total time Self time Calls Function
========== ========== ========== ====================
4.018 m 4.018 m 8350 linux:schedule // this is the total encode time
31.682 s 31.673 s 6207461 rav1e::transform::inverse_transform_add
27.967 s 27.958 s 66985185 rav1e::dist::native::get_satd
1.013 m 27.028 s 6212387 rav1e::encoder::encode_tx_block
26.350 s 26.342 s 11944802 rav1e::mc::native::put_8tap
26.249 s 18.917 s 3626645 rav1e::rdo::compute_distortion
10.724 s 10.721 s 6825792 rav1e::mc::native::prep_8tap
35.364 s 9.573 s 14400 rav1e::rdo::rdo_loop_decision
9.030 s 9.027 s 166607105 _<rav1e..ec..WriterBase<S> as rav1e..ec..Writer>::symbol_with_update
8.895 s 8.894 s 39120 rav1e::me::full_search
12.086 s 8.348 s 636259 rav1e::lrf::sgrproj_stripe_filter
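To turn a report like the one above into priorities, it can help to express each function's self time as a share of the total encode time (the `linux:schedule` row). The following is a minimal sketch under stated assumptions: the helper names `to_seconds` and `self_share` are hypothetical, and only the two units that appear in the report (`s` and `m`) are handled.

```rust
/// Convert a duration as printed in the report to seconds.
/// Only "s" (seconds) and "m" (minutes) are handled, since those are the
/// only units that appear in the trace above; anything else yields None.
fn to_seconds(value: f64, unit: &str) -> Option<f64> {
    match unit {
        "s" => Some(value),
        "m" => Some(value * 60.0),
        _ => None,
    }
}

/// Fraction of the total run spent in a function's own code.
fn self_share(self_time: (f64, &str), total: (f64, &str)) -> Option<f64> {
    let s = to_seconds(self_time.0, self_time.1)?;
    let t = to_seconds(total.0, total.1)?;
    if t > 0.0 { Some(s / t) } else { None }
}

fn main() {
    // Total encode time, taken from the linux:schedule row.
    let total = (4.018, "m");
    // (function, self time) pairs copied from the report.
    let rows = [
        ("inverse_transform_add", (31.673, "s")),
        ("get_satd", (27.958, "s")),
        ("encode_tx_block", (27.028, "s")),
        ("put_8tap", (26.342, "s")),
    ];
    for (name, self_time) in rows.iter() {
        if let Some(share) = self_share(*self_time, total) {
            println!("{:<24} {:5.1}%", name, share * 100.0);
        }
    }
}
```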
Please add the command you used to produce it.
cargo-flamegraph works almost decently:
$ cargo flamegraph -b rav1e -- ~/Samples/Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m --tiles 32 -o /dev/null
perf keeps pointing out that we are doing a strange number of memmoves. They are possibly related to the rollback mechanism. I should see if I can find a sane way to trace them, either by extending memory-profiler or by using callgrind.
Ah, sure, here is what I used:
$ uftrace record --no-libcall -- target/release/rav1e ~/Downloads/KristenAndSara_42010p.y4m -o /dev/null
$ uftrace report --sort self
Did a quick test on 20 1080p frames from the same video, in 8-bit and 10-bit versions. This is the current performance delta:
Speed | 8-bit FPS | 10-bit FPS | Ratio (8-bit / 10-bit) |
---|---|---|---|
s 10 | 1.236 | 0.282 | 4.38x |
s 8 | 0.459 | 0.170 | 2.70x |
s 6 | 0.448 | 0.163 | 2.75x |
s 4 | 0.199 | 0.072 | 2.76x |
s 2 | 0.179 | 0.057 | 3.14x |
rav1e 1955d6d, i5-4590
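The last column of the table is the 8-bit frame rate divided by the 10-bit frame rate. A short sketch that recomputes it from the measured FPS values follows; the function name `slowdown` is a hypothetical helper, not part of rav1e.

```rust
/// How many times slower the 10-bit encode is than the 8-bit encode,
/// given both frame rates in frames per second.
fn slowdown(fps_8bit: f64, fps_10bit: f64) -> f64 {
    fps_8bit / fps_10bit
}

fn main() {
    // (speed level, 8-bit FPS, 10-bit FPS), copied from the table above.
    let rows = [
        (10, 1.236, 0.282),
        (8, 0.459, 0.170),
        (6, 0.448, 0.163),
        (4, 0.199, 0.072),
        (2, 0.179, 0.057),
    ];
    for &(speed, f8, f10) in rows.iter() {
        println!("s {:>2}: {:.2}x", speed, slowdown(f8, f10));
    }
}
```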
It's been a year since we checked in on this. Here are the current top-priority functions for potential HBD ASM, as tested at various speeds on an i7-9700. Percentages are out of total CPU cycles.
Speed 10 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 10):
- 34% - inverse_transform_add
- 14% - av1_idct*
- 6.5% - cdef_dist_wxh_8x8
Speed 5 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 5):
- 20% - inverse_transform_add
- 10% - av1_idct*
- 8% - sgrproj_box_ab_r1
- 5.5% - cdef_dist_wxh_8x8
Speed 0 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 0 --limit 10):
- 28% - inverse_transform_add
- 10% - av1_idct*
- 4.5% - cdef_dist_wxh_8x8
At this time, cdef_dist_wxh_8x8 or its components likely top the list.
Most if not all of these should be addressed by now. If there are any remaining, we can create individual issues for them.