Add high-bit-depth assembly to match existing 8-bit assembly

Open shssoichiro opened this issue 4 years ago • 8 comments

HBD is slow. This issue is a placeholder for implementing ASM to make it less slow.

shssoichiro avatar Oct 25 '19 01:10 shssoichiro

See also the dav1d tracking issues for AVX2 SIMD, SSSE3 SIMD and NEON SIMD. Decoder kernels are best contributed first to dav1d as there is rigorous testing and expert review there.

barrbrain avatar Oct 25 '19 12:10 barrbrain

To help prioritize the many steps required to advance HBD performance, it would be informative to have at least one perf trace of a short HBD encoding session.

barrbrain avatar Nov 14 '19 05:11 barrbrain

Top 10 most time-consuming functions, from a 720p 4:2:0 10-bit test clip (10 frames):

Speed 6:

Function / Call Stack | CPU Time | Module | Function (Full) | Source File | Start Address
rav1e::mc::native::put_8tap::h8d763cc7981fb980 | 3.140s | rav1e | rav1e::mc::native::put_8tap::h8d763cc7981fb980(void) | mc.rs | 0x5d8c0
rav1e_sad_16x16_hbd_ssse3.loop | 1.770s | rav1e | rav1e_sad_16x16_hbd_ssse3.loop | [Unknown] | 0x39066a
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d | 1.320s | rav1e | rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void) | inverse.rs | 0x285a00
rav1e::transform::inverse_transform_add::h0533325d37ee0796 | 1.280s | rav1e | rav1e::transform::inverse_transform_add::h0533325d37ee0796(void) | mod.rs | 0x175700
rav1e::transform::forward_transform::hf86b2ea599d15347 | 1.250s | rav1e | rav1e::transform::forward_transform::hf86b2ea599d15347(void) | mod.rs | 0x295130
rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0 | 1.240s | rav1e | rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0(void) | mc.rs | 0x5f200
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3 | 1.210s | rav1e | rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void) | rdo.rs | 0x90b40
rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4 | 1.190s | rav1e | rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4(void) | lrf.rs | 0x8dc20
rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc | 1.080s | rav1e | rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc(void) | rdo.rs | 0x962f0
rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a | 0.920s | rav1e | rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a(void) | dist.rs | 0xb4020

Speed 2:

Function / Call Stack | CPU Time | Module | Function (Full) | Source File | Start Address
rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85 | 8.130s | rav1e | rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85(void) | context.rs | 0x2a4f70
rav1e::transform::forward_transform::hf86b2ea599d15347 | 6.100s | rav1e | rav1e::transform::forward_transform::hf86b2ea599d15347(void) | mod.rs | 0x295130
rav1e::transform::inverse_transform_add::h0533325d37ee0796 | 4.800s | rav1e | rav1e::transform::inverse_transform_add::h0533325d37ee0796(void) | mod.rs | 0x175700
rav1e::mc::native::put_8tap::h8d763cc7981fb980 | 4.310s | rav1e | rav1e::mc::native::put_8tap::h8d763cc7981fb980(void) | mc.rs | 0x5d8c0
rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5 | 3.710s | rav1e | rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5(void) | quantize.rs | 0x1737c0
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3 | 3.490s | rav1e | rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void) | rdo.rs | 0x90b40
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d | 2.491s | rav1e | rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void) | inverse.rs | 0x285a00
_$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e | 2.460s | rav1e | _$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e(void) | ec.rs | 0x1a7ed0
rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5 | 2.240s | rav1e | rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5(void) | encoder.rs | 0xd1ce0
rav1e_sad_16x16_hbd_ssse3.loop | 1.790s | rav1e | rav1e_sad_16x16_hbd_ssse3.loop | [Unknown] | 0x39066a
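
For readers unfamiliar with what such kernels do, here is a minimal scalar sketch of a sum of absolute differences (SAD) over 16-bit samples, the kind of per-pixel loop that a routine like rav1e_sad_16x16_hbd_ssse3 vectorizes. The function name `sad_u16` and its signature are illustrative assumptions, not rav1e's actual code.

```rust
/// Scalar SAD between two blocks of 16-bit samples (illustrative only).
/// Strides are measured in samples, not bytes.
fn sad_u16(
    src: &[u16],
    src_stride: usize,
    reference: &[u16],
    ref_stride: usize,
    width: usize,
    height: usize,
) -> u32 {
    let mut sum = 0u32;
    for y in 0..height {
        // Take one row of `width` samples from each block.
        let s = &src[y * src_stride..][..width];
        let r = &reference[y * ref_stride..][..width];
        for (&a, &b) in s.iter().zip(r) {
            // Widen to u32 so the accumulator cannot overflow for 10/12-bit input.
            sum += u32::from(a.abs_diff(b));
        }
    }
    sum
}

fn main() {
    // Two flat 16x16 blocks of 10-bit samples that differ by 12 per sample.
    let src = vec![512u16; 16 * 16];
    let reference = vec![500u16; 16 * 16];
    println!("SAD = {}", sad_u16(&src, 16, &reference, 16, 16, 16));
}
```

A SIMD version processes many samples per instruction instead of one per loop iteration, which is where the speedup from assembly would come from.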

shssoichiro avatar Nov 14 '19 08:11 shssoichiro

I've discovered uftrace seems to generally be more accurate than Intel VTune. Here's a speed 6 trace of top 10 functions by self time from current master:

  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
    4.018  m    4.018  m        8350  linux:schedule // this is the total encode time
   31.682  s   31.673  s     6207461  rav1e::transform::inverse_transform_add
   27.967  s   27.958  s    66985185  rav1e::dist::native::get_satd
    1.013  m   27.028  s     6212387  rav1e::encoder::encode_tx_block
   26.350  s   26.342  s    11944802  rav1e::mc::native::put_8tap
   26.249  s   18.917  s     3626645  rav1e::rdo::compute_distortion
   10.724  s   10.721  s     6825792  rav1e::mc::native::prep_8tap
   35.364  s    9.573  s       14400  rav1e::rdo::rdo_loop_decision
    9.030  s    9.027  s   166607105  _<rav1e..ec..WriterBase<S> as rav1e..ec..Writer>::symbol_with_update
    8.895  s    8.894  s       39120  rav1e::me::full_search
   12.086  s    8.348  s      636259  rav1e::lrf::sgrproj_stripe_filter
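
Note that the report mixes minutes and seconds, so rows are easier to compare once converted to a single unit. Here is a rough sketch of doing that, assuming each row has the layout shown above (total value, unit, self value, unit, call count, function name) and that the units are limited to us, ms, s, m and h. The `Row` type and helper names are hypothetical.

```rust
/// One parsed report row, with both durations normalized to seconds.
struct Row {
    total_s: f64,
    self_s: f64,
    calls: u64,
    function: String,
}

/// Convert a duration value and its unit suffix to seconds.
fn to_seconds(value: f64, unit: &str) -> Option<f64> {
    match unit {
        "us" => Some(value * 1e-6),
        "ms" => Some(value * 1e-3),
        "s" => Some(value),
        "m" => Some(value * 60.0),
        "h" => Some(value * 3600.0),
        _ => None,
    }
}

/// Parse one data row; header and separator lines yield `None`.
fn parse_row(line: &str) -> Option<Row> {
    let mut fields = line.split_whitespace();
    let total = fields.next()?.parse::<f64>().ok()?;
    let total_unit = fields.next()?;
    let self_time = fields.next()?.parse::<f64>().ok()?;
    let self_unit = fields.next()?;
    let calls = fields.next()?.parse::<u64>().ok()?;
    // Function names may contain spaces, so rejoin the remaining fields.
    let function = fields.collect::<Vec<_>>().join(" ");
    if function.is_empty() {
        return None;
    }
    Some(Row {
        total_s: to_seconds(total, total_unit)?,
        self_s: to_seconds(self_time, self_unit)?,
        calls,
        function,
    })
}

fn main() {
    let report = "\
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
   31.682  s   31.673  s     6207461  rav1e::transform::inverse_transform_add
    1.013  m   27.028  s     6212387  rav1e::encoder::encode_tx_block
   35.364  s    9.573  s       14400  rav1e::rdo::rdo_loop_decision";

    let mut rows: Vec<Row> = report.lines().filter_map(parse_row).collect();
    // Largest self time first, matching `--sort self`.
    rows.sort_by(|a, b| b.self_s.total_cmp(&a.self_s));
    for r in &rows {
        println!("{:8.3}s self {:8.3}s total {:>9} calls  {}", r.self_s, r.total_s, r.calls, r.function);
    }
}
```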

shssoichiro avatar Dec 25 '19 06:12 shssoichiro

Please add the command you used to produce it.

cargo-flamegraph works almost decently:

$ cargo flamegraph -b rav1e -- ~/Samples/Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m --tiles 32 -o /dev/null

perf keeps pointing out that we are doing a strange amount of memmoves. They are possibly related to the rollback mechanism (I should see whether I can find a sane way to trace them by extending memory-profiler or using callgrind).

lu-zero avatar Dec 25 '19 14:12 lu-zero

Ah, sure, here is what I used:

$ uftrace record --no-libcall -- target/release/rav1e ~/Downloads/KristenAndSara_42010p.y4m -o /dev/null
$ uftrace report --sort self

shssoichiro avatar Dec 25 '19 14:12 shssoichiro

Did a quick test on 20 1080p frames from the same video, in an 8-bit and a 10-bit version. This is the current performance delta:

Speed   8-bit FPS   10-bit FPS   Slowdown
s 10    1.236       0.282        4.38x
s 8     0.459       0.170        2.70x
s 6     0.448       0.163        2.75x
s 4     0.199       0.072        2.76x
s 2     0.179       0.057        3.14x

rav1e 1955d6d, i5-4590
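
The last column is the 8-bit frame rate divided by the 10-bit frame rate. As a small worked check, the sketch below recomputes it from the FPS figures above; the helper name `slowdown` is only illustrative.

```rust
/// Ratio of 8-bit to 10-bit throughput; values above 1.0 mean 10-bit is slower.
fn slowdown(fps_8bit: f64, fps_10bit: f64) -> f64 {
    fps_8bit / fps_10bit
}

fn main() {
    // (speed level, 8-bit FPS, 10-bit FPS) as reported in the table above.
    let rows = [
        (10, 1.236, 0.282),
        (8, 0.459, 0.170),
        (6, 0.448, 0.163),
        (4, 0.199, 0.072),
        (2, 0.179, 0.057),
    ];
    for (speed, fps8, fps10) in rows {
        println!("s {speed}: {:.2}x", slowdown(fps8, fps10));
    }
}
```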

EwoutH avatar Feb 13 '20 20:02 EwoutH

It's been a year since we checked in on this. Here are the current top-priority functions for potential HBD ASM, as tested at various speeds on an i7-9700. Percentages are of total CPU cycles.


Speed 10 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 10):

34% - inverse_transform_add
14% - av1_idct*
6.5% - cdef_dist_wxh_8x8


Speed 5 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 5):

20% - inverse_transform_add
10% - av1_idct*
8% - sgrproj_box_ab_r1
5.5% - cdef_dist_wxh_8x8


Speed 0 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 0 --limit 10):

28% - inverse_transform_add
10% - av1_idct*
4.5% - cdef_dist_wxh_8x8
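
To get a feel for what these shares mean for prioritization, Amdahl's law gives the whole-encode speedup when one function covering a fraction p of cycles is made k times faster: 1 / ((1 - p) + p / k). The sketch below applies it to the inverse_transform_add shares listed above. The 4x kernel speedup is a purely hypothetical figure chosen for illustration, not a measurement.

```rust
/// Amdahl's law: overall speedup when a fraction `p` of the runtime
/// is accelerated by a factor `k`.
fn overall_speedup(p: f64, k: f64) -> f64 {
    1.0 / ((1.0 - p) + p / k)
}

fn main() {
    // (speed level, share of cycles in inverse_transform_add) from the lists above.
    let shares = [(10, 0.34), (5, 0.20), (0, 0.28)];
    // Hypothetical kernel speedup from assembly; an assumption, not a measured value.
    let k = 4.0;
    for (speed, p) in shares {
        println!(
            "speed {speed}: {:.2}x overall at {k}x kernel speedup, {:.2}x upper bound",
            overall_speedup(p, k),
            // An infinitely fast kernel gives the ceiling 1 / (1 - p).
            overall_speedup(p, f64::INFINITY),
        );
    }
}
```

Even an infinitely fast kernel cannot beat 1 / (1 - p), so the function with the largest cycle share sets the ceiling on what a single piece of assembly can buy.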

shssoichiro avatar Mar 30 '21 22:03 shssoichiro

At this time, cdef_dist_wxh_8x8 or its components likely top the list.

barrbrain avatar Dec 04 '22 07:12 barrbrain

Most if not all of these should be addressed by now. If there are any remaining, we can create individual issues for them.

shssoichiro avatar Jul 07 '23 19:07 shssoichiro