rav1e Add high-bit-depth assembly to match existing 8-bit assembly

Open shssoichiro opened this issue 4 years ago • 8 comments

HBD (high-bit-depth) encoding is slow. This issue is a placeholder for implementing ASM to make it less slow.

Oct 25 '19 01:10 shssoichiro

See also the dav1d tracking issues for AVX2 SIMD, SSSE3 SIMD and NEON SIMD. Decoder kernels are best contributed to dav1d first, as there is rigorous testing and expert review there.

Oct 25 '19 12:10 barrbrain

To help prioritize the many steps required to advance HBD performance, it would be informative to have at least one perf trace of a short HBD encoding session.

Nov 14 '19 05:11 barrbrain

Top 10 most time-consuming functions, from a 720p 4:2:0 10-bit test clip (10 frames):

Speed 6:


Function / Call Stack	CPU Time	Module	Function (Full)	Source File	Start Address
rav1e::mc::native::put_8tap::h8d763cc7981fb980	3.140s	rav1e	rav1e::mc::native::put_8tap::h8d763cc7981fb980(void)	mc.rs	0x5d8c0
rav1e_sad_16x16_hbd_ssse3.loop	1.770s	rav1e	rav1e_sad_16x16_hbd_ssse3.loop	[Unknown]	0x39066a
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d	1.320s	rav1e	rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void)	inverse.rs	0x285a00
rav1e::transform::inverse_transform_add::h0533325d37ee0796	1.280s	rav1e	rav1e::transform::inverse_transform_add::h0533325d37ee0796(void)	mod.rs	0x175700
rav1e::transform::forward_transform::hf86b2ea599d15347	1.250s	rav1e	rav1e::transform::forward_transform::hf86b2ea599d15347(void)	mod.rs	0x295130
rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0	1.240s	rav1e	rav1e::mc::native::prep_8tap::h0e99f8e17376d2a0(void)	mc.rs	0x5f200
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3	1.210s	rav1e	rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void)	rdo.rs	0x90b40
rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4	1.190s	rav1e	rav1e::lrf::sgrproj_stripe_filter::h69becf16919d36b4(void)	lrf.rs	0x8dc20
rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc	1.080s	rav1e	rav1e::rdo::rdo_loop_decision::hcbedfc3024f853cc(void)	rdo.rs	0x962f0
rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a	0.920s	rav1e	rav1e::dist::native::get_satd::he0d6c3dc8cd0d43a(void)	dist.rs	0xb4020

Speed 2:


Function / Call Stack	CPU Time	Module	Function (Full)	Source File	Start Address
rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85	8.130s	rav1e	rav1e::context::ContextWriter::write_coeffs_lv_map::hd551ab0e36c71e85(void)	context.rs	0x2a4f70
rav1e::transform::forward_transform::hf86b2ea599d15347	6.100s	rav1e	rav1e::transform::forward_transform::hf86b2ea599d15347(void)	mod.rs	0x295130
rav1e::transform::inverse_transform_add::h0533325d37ee0796	4.800s	rav1e	rav1e::transform::inverse_transform_add::h0533325d37ee0796(void)	mod.rs	0x175700
rav1e::mc::native::put_8tap::h8d763cc7981fb980	4.310s	rav1e	rav1e::mc::native::put_8tap::h8d763cc7981fb980(void)	mc.rs	0x5d8c0
rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5	3.710s	rav1e	rav1e::quantize::QuantizationContext::quantize::h067499ad062589d5(void)	quantize.rs	0x1737c0
rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3	3.490s	rav1e	rav1e::rdo::cdef_dist_wxh_8x8::h7626b3a672a364f3(void)	rdo.rs	0x90b40
rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d	2.491s	rav1e	rav1e::transform::inverse::av1_idct32::h5c9018c52454bf2d(void)	inverse.rs	0x285a00
_$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e	2.460s	rav1e	_$LT$rav1e..ec..WriterBase$LT$S$GT$$u20$as$u20$rav1e..ec..Writer$GT$::symbol_with_update::hdf6cb18c2114d03e(void)	ec.rs	0x1a7ed0
rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5	2.240s	rav1e	rav1e::encoder::encode_tx_block::h1fe2a2d7f7ac92b5(void)	encoder.rs	0xd1ce0
rav1e_sad_16x16_hbd_ssse3.loop	1.790s	rav1e	rav1e_sad_16x16_hbd_ssse3.loop	[Unknown]	0x39066a

Nov 14 '19 08:11 shssoichiro

I've found that uftrace generally seems to be more accurate than Intel VTune. Here's a speed 6 trace of the top 10 functions by self time from current master:

  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
    4.018  m    4.018  m        8350  linux:schedule // this is the total encode time
   31.682  s   31.673  s     6207461  rav1e::transform::inverse_transform_add
   27.967  s   27.958  s    66985185  rav1e::dist::native::get_satd
    1.013  m   27.028  s     6212387  rav1e::encoder::encode_tx_block
   26.350  s   26.342  s    11944802  rav1e::mc::native::put_8tap
   26.249  s   18.917  s     3626645  rav1e::rdo::compute_distortion
   10.724  s   10.721  s     6825792  rav1e::mc::native::prep_8tap
   35.364  s    9.573  s       14400  rav1e::rdo::rdo_loop_decision
    9.030  s    9.027  s   166607105  _<rav1e..ec..WriterBase<S> as rav1e..ec..Writer>::symbol_with_update
    8.895  s    8.894  s       39120  rav1e::me::full_search
   12.086  s    8.348  s      636259  rav1e::lrf::sgrproj_stripe_filter
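
To compare rows that mix minutes and seconds, a report like the one above can be normalized to seconds. The following is a minimal sketch, not part of uftrace: the `parse_report` helper and the unit table are assumptions for illustration, only a few rows from the report are included, and it takes the `linux:schedule` row as total encode time, as the inline comment says.

```python
import re

# Assumed unit suffixes in the report; only "m" and "s" appear above.
UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}

# <total> <unit> <self> <unit> <calls> <function name>
ROW = re.compile(
    r"^\s*([\d.]+)\s+(us|ms|s|m|h)\s+([\d.]+)\s+(us|ms|s|m|h)\s+(\d+)\s+(.+?)\s*$"
)

# A few rows copied from the report above.
REPORT = """\
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
    4.018  m    4.018  m        8350  linux:schedule // this is the total encode time
   31.682  s   31.673  s     6207461  rav1e::transform::inverse_transform_add
   27.967  s   27.958  s    66985185  rav1e::dist::native::get_satd
    1.013  m   27.028  s     6212387  rav1e::encoder::encode_tx_block
"""


def parse_report(text):
    """Parse report rows into dicts with times converted to seconds."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skips the header and the ruler line
        total, total_unit, self_time, self_unit, calls, name = match.groups()
        rows.append({
            "name": name.split(" //")[0].strip(),  # drop the inline comment
            "total_s": float(total) * UNIT_SECONDS[total_unit],
            "self_s": float(self_time) * UNIT_SECONDS[self_unit],
            "calls": int(calls),
        })
    return rows


rows = parse_report(REPORT)
encode_s = next(r["self_s"] for r in rows if r["name"] == "linux:schedule")
shares = {
    r["name"]: r["self_s"] / encode_s
    for r in rows
    if r["name"] != "linux:schedule"
}
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{share:6.1%}  {name}")
```

Under that assumption, `inverse_transform_add` accounts for roughly 13% of the encode time by self time.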

Dec 25 '19 06:12 shssoichiro

Please add the command you used to produce it.

cargo-flamegraph works almost decently:

$ cargo flamegraph -b rav1e -- ~/Samples/Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m --tiles 32 -o /dev/null

perf keeps pointing out that we are doing a strange number of memmoves; they are possibly related to the rollback mechanism (I should see whether I can find a sane way to trace them by extending memory-profiler or using callgrind).

Dec 25 '19 14:12 lu-zero

Ah, sure, here is what I used:

$ uftrace record --no-libcall -- target/release/rav1e ~/Downloads/KristenAndSara_42010p.y4m -o /dev/null
$ uftrace report --sort self

Dec 25 '19 14:12 shssoichiro

I did a quick test on 20 1080p frames from the same video, in 8-bit and 10-bit versions. This is the current performance delta:

Speed	8-bit FPS	10-bit FPS	Slowdown
s 10	1.236	0.282	4.38x
s 8	0.459	0.170	2.70x
s 6	0.448	0.163	2.75x
s 4	0.199	0.072	2.76x
s 2	0.179	0.057	3.14x

rav1e 1955d6d, i5-4590
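
The last column of the table is the 8-bit FPS divided by the 10-bit FPS. A minimal sketch that recomputes it from the figures above (the `slowdown` helper name is just for illustration):

```python
# FPS figures copied from the table above: speed level -> (8-bit, 10-bit).
FPS = {
    10: (1.236, 0.282),
    8: (0.459, 0.170),
    6: (0.448, 0.163),
    4: (0.199, 0.072),
    2: (0.179, 0.057),
}


def slowdown(fps_8bit, fps_10bit):
    """Return how many times slower the 10-bit encode is than the 8-bit one."""
    return fps_8bit / fps_10bit


ratios = {speed: round(slowdown(*pair), 2) for speed, pair in FPS.items()}

for speed, ratio in ratios.items():
    print(f"s {speed}: {ratio:.2f}x")
```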

Feb 13 '20 20:02 EwoutH

It's been a year since we checked in on this. Here are the current top-priority functions for potential HBD ASM, as tested at various speeds on an i7-9700. Percentages are of total CPU cycles.

Speed 10 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 10):

34% - inverse_transform_add
14% - av1_idct*
6.5% - cdef_dist_wxh_8x8

Speed 5 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 5):

20% - inverse_transform_add
10% - av1_idct*
8% - sgrproj_box_ab_r1
5.5% - cdef_dist_wxh_8x8

Speed 0 (perf record -F max -g ~/Downloads/rav1e-pre ~/data/objective-1-fast-10bit/KristenAndSara_1280x720_60f.y4m -o /dev/null -s 0 --limit 10):

28% - inverse_transform_add
10% - av1_idct*
4.5% - cdef_dist_wxh_8x8

Mar 30 '21 22:03 shssoichiro

At this time, cdef_dist_wxh_8x8 or its components likely top the list.

Dec 04 '22 07:12 barrbrain

Most if not all of these should be addressed by now. If there are any remaining, we can create individual issues for them.

Jul 07 '23 19:07 shssoichiro
