
Simd-ify multiple areas: fwd/inv tx, dist, and mc.

Open DiamondLovesYou opened this issue 4 years ago • 17 comments

This uses target independent SIMD only, which I require for something I'm working on. It adds the packed_simd crate and thus requires a nightly compiler to use, though I've used no nightly features directly here.

This PR works by generating a rather large number of kernels in the build script; it clocks in at 2,392 kernels and a healthy margin over a million lines. Future work should probably look into trimming this count, by disabling kernels which have little to no impact on FPS.

All the dist functions (SAD and SATD) now have unrolled, vectorized "kernels" for all block types. SATD is now implemented only in registers (assuming no spillage; I didn't check every generated function, but those I looked at had no spills), requiring no initial plane difference calculation. Inv TX has some of the smaller blocks implemented.
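To give a flavor of what these kernels look like, here is a minimal hand-written sketch of a target-independent SAD kernel, assuming the packed_simd 0.3 API; the function name, the 8x8 block size, and the plane layout are assumptions for illustration, not the PR's actual generated code.

use packed_simd::{u8x8, u16x8, FromCast};

/// Sum of absolute differences over an 8x8 8-bit block.
/// `src` and `dst` are row-major planes; strides are in pixels.
pub fn sad_8x8(src: &[u8], src_stride: usize, dst: &[u8], dst_stride: usize) -> u32 {
    let mut acc = u16x8::splat(0);
    for row in 0..8 {
        let a = u8x8::from_slice_unaligned(&src[row * src_stride..][..8]);
        let b = u8x8::from_slice_unaligned(&dst[row * dst_stride..][..8]);
        // |a - b| for unsigned lanes: max(a, b) - min(a, b).
        let abs_diff = a.max(b) - a.min(b);
        acc += u16x8::from_cast(abs_diff);
    }
    // 64 * 255 fits comfortably in u16, so the accumulator cannot overflow.
    u32::from(acc.wrapping_sum())
}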

There is a downside: compile times are now absolutely HORRIFIC for release builds. Currently it takes around 30 minutes to build with a parallel compiler. Incremental dev builds with a parallel compiler are much more reasonable, getting as low as 2-3 minutes on a 32-thread AMD 1950X TR.

DiamondLovesYou avatar Oct 03 '19 07:10 DiamondLovesYou

Is there a way to implement this such that it doesn't completely nuke the actually readable Rust implementations? It feels a bit like reading old Avisynth plugins that were like:

int func(int a) {
    __asm {
        ... 10000 lines of asm
    }
}

or

int func(int a) {
    ... 10000 lines of intrinsics ...
}

And people have to reverse engineer what the heck they're doing.

dwbuiten avatar Oct 03 '19 14:10 dwbuiten

I guess we could add a nightly feature if there is consensus on going this route. ARM/aarch64 would need that as well for a while anyway.

lu-zero avatar Oct 07 '19 07:10 lu-zero

I like the generic intrinsics because they cover a lot of platforms that otherwise won't get love for a long time (arm, risc-v, webassembly, power).

I agree that maintaining both a scalar and vector path would be a lot more palatable for code maintenance.

Does turning off LTO make the compile times reasonable?

tdaede avatar Oct 08 '19 00:10 tdaede

I've pushed an update. It now includes build script code to generate kernels for the mc functions: put_8tap, prep_8tap, and mc_avg. The put_8tap and prep_8tap kernels have different versions for each combination of the col and row frac modes (i.e. whether each frac is == 0), target feature (native/none, sse2, ssse3, and avx2), and block size. mc_avg has kernels for each block size and target feature.

Currently every possible block size gets its own kernel; however, it's possible many block sizes will be mostly unused, and those would be better off with more generic loops over, say, smaller blocks. TODO?

The build script kernel generator is pretty ugly ATM. Future work will see the unrolling/looping decision logic generalized.
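For context, the approach is roughly the following. This is a minimal sketch, not the actual generator: the OUT_DIR file name, the mc_avg_WxH naming, and the trivial placeholder loop body are assumptions for illustration; the real kernels are unrolled/vectorized and use the correct intermediate precision.

// build.rs sketch: emit one mc_avg kernel per block size into OUT_DIR.
use std::env;
use std::fs::File;
use std::io::Write;
use std::path::Path;

fn main() -> std::io::Result<()> {
    let out_dir = env::var("OUT_DIR").unwrap();
    let mut f = File::create(Path::new(&out_dir).join("mc_kernels.rs"))?;
    for &(w, h) in &[(4, 4), (8, 8), (8, 16), (16, 16), (64, 64)] {
        writeln!(
            f,
            "pub fn mc_avg_{w}x{h}(dst: &mut [u8], a: &[i16], b: &[i16]) {{
    for i in 0..{n} {{
        let v = (i32::from(a[i]) + i32::from(b[i]) + 1) >> 1;
        dst[i] = v.max(0).min(255) as u8;
    }}
}}",
            w = w,
            h = h,
            n = w * h
        )?;
    }
    Ok(())
}

// In the crate proper:
// include!(concat!(env!("OUT_DIR"), "/mc_kernels.rs"));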

@tdaede sadly, no :frowning_face: Well, okay, a bit, if you also disable ThinLTO. TL;DR: no, because a single codegen unit currently causes most of the compile time issue anyway.

Dev builds already don't use full LTO, but they still use ThinLTO. Turning that off does give a ~2m reduction, but requires RUSTFLAGS=-Z thinlto=no (lto=false in Cargo.toml is not enough). From -Z time-passes, the ineffectiveness of parallel codegen seems to be mostly a result of lopsided codegen units: most units finish all of their codegen in <10s, but then a single codegen unit takes 5 minutes (roughly 2m in ThinLTO, then 3m in LLVM codegen, i.e. LLVM IR -> machine code). Rust uses source modules (i.e. mod __) to determine codegen unit partitioning, so load-balancing the kernels across modules will likely help (TODO: I plan on moving all these kernels to the build script, which will make such module load-balancing very simple. Not sure how much this will help, and without a parallel Rust compiler the Rust-side parts will still take a while). See this module comment: https://github.com/rust-lang/rust/blob/master/src/librustc_mir/monomorphize/partitioning.rs

Interestingly, using a parallel Rust compiler (i.e. one built with parallel-compiler = true, like the Geobacter Rust compiler) helps quite a bit with the non-LLVM parts. But you'll have to build it from source to use it.

Release build compile time does see some benefit from disabling LTO, but I noticed an FPS drop along with it. This is a tough one: full LTO requires a single codegen unit, so the amount of code LLVM has to chew through at once is large. I've also found that even with LLVM's Polly (I've kept the original Polly PR up-to-date in the Geobacter compiler), LLVM is just not very good at optimizing Rust code: almost no unrolling, minimal vectorization, and lots of slice index checks (asm doesn't do these!! why should we? I mean, obviously we should, but we can do all the necessary checks once before jumping into the kernel and have the kernel itself do none). So if we want good perf from Rust, we're going to have to do the unrolling and vectorization ourselves, which means kernel-izing all the things, which means LLVM has a lot of work to do each build, which means huge compile times. QED and goodnight. OTOH, disabling all debug info helped quite a bit: a quick N=1 test of rebuilding a whitespace change gave 40m 49s w/ debug=true and 28m 37s w/ debug=false.

Phew that was long winded!

I agree that maintaining both a scalar and vector path would be a lot more palatable for code maintenance.

Alright. I'll have to revert some of the stuff I deleted, mostly in TX, but that shouldn't be a big issue as I think I'm going to use the build script to create unrolled/vectorized kernels for those anyway.

Poor man's perf comparison vs --features nasm

This is not supposed to be rigorous; N=2, though both asm runs finished within the same second of duration (within 3s for the no-asm runs), so there's that. The 4k original is available here; you'll need to resize it to 720p to match this encode.

Hardware is an AMD TR 1950X, 64 GB ECC 2400 MHz (16 GB sticks), ASUS ROG Zenith Extreme. RUSTFLAGS=-C target-cpu=znver1 -Z external-macro-backtrace (needed because only the most recent addition uses any sort of target feature attribute).

XXX: Sadly, I gathered these measurements before I added assertions to the start of the mc functions to ensure the input blocks are big enough. Not sure how much of an effect that will have (the asserts are fairly cheap, though). Tackling this compile time issue is going to be a priority; waiting around forever for each build is getting absurd.

  • No asm (so just the kernels in this PR):
$ time target/release/rav1e -s 6 --quantizer 60 ~/Videos/septuple_18-12-28_14-03-17.y4m --output ../septuple_18-12-28_14-03-17-s6-q60-upstreaming-simd.ivf --tiles 16
 INFO  rav1e > Using y4m decoder: 1280x720p @ 60/1 fps, 4:2:0, 8-bit
 INFO  rav1e > Encoding settings: keyint_min=12 keyint_max=240 quantizer=60 bitrate=0 min_quantizer=0 low_latency=false tune=Psychovisual rdo_lookahead_frames=40 min_block_size=8x8 multiref=true fast_deblock=false reduced_tx_set=true tx_domain_distortion=false tx_domain_rate=false encode_bottomup=false rdo_tx_decision=false prediction_modes=Complex-KFs include_near_mvs=false no_scene_detection=false diamond_me=true cdef=true quantizer_rdo=false use_satd_subpel=true non_square_partition=false
 INFO  rav1e::api::config > CPU Feature Level: AVX2
 INFO  rav1e::api::internal > Using 16 tiles (4x4)
 INFO  rav1e::stats         > encoded 1482 frames, 1.623 fps, 4179.11 Kb/s
 INFO  rav1e::stats         > ----------
 INFO  rav1e::stats         > Key frame:             8 | avg QP:  44.00 | avg size:   75164 B
 INFO  rav1e::stats         > Inter frame:        1474 | avg QP:  81.06 | avg size:    8345 B
 INFO  rav1e::stats         > Intra only frame:      0 | avg QP:   0.00 | avg size:       0 B
 INFO  rav1e::stats         > Switching frame:       0 | avg QP:   0.00 | avg size:       0 B
target/release/rav1e -s 6 --quantizer 60  --output  --tiles 16  6753.19s user 130.32s system 753% cpu 15:13.30 total
  • Asm:
$ time target/release/rav1e -s 6 --quantizer 60 ~/Videos/septuple_18-12-28_14-03-17.y4m --output ../septuple_18-12-28_14-03-17-s6-q60-upstreaming-simd.ivf --tiles 16
 INFO  rav1e > Using y4m decoder: 1280x720p @ 60/1 fps, 4:2:0, 8-bit
 INFO  rav1e > Encoding settings: keyint_min=12 keyint_max=240 quantizer=60 bitrate=0 min_quantizer=0 low_latency=false tune=Psychovisual rdo_lookahead_frames=40 min_block_size=8x8 multiref=true fast_deblock=false reduced_tx_set=true tx_domain_distortion=false tx_domain_rate=false encode_bottomup=false rdo_tx_decision=false prediction_modes=Complex-KFs include_near_mvs=false no_scene_detection=false diamond_me=true cdef=true quantizer_rdo=false use_satd_subpel=true non_square_partition=false
 INFO  rav1e::api::config > CPU Feature Level: AVX2
 INFO  rav1e::api::internal > Using 16 tiles (4x4)
 INFO  rav1e::stats         > encoded 1482 frames, 2.121 fps, 4176.41 Kb/s
 INFO  rav1e::stats         > ----------
 INFO  rav1e::stats         > Key frame:             8 | avg QP:  44.00 | avg size:   75164 B
 INFO  rav1e::stats         > Inter frame:        1474 | avg QP:  81.06 | avg size:    8340 B
 INFO  rav1e::stats         > Intra only frame:      0 | avg QP:   0.00 | avg size:       0 B
 INFO  rav1e::stats         > Switching frame:       0 | avg QP:   0.00 | avg size:       0 B
target/release/rav1e -s 6 --quantizer 60  --output  --tiles 16  4898.47s user 122.10s system 718% cpu 11:38.78 total

Interesting to see this PR use 5 extra bytes per inter frame on average. TODO: measure against this PR's base.

An alternative method of compiling the kernels

An option I've been thinking about is moving these kernels into their own crate (ideally with some way of forcing no LTO for just that crate), or otherwise compiling them into their own objects outside of building librav1e proper. These functions are (or will be) mostly called via function tables (selected by available ISA, block size, etc.), and thus likely won't benefit from LTO anyway. My use case can't do this (LLVM AMDGPU does not support any indirect function calls), but for everyone else this should work great (I'll figure something out... hopefully). The catch is that these kernels also depend on the packed_simd crate (the rav1e references could be removed if needed, but not packed_simd), so we can't just manually invoke rustc from the build script the way the x86 asm is assembled and linked in. Actually, this might work: cargo won't run our build script until all dependencies are built, which is needed when e.g. a dep exports some env var for C/C++ includes for downstream crates. So this could work, albeit it's still pretty hacky: "all" we'd have to do is find the correct packed_simd rlib. Thoughts?
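For reference, by "called via function tables" I mean roughly the following kind of dispatch. This is a minimal sketch; the SadFn/SadDispatch names are hypothetical, not rav1e's actual types.

// Kernel pointers are looked up by block size and detected CPU feature level,
// so calls are already indirect and cross-crate LTO/inlining buys little.
type SadFn = fn(src: &[u8], src_stride: usize, dst: &[u8], dst_stride: usize) -> u32;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum CpuFeatureLevel {
    Generic,
    Sse2,
    Avx2,
}

// One table per block size (indexed elsewhere); None falls back to generic.
struct SadDispatch {
    generic: SadFn,
    avx2: Option<SadFn>,
}

impl SadDispatch {
    fn select(&self, cpu: CpuFeatureLevel) -> SadFn {
        match (cpu, self.avx2) {
            (CpuFeatureLevel::Avx2, Some(f)) => f,
            _ => self.generic,
        }
    }
}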

DiamondLovesYou avatar Oct 09 '19 18:10 DiamondLovesYou

Partitioning the mc kernels resulted in a massive speed up compiling librav1e: a whitespace change now rebuilds in ~165s, of which ~120s is spent on item checking (a parallel compiler will help with this).

Sadly, rav1e the binary still eats a ton of time: a single codegen unit is taking 230s.

Down to 7 1/2 minutes for dev builds though!

DiamondLovesYou avatar Oct 10 '19 14:10 DiamondLovesYou

File and Write seem to be missing from the build system.

lu-zero avatar Oct 10 '19 15:10 lu-zero

Ah, I forgot I changed the import for the nasm stuff in the build script too.

DiamondLovesYou avatar Oct 10 '19 15:10 DiamondLovesYou

So

  • Master

41.74s  syn
34.33s  clap
30.90s  rav1e
29.42s  regex
27.55s  regex_syntax
25.13s  serde_derive
22.92s  synstructure
17.57s  serde
15.69s  num_derive
14.51s  cc
14.06s  rav1e

  • your branch

491.67s rav1e
358.90s rav1e
66.11s  packed_simd

This is quite crazy.

lu-zero avatar Oct 10 '19 16:10 lu-zero

If you could put up before and after numbers, that would be great.

Looks like this will hopefully lead to some issues filed for rustc/llvm :)

lu-zero avatar Oct 15 '19 08:10 lu-zero

@lu-zero I have a slightly old comparison, which doesn't include the recent SAD kernel rewrite: https://docs.google.com/spreadsheets/d/1oduiszODkHUk2FQG7-Hflb9KWiTw3CQIVnhto2d2Ldo/edit?usp=sharing

Averages ~203% base speed. Interestingly, -s 9 sees ~266% across all tested -q levels.

I haven't yet filed any LLVM bugs, mostly because this is unpaid volunteer work, when I should be job hunting! And honestly, I don't think this is that much of a bug: overzealous inlining is, most of the time, good (a joke in LLVM is that the inlining heuristic is "yes"). This particular case was just pathological: a "wrapper" function (i.e. a function which only calls another function) called from a macro which expands into nested matches. There is probably a rule in LLVM that always inlines these wrapper functions, which is Good(TM). The weird thing is that the wrapped function (FwdTxfm2D::fht in this case) then also gets inlined, turning encode_tx_block into an Absolute Unit.
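For clarity, the shape I'm describing is roughly this (a hypothetical sketch, not the actual rav1e code):

// A thin "wrapper" that only forwards to the real transform; macro-expanded
// nested matches call it once per (tx size, tx type) arm.
fn fht_wrapper(input: &[i16], output: &mut [i32]) {
    fht_impl(input, output);
}

// LLVM reasonably inlines the wrapper into every match arm, but then also
// inlines the (large) wrapped body, so the caller balloons into one enormous
// function and a single codegen unit ends up with most of the work.
fn fht_impl(input: &[i16], output: &mut [i32]) {
    for (o, &i) in output.iter_mut().zip(input.iter()) {
        *o = i32::from(i); // stand-in for the real transform body
    }
}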

Okay, maybe that should be a bug. Ugh, more work.

DiamondLovesYou avatar Oct 16 '19 03:10 DiamondLovesYou

@DiamondLovesYou Nice work! Could you resolve the conflicts and rebase your branch on xiph:master?

EwoutH avatar Oct 16 '19 11:10 EwoutH

New bench results: now averages ~2.28x faster; -s 9 -q * is 2.84x faster. https://docs.google.com/spreadsheets/d/1oduiszODkHUk2FQG7-Hflb9KWiTw3CQIVnhto2d2Ldo/edit?usp=sharing

I think -s 9 is broken, though... VLC can play it, but the output is pretty garbled. Base is fine.

@EwoutH Thanks for your interest, but this PR is slightly broken ATM (see ^), so I'd rather not rebase and have to redo the base benchmarks.

DiamondLovesYou avatar Oct 17 '19 18:10 DiamondLovesYou

Also, I've murdered compile times again...

DiamondLovesYou avatar Oct 17 '19 18:10 DiamondLovesYou

Fixed compile times for dev builds for the most part; release builds are still a PITA.

DiamondLovesYou avatar Oct 24 '19 23:10 DiamondLovesYou

@EwoutH Rebased, but asm support is broken ATM.

DiamondLovesYou avatar Oct 30 '19 13:10 DiamondLovesYou

Hi,

There seem to be a lot of conflicts, and the branch is quite out of sync with the codebase as a whole.

The huge build time seems to be quite a big blocker for adoption; I would be interested to see the current status after rebasing onto current master, for further discussion.

Are you planning to work on this?

vibhoothi avatar Jul 13 '20 14:07 vibhoothi

@vibhoothiiaanand Hi. I actually started a rebase a while ago now, but I kinda lost interest midway through. Supporting yet another set of kernels, without unified function table statics/dispatch for them, makes rebasing very messy.

There actually isn't too large of a compile time increase anymore (I mean, LTO will be out regardless, but LTO is a blunt instrument anyway; expecting a fast compile w/ LTO is silly); among other things, only hot kernels get specialized, i.e. non-specialized functions are always inlined into specialized functions where many parameters are constant, allowing LLVM to const-propagate and unroll loops. I've also figured out how to write loops in Rust so that LLVM will unroll them.
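A minimal sketch of the specialization pattern I mean (the names and the 16x16 block size are just for illustration, not the actual code):

// Generic body, always inlined into callers where the dimensions are constants,
// so LLVM can const-propagate the trip counts and unroll the loops.
#[inline(always)]
fn sad_generic(src: &[u8], dst: &[u8], stride: usize, w: usize, h: usize) -> u32 {
    let mut acc = 0u32;
    for y in 0..h {
        for x in 0..w {
            let a = i32::from(src[y * stride + x]);
            let b = i32::from(dst[y * stride + x]);
            acc += (a - b).abs() as u32;
        }
    }
    acc
}

// Specialized "hot" kernel: w and h are compile-time constants here.
pub fn sad_16x16(src: &[u8], dst: &[u8], stride: usize) -> u32 {
    sad_generic(src, dst, stride, 16, 16)
}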

I would be willing to resume, but only if I can get assurances that this work will, in some form, be accepted in a reasonable amount of time (this work actually isn't nearly as useful for GPGPU as I thought it would be... so, oops): I don't want to have to do messy rebases. I also have a new employer (AMD), so as a salaried employee I'd have to check in with them to make sure they're OK with me working on this in my free time (I exempted this as prior work, but new work doesn't count under that, I don't think).

Note: this also requires a nightly compiler, due to generic SIMD not being stable (for good reasons, IMO, but that's another story), so it might be a better idea to wait until const generics can be used for SIMD before this is given more effort. Additionally, packed_simd still hasn't been updated on crates.io!

DiamondLovesYou avatar Jul 17 '20 19:07 DiamondLovesYou