Alexander Monakov
Ooh, and I can stick an ALU op in the dependency chain without increasing overall latency, making best-case forwarding latency 3 cycles on SNB and IVB: ```nasm loop: mov eax,...
> So this loop runs in 4 cycles per iteration? Amazingly, yes!
Indeed, this runs at 3 cycles per iteration too. *Perfection.*
```nasm
loop:
    mov [rsp], rdi
    imul rsp, 1
    mov rdi, [rsp]
    dec ecx
    jnz loop
```
@blackout24, it's odd that it blocks that badly, but the issue seems tied to disk writeback. Can you try tuning page cache writeback according to http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ ?
How much RAM do you have? Try making background writeback more eager by lowering the threshold to 2 MiB: `sudo sysctl vm.dirty_background_bytes=$((2**21))`
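For anyone following along, here is a minimal sketch of that suggestion in plain POSIX shell arithmetic. The variable names `bytes` and `cmd` are just illustrative. It only prints the command, on the assumption that you want to check the value before changing a system-wide setting.

```shell
# Threshold from the comment above: 2**21 bytes = 2 MiB.
bytes=$((2 * 1024 * 1024))

# Build and print the command instead of running it, so it can be reviewed first.
cmd="sysctl -w vm.dirty_background_bytes=$bytes"
echo "$cmd"

# To apply it (needs root; the setting lasts until reboot):
#   sudo $cmd
# To read the current value back:
#   sysctl vm.dirty_background_bytes
```

Note that `vm.dirty_background_bytes` and `vm.dirty_background_ratio` are counterparts: writing one makes the other read back as 0, so setting the byte value overrides any ratio configured earlier.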
In steps 3.i and 3.ii most of the optimizations are _not enabled_, since you're not passing -O on the command line. This is a frequent "paper cut" with GCC command...
`-fvisibility=default` constrains some optimizations. Did a few tests with 4.8.1 at -O1. RTL DSE and postreload cse seem to be responsible for the huge memory consumption, `-fno-dse -fdbg-cnt=postreload_cse:0` is a...
Trunk still needs -fno-dse, but postreload cse seems to be improved a bit; still consumes a lot of memory, but does not explode like on 4.8.1.