hardware-effects
Additional hardware effects ideas
- [x] non-temporal stores
- [x] multiple threads saturating the memory bus
- [x] hardware prefetching with indexed accesses
- [x] floating point handling (denormals etc.)
- [x] 4k aliasing
- [x] store buffer capacity
- [ ] instruction cache misses
- [ ] TLB misses
- [ ] more multithreading examples (lock contention etc.)
- [ ] vector instructions
- [ ] critical word load
- [ ] CUDA examples
Some hardware effects I have come across:
- [ ] loop optimized (or not) by Loop Stream Detector instruction queue replay
- [ ] loop/branch misalignment
- [ ] macro-fusable ops split on cache line boundary (Intel Core Architectures, Nehalem and newer)
@martisch The third one sounds extra juicy :) But also very CPU-specific I guess. If you have more specific ideas on how to demonstrate those effects, please do share :)
FWIW through testing I haven't been able to find any evidence of critical word first (CWF) on modern processors, but I would be very interested in any test that shows it.
I can't really wrap my mind around how CWF would actually work on a system that has a 64-byte bus between L2 and L1 (like Skylake and later Intel CPUs): this implies that the entire 64-byte cache line goes from L2 to L1 in a single transfer, so no word is "first": they all arrive at the same time.
Even with smaller buses, like 16 or 32 bytes, it seems like the opportunity for CWF is very limited: probably only a cycle difference between the first and second half.
I asked on RWT about whether CWF is still used - but I got both "yes" and "no" answers and no solid conclusion.
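To put rough numbers on the bus-width argument above, here is a small sketch of the arithmetic. It assumes a 64-byte cache line that is delivered in fixed-size chunks, and it only counts transfers; how many cycles a transfer takes on a given CPU is not something this models. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bus transfers needed to move one cache line,
// assuming the line size is a whole multiple of the bus width.
constexpr std::size_t transfersPerLine(std::size_t lineBytes, std::size_t busBytes) {
    assert(busBytes != 0 && lineBytes % busBytes == 0);
    return lineBytes / busBytes;
}

// Hypothetical helper: if the line always arrives in address order (no CWF),
// the requested word sits in the last chunk in the worst case. This is the
// number of transfers that CWF could save at most.
constexpr std::size_t worstCaseExtraTransfers(std::size_t lineBytes, std::size_t busBytes) {
    return transfersPerLine(lineBytes, busBytes) - 1;
}

// 64-byte bus: the whole line arrives at once, so no word is "first".
static_assert(worstCaseExtraTransfers(64, 64) == 0, "single transfer");
// 32-byte bus: two halves, at most one transfer of difference.
static_assert(worstCaseExtraTransfers(64, 32) == 1, "two transfers");
// 16-byte bus: four chunks, at most three transfers of difference.
static_assert(worstCaseExtraTransfers(64, 16) == 3, "four transfers");
```

So under these assumptions the possible gain is zero with a 64-byte bus and only a handful of transfers with narrower ones, which is consistent with CWF being hard to observe between L2 and L1.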
I spent a few hours yesterday trying to simulate it, without success - but I'm no expert :) The list here is just random ideas/keywords taken from the web, and I have no idea whether some of them can be demonstrated consistently at all.
I would expect the RAM controller to reorder some of the data it sends to the CPU, but I have no idea whether the same is done in the caches. I think it was mentioned in Ulrich Drepper's memory paper, "What Every Programmer Should Know About Memory" (https://akkadia.org/drepper/cpumemory.pdf), but I don't remember exactly.
One effect you might consider is demonstrating store buffer capacity.
I tried this (https://github.com/nicknash/GuessStoreBuffer - pretty horrible sorry!), and wrote what I understood to be going on at my blog (https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/) - any corrections very welcome!
I could code up a much neater C++ version if you like.
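For anyone curious what such a measurement might look like, here is a rough C++ sketch. It is not the code from the linked repository or blog post. It assumes the general idea of issuing a batch of stores followed by long-latency non-memory work, then checking whether the time per iteration jumps once the batch no longer fits in the store buffer. The function name, the chain length and the multiplier constants are arbitrary choices for illustration, and the compiler may still transform the loops, so the generated assembly should be checked before trusting any numbers.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: average nanoseconds per iteration, where one iteration
// performs `storeCount` stores followed by a dependent multiply-add chain.
double nsPerIteration(std::size_t storeCount, std::size_t iterations) {
    assert(storeCount > 0 && iterations > 0);

    std::vector<std::uint64_t> buffer(storeCount, 0);
    // Writing through a volatile pointer keeps the compiler from dropping
    // or merging the individual stores.
    volatile std::uint64_t* out = buffer.data();
    volatile std::uint64_t sink = 0;

    const int kChainLength = 64;  // arbitrary; may need tuning per CPU
    std::uint64_t x = 3;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t it = 0; it < iterations; ++it) {
        // Batch of stores that must each occupy a store buffer entry.
        for (std::size_t i = 0; i < storeCount; ++i) {
            out[i] = it;
        }
        // Long-latency work with no memory traffic: each step depends on
        // the previous one, so the steps cannot overlap with each other.
        for (int k = 0; k < kChainLength; ++k) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        }
    }
    const auto end = std::chrono::steady_clock::now();

    sink = x;  // keep the result of the chain alive
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `nsPerIteration(n, 1000000)` for `n` from 1 up to, say, 128 and plotting the results would be one way to look for a knee in the curve. Where the knee lands, if there is one at all, depends on the CPU and on what the compiler did with the loops.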
@nicknash Hi, sorry for the late response. That is an awesome article and experiment! If you could prepare a C++ version similar to the ones that are already in this repo, I'd be happy to merge it. Please create a PR if you're interested and we can discuss it there.
@nicknash I took the liberty of adding this example myself and mentioned your comment and blog post (which is very cool BTW :) ). https://github.com/Kobzol/hardware-effects/commit/c2627f838d6fa788866982cc9412c15fe5dcc4b6
@kobzol, that’s cool! It has been sitting on my todo list for much too long.