
TX2 L1 cache bandwidth inaccuracy

Open · jrprice opened this issue on Nov 9, 2019 · 0 comments

An L1 cache bandwidth benchmark (e.g. STREAM triad-only, no OpenMP, with small arrays) built with certain compilers (e.g. Clang) currently produces extremely inaccurate results: SimEng achieves much more bandwidth than real hardware. From memory it should be around 70 GB/s per core, whereas we get >100 GB/s.
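
For reference, a minimal sketch of the kind of benchmark meant here (triad-only, no OpenMP, arrays sized to stay in L1). The array size and repetition scheme are illustrative, not the exact configuration used; it assumes the TX2's 32 KiB L1D:

```cpp
#include <cstddef>

// Triad-only STREAM-style kernel: a[i] = b[i] + scalar * c[i].
// With N small enough that all three arrays fit in a 32 KiB L1D,
// the loop is bound by L1 load/store bandwidth rather than DRAM.
constexpr std::size_t N = 1024;  // 3 * 1024 * 8 B = 24 KiB, fits in L1
double a[N], b[N], c[N];

void triad(double scalar, int reps) {
  for (int r = 0; r < reps; ++r)
    for (std::size_t i = 0; i < N; ++i)
      a[i] = b[i] + scalar * c[i];
}
```

Compiled for AArch64 at -O3, Clang vectorizes this loop using 128-bit NEON registers, which is where the ldp instructions below come from.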

The issue comes from the fact that Clang generates 128-bit ldp instructions, each of which loads 32 bytes of data (2×16). Since we model a TX2 with two LSUs and do not micro-op these instructions, we end up issuing 64 bytes of load requests every cycle (one ldp on each LSU). The LSUs in the TX2 actually have a limit of 16 bytes/cycle each, i.e. 32 bytes/cycle total.
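
A back-of-the-envelope check of why this lines up with the observed numbers (the core clock is an assumption here; TX2 parts run at roughly 2.0-2.5 GHz):

```cpp
// Assumed ~2.2 GHz core clock (illustrative, not a measured value).
constexpr double ghz = 2.2;
constexpr double real_bw     = 2 * 16 * ghz;  // 2 LSUs x 16 B/cycle ~= 70 GB/s
constexpr double modelled_bw = 2 * 32 * ghz;  // 2 ldp  x 32 B/cycle ~= 141 GB/s
```

The 16 B/cycle-per-LSU figure reproduces the ~70 GB/s seen on hardware, while the unconstrained 32 B/cycle per ldp is consistent with the >100 GB/s SimEng reports.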

There are two ways to address this. One option is to micro-op these instructions into two independent 16-byte loads, which is what TX2 actually does. The other option is to introduce the ability to describe the load throughput limit in the LSQ or LSUs, and buffer pending loads to resume in the next cycle once the limit has been reached (stalling if necessary).

The latter should be much simpler to implement. The former should be more accurate (since it captures more than just the bandwidth limitation), but requires drastically more implementation effort and complexity (we do not model micro-oping at all at present, and this introduces many tangential concerns).
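
For the throughput-limit option, a minimal sketch of the idea (names and structure are hypothetical, not SimEng's actual LSQ/LSU interface): each cycle the unit gets a byte budget, issues pending loads in order until the budget is exhausted, and anything left over stays at the head of the queue for the next cycle.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

// Hypothetical pending load request: tracks how many bytes it still needs.
struct LoadRequest {
  uint16_t bytesRemaining;
  // ... address, destination register, etc.
};

// Hypothetical per-LSU (or per-LSQ) issue logic with a bytes-per-cycle cap.
class LoadThroughputLimiter {
 public:
  explicit LoadThroughputLimiter(uint16_t bytesPerCycle)
      : bytesPerCycle_(bytesPerCycle) {}

  void enqueue(LoadRequest req) { pending_.push_back(req); }

  // Called once per simulated cycle: issue loads in order until the byte
  // budget runs out; a partially served request resumes next cycle.
  void tick() {
    uint16_t budget = bytesPerCycle_;
    while (budget > 0 && !pending_.empty()) {
      LoadRequest& head = pending_.front();
      uint16_t issued = std::min(head.bytesRemaining, budget);
      head.bytesRemaining -= issued;
      budget -= issued;
      if (head.bytesRemaining == 0)
        pending_.pop_front();  // fully issued this cycle
      // else: stall; the remaining bytes carry over to the next cycle
    }
  }

 private:
  uint16_t bytesPerCycle_;
  std::deque<LoadRequest> pending_;
};
```

With a 16 B/cycle cap per LSU, a 32-byte ldp would occupy its unit for two cycles, reproducing the aggregate 32 B/cycle limit without modelling micro-ops.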
