[gcc] peephole2 to generate double load/stores not kicking in upstream ARC gcc - optimize LMBench bw_mem frd/fwr
LMBench memory bandwidth tests frd() and fwr() access 512 consecutive bytes per loop iteration to compute memory subsystem bandwidth.
void fwr(iter_t iterations, void *cookie) <-- mem write consecutive words [Report #4]
{
...
register int *p = state->buf;
p[0]= p[1]= p[2]= p[3]= p[4]= p[5]= p[6]=
p[7]= p[8]= p[9]= p[10]= p[11]= p[12]=
p[13]= p[14]= p[15]= p[16]= p[17]= p[18]=
p[19]= p[20]= p[21]= p[22]= p[23]= p[24]=
...
p[123]= p[124]= p[125]= p[126]= p[127]= 1;
p += 128;
}
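To reproduce this outside LMBench, a reduced test case along the following lines might be enough. This is only a sketch: `fill_words` and its 8-stores-per-iteration body are hypothetical stand-ins for the 128-store loop above, not LMBench code, and the word count is assumed to be a multiple of 8.

```c
#include <stddef.h>

/* Hypothetical reduced test case (not the LMBench source).
   Writes 1 to `words` consecutive ints, eight stores per iteration,
   mirroring the chained-assignment pattern used by fwr().
   Assumption: `words` is a multiple of 8. */
void fill_words(int *p, size_t words)
{
    int *end = p + words;

    while (p < end) {
        /* Eight adjacent word stores: candidates for pairing into
           double-word stores. */
        p[0] = p[1] = p[2] = p[3] = p[4] = p[5] = p[6] = p[7] = 1;
        p += 8;
    }
}
```

Compiling this with `-O2 -S` and again with `-Os -S` using an ARC cross compiler and checking the output for `std` versus `st` should show whether the stores get paired.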
At -O2 the normal (boring) generated code uses regular ST instructions (with both upstream gcc and GNU 2020.03):
fwr:
...
.L83:
st.as 1,[r2,127]
st.as 1,[r2,126]
...
st.as 1,[r2,64]
st 1,[r2,252]
st 1,[r2,248]
...
st 1,[r2,4]
st 1,[r2]
add r2,r2,512 # p, p,
cmp_s r3,r2 # lastone, p
bhs @.L83
At -Os, gcc from the GitHub fork enables store merging, coalescing two consecutive word stores (ST) into a single double-word store (STD):
.L53:
brhi r0, r13, @.L52 #, p, lastone,
mov_s r2,1
mov_s r3,1
std r2,[r0,8]
std r2,[r0,16]
...
std r2,[r0,248]
std.as r2,[r0,64]
std.as r2,[r0,66]
...
std.as r2,[r0,126]
st 1,[r0,4]
st 1,[r0]
add r0,r0,512
b_s @.L53
This improves memory write bandwidth by over 20%.
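At the C level, the effect of the merge can be pictured roughly as below. This is a sketch for illustration only: the function names are hypothetical, and the `memcpy` merely models one 64-bit store of a register pair holding the same value in both halves (like r2 and r3 both being set to 1 before the `std` above). It does not describe how gcc implements the transformation.

```c
#include <stdint.h>
#include <string.h>

/* Two separate 32-bit stores, analogous to a pair of ST instructions. */
void store_pair_split(uint32_t *p, uint32_t v)
{
    p[0] = v;
    p[1] = v;
}

/* One 64-bit store, analogous to a single STD of a register pair.
   Both halves hold the same value, so the result does not depend on
   endianness. */
void store_pair_merged(uint32_t *p, uint32_t v)
{
    uint64_t both = ((uint64_t)v << 32) | v;

    memcpy(p, &both, sizeof both);
}
```

Both functions leave memory in the same state; the merged form simply needs half as many store instructions for the same 512 bytes.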
Back in 2018 Claudiu pushed an ARC gcc patch which enabled peephole2 patterns for generating LDD/STD: [PATCH 4/6] [ARC] Add peephole rules to combine store/loads into double store/loads
However, it seems there is one more patch (in generic code), "[MAINLINE][HACK] Allow store merging using 64-bit std instructions.", which is not merged upstream, and without it the peephole doesn't kick in.
So to summarize:
- The LDD peephole doesn't work at all.
- The LDD/STD peephole should be enabled for upstream gcc too.
Yeah, my speech about custom vs. mainline. The store merging is done with the help of the mod which I did in upstream, and it probably will not go in until I find another architecture which will benefit from it. LATE EDIT: I'll check if I can make it work without that hack ;)
The autovectorizer should take care of it.