[gcc] peephole2 to generate double load/stores not kicking in upstream ARC gcc - optimize LMBench bw_mem frd/fwr

Open vineetgarc opened this issue 4 years ago • 1 comments

LMBench memory bandwidth tests frd() and fwr() access 512 consecutive bytes per loop iteration to measure memory subsystem bandwidth.

void fwr(iter_t iterations, void *cookie)		<-- mem write consecutive words [Report #4]
{
...
     register int *p = state->buf;

  p[0]= p[1]= p[2]= p[3]= p[4]= p[5]= p[6]=
  p[7]= p[8]= p[9]= p[10]= p[11]= p[12]=
  p[13]= p[14]= p[15]= p[16]= p[17]= p[18]=
  p[19]= p[20]= p[21]= p[22]= p[23]= p[24]=
...
  p[123]= p[124]= p[125]= p[126]= p[127]= 1;
  p += 128;
     }
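For experimenting outside LMBench, the pattern above can be reduced to a small standalone function. This is only a sketch under assumptions: `fwr_sketch`, `FWR_WORDS_PER_ITER`, and the 8-store unroll are hypothetical names and sizes chosen for brevity, not LMBench's actual code (which unrolls 128 stores per iteration).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of fwr(): the name and the 8-word unroll are
   illustrative only; LMBench's fwr() stores 128 words per iteration. */
#define FWR_WORDS_PER_ITER 8

void fwr_sketch(int *buf, size_t nwords)
{
    int *p = buf;
    int *end = buf + nwords;

    /* The unrolled body assumes a whole number of iterations. */
    assert(nwords % FWR_WORDS_PER_ITER == 0);

    while (p < end) {
        /* Adjacent word stores of the same constant: each neighbouring
           pair is a candidate for a single double-word store. */
        p[0] = p[1] = p[2] = p[3] = p[4] = p[5] = p[6] = p[7] = 1;
        p += FWR_WORDS_PER_ITER;
    }
}
```

Compiling this with `-S` at `-O2` and at `-Os` and comparing the store instructions in the loop body should show whether pairs of word stores are being combined.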

At -O2 the normal (boring) generated code uses regular ST instructions (with both upstream gcc and GNU 2020.03):

fwr:
...
.L83:
	st.as	1,[r2,127]
	st.as	1,[r2,126]
...
	st.as	1,[r2,64]
	st	1,[r2,252]
	st	1,[r2,248]
...
	st	1,[r2,4]
	st	1,[r2]

	add r2,r2,512	# p, p,
	cmp_s r3,r2    # lastone, p
	bhs @.L83

At -Os, gcc from the GitHub fork enables store merging, coalescing two consecutive word stores (ST) into a single double-word store (STD):

.L53:
	brhi r0, r13, @.L52		#, p, lastone,

	mov_s	r2,1
	mov_s	r3,1
	std r2,[r0,8]
	std r2,[r0,16]
...
	std r2,[r0,248]
	std.as r2,[r0,64]
	std.as r2,[r0,66]	
...
	std.as r2,[r0,126]
	st	1,[r0,4]
	st	1,[r0]

	add r0,r0,512
	b_s @.L53

This improves memory write bandwidth by over 20%.
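The effect of the merge can be sketched in C: two adjacent 32-bit stores leave memory in the same state as one 64-bit store of the combined value. In the -Os listing, the two `mov_s` instructions load 1 into r2 and r3, and each `std` then appears to write that register pair in one go. The helper names below are hypothetical and only illustrate the equivalence; they are not part of gcc or LMBench.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two separate word stores, as in the -O2 output (one ST each). */
void store_two_words(int32_t *p)
{
    p[0] = 1;
    p[1] = 1;
}

/* One 64-bit store with the same memory effect (what a single STD does).
   Hypothetical helper for illustration. */
void store_one_double(int32_t *p)
{
    /* Both halves hold 1, so the byte image is the same on either
       endianness. */
    uint64_t pair = ((uint64_t)1 << 32) | 1u;

    assert(sizeof pair == 2 * sizeof *p);
    memcpy(p, &pair, sizeof pair);
}

/* Returns 1 if both variants leave identical bytes in memory. */
int merged_store_matches(void)
{
    int32_t a[2] = {0, 0};
    int32_t b[2] = {0, 0};

    store_two_words(a);
    store_one_double(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Halving the number of store instructions per 512-byte block is consistent with the bandwidth gain reported above, although the exact figure depends on the core and memory system.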

Back in 2018 Claudiu pushed an ARC gcc patch which enabled peephole2 patterns for generating LDD/STD: [PATCH 4/6] [ARC] Add peephole rules to combine store/loads into double store/loads

However, it seems there is one more patch (in generic code), [MAINLINE][HACK] Allow store merging using 64-bit std instructions., which is not merged upstream, and without it the peephole doesn't kick in.

So to summarize:

1. The LDD peephole doesn't work at all.
2. Enable the LDD/STD peephole for upstream gcc too.
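To check the load side, a read counterpart of the reduced test can be used. This is a sketch under assumptions: `frd_sketch` and `FRD_WORDS_PER_ITER` are hypothetical names, and the summing body is an assumed shape for a consecutive-word read loop, since the issue does not show the frd() source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a consecutive-word read loop; the name,
   the 8-word unroll, and the summing body are assumptions. */
#define FRD_WORDS_PER_ITER 8

int frd_sketch(const int *buf, size_t nwords)
{
    const int *p = buf;
    const int *end = buf + nwords;
    int sum = 0;

    /* The unrolled body assumes a whole number of iterations. */
    assert(nwords % FRD_WORDS_PER_ITER == 0);

    while (p < end) {
        /* Adjacent word loads: each neighbouring pair is a candidate
           for a single double-word load. */
        sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7];
        p += FRD_WORDS_PER_ITER;
    }
    return sum;
}
```

If the generated loop contains only single-word loads at both `-O2` and `-Os`, that would match the first point in the summary.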

Nov 06 '20 03:11 vineetgarc

Yeah, this is my speech about custom vs. mainline. The store merging is done with the help of the mod I made to the upstream code, and it probably will not get in until I find another architecture which benefits from it. LATE EDIT: I'll check if I can make it work without that hack ;)

Nov 06 '20 08:11 claziss

The autovectorizer should take care of it.

Jan 03 '23 14:01 claziss
