heapless icon indicating copy to clipboard operation
heapless copied to clipboard

SingleCore mode of spsc::Queue lost

Open birkenfeld opened this issue 4 years ago • 4 comments

During the const generics port, the SingleCore mode of the queue has been removed, but this is not mentioned in the changelog (like was done for the support for different index sizes).

Is it possible to reintroduce that mode? No need to generate all those memory barrier instructions if they are unneeded.

birkenfeld avatar Jun 27 '21 09:06 birkenfeld

Hi, this was chosen to make deriving correctness of the implementation easier. The issue with the old SingleCore mode is that it's not possible to auto detect, and even so, not quite sure it's correct on all Cortex-M MCUs.

Re-adding it from a coding standpoint is not an issue, showing correctness for all the targets is. Would you feel up to the task to try and show correctness for the SingleCore impl and make a small write-up about it with motivation?

Just out of curiosity, how much performance did you loose? In my tests it only impacts 3 clock cycles (Cortex-M3) on enqueue for example, so I thought no-one would miss the feature.

korken89 avatar Jun 30 '21 14:06 korken89

Would you feel up to the task to try and show correctness for the SingleCore impl and make a small write-up about it with motivation?

Hehe, that's out of my league, unfortunately.

I haven't measured speed as much as saw a 0.3k code size increase after upgrading to 0.7 and could trace that back to an (apparently unrolled) loop in a dequeue operation, where one of the additional instructions was a dmb sy.

If you're interested I can post the disassembly of the old/new comparison.

birkenfeld avatar Jun 30 '21 16:06 birkenfeld

Hehe I understand :) Please do, is like to have a look. I'd only expect a few bytes difference, not 0 .3k. Unless the compiler is not quite as good at optimizing const generics code yet..

korken89 avatar Jun 30 '21 16:06 korken89

Sorry for taking a while. The biggest single increase is in a loop

for &byte in some_slice {
    queue.enqueue_unchecked(byte);
}

It appears to be unrolled by the compiler to work in increments of 16. (Current nightly, lto = true, release with no specific opt-level set.)

The size of the loop is 412 bytes in 0.7 vs 194 bytes in 0.6.

heapless 0.7:

 8004644:	f810 3f10 	ldrb.w	r3, [r0, #16]!
 8004648:	f10b 0208 	add.w	r2, fp, #8
 800464c:	f8db 6004 	ldr.w	r6, [fp, #4]
 8004650:	5593      	strb	r3, [r2, r6]
 8004652:	1c73      	adds	r3, r6, #1
[repeated block starts here]
 8004654:	f3bf 8f5f 	dmb	sy
 8004658:	f36f 239f 	bfc	r3, #10, #22
 800465c:	f8cb 3004 	str.w	r3, [fp, #4]
 8004660:	7843      	ldrb	r3, [r0, #1]
 8004662:	f8db 6004 	ldr.w	r6, [fp, #4]
 8004666:	5593      	strb	r3, [r2, r6]
 8004668:	1c73      	adds	r3, r6, #1
[15 more copies of repeated block]
 800479e:	f3bf 8f5f 	dmb	sy
 80047a2:	f36f 229f 	bfc	r2, #10, #22
 80047a6:	f8cb 2004 	str.w	r2, [fp, #4]
 80047aa:	f100 0210 	add.w	r2, r0, #16
 80047ae:	428a      	cmp	r2, r1
 80047b0:	f47f af48 	bne.w	8004644 <_ZN16display_firmware18__cortex_m_rt_main17hfb688fea963e9033E+0x1358>

heapless 0.6 (sorry for the long snippet, but there doesn't seem to be a literally repeated block since the compiler is happily using lots of registers):

 80042d8:	4606      	mov	r6, r0
 80042da:	f10b 0304 	add.w	r3, fp, #4
 80042de:	f36f 269f 	bfc	r6, #10, #22
 80042e2:	f811 5f10 	ldrb.w	r5, [r1, #16]!
 80042e6:	559d      	strb	r5, [r3, r6]
 80042e8:	1c46      	adds	r6, r0, #1
 80042ea:	f36f 269f 	bfc	r6, #10, #22
 80042ee:	784d      	ldrb	r5, [r1, #1]
 80042f0:	788c      	ldrb	r4, [r1, #2]
 80042f2:	559d      	strb	r5, [r3, r6]
 80042f4:	1c85      	adds	r5, r0, #2
 80042f6:	f100 0609 	add.w	r6, r0, #9
 80042fa:	f36f 259f 	bfc	r5, #10, #22
 80042fe:	78ca      	ldrb	r2, [r1, #3]
 8004300:	f891 e004 	ldrb.w	lr, [r1, #4]
 8004304:	f36f 269f 	bfc	r6, #10, #22
 8004308:	555c      	strb	r4, [r3, r5]
 800430a:	1cc4      	adds	r4, r0, #3
 800430c:	f36f 249f 	bfc	r4, #10, #22
 8004310:	794d      	ldrb	r5, [r1, #5]
 8004312:	f891 8009 	ldrb.w	r8, [r1, #9]
 8004316:	551a      	strb	r2, [r3, r4]
 8004318:	1d02      	adds	r2, r0, #4
 800431a:	f36f 229f 	bfc	r2, #10, #22
 800431e:	7a0c      	ldrb	r4, [r1, #8]
 8004320:	f803 e002 	strb.w	lr, [r3, r2]
 8004324:	1d42      	adds	r2, r0, #5
 8004326:	f36f 229f 	bfc	r2, #10, #22
 800432a:	549d      	strb	r5, [r3, r2]
 800432c:	1d82      	adds	r2, r0, #6
 800432e:	f36f 229f 	bfc	r2, #10, #22
 8004332:	798d      	ldrb	r5, [r1, #6]
 8004334:	549d      	strb	r5, [r3, r2]
 8004336:	1dc2      	adds	r2, r0, #7
 8004338:	f36f 229f 	bfc	r2, #10, #22
 800433c:	79cd      	ldrb	r5, [r1, #7]
 800433e:	549d      	strb	r5, [r3, r2]
 8004340:	f100 0208 	add.w	r2, r0, #8
 8004344:	f36f 229f 	bfc	r2, #10, #22
 8004348:	549c      	strb	r4, [r3, r2]
 800434a:	f100 020a 	add.w	r2, r0, #10
 800434e:	f36f 229f 	bfc	r2, #10, #22
 8004352:	f803 8006 	strb.w	r8, [r3, r6]
 8004356:	7a8e      	ldrb	r6, [r1, #10]
 8004358:	549e      	strb	r6, [r3, r2]
 800435a:	f100 020b 	add.w	r2, r0, #11
 800435e:	f36f 229f 	bfc	r2, #10, #22
 8004362:	7ace      	ldrb	r6, [r1, #11]
 8004364:	549e      	strb	r6, [r3, r2]
 8004366:	f100 020c 	add.w	r2, r0, #12
 800436a:	f36f 229f 	bfc	r2, #10, #22
 800436e:	7b0e      	ldrb	r6, [r1, #12]
 8004370:	549e      	strb	r6, [r3, r2]
 8004372:	f100 020d 	add.w	r2, r0, #13
 8004376:	f36f 229f 	bfc	r2, #10, #22
 800437a:	7b4e      	ldrb	r6, [r1, #13]
 800437c:	549e      	strb	r6, [r3, r2]
 800437e:	f100 020e 	add.w	r2, r0, #14
 8004382:	f36f 229f 	bfc	r2, #10, #22
 8004386:	7b8e      	ldrb	r6, [r1, #14]
 8004388:	549e      	strb	r6, [r3, r2]
 800438a:	f100 020f 	add.w	r2, r0, #15
 800438e:	3010      	adds	r0, #16
 8004390:	f36f 229f 	bfc	r2, #10, #22
 8004394:	7bce      	ldrb	r6, [r1, #15]
 8004396:	549e      	strb	r6, [r3, r2]
 8004398:	f101 0210 	add.w	r2, r1, #16
 800439c:	4562      	cmp	r2, ip
 800439e:	d19b      	bne.n	80042d8 <_ZN16display_firmware18__cortex_m_rt_main17hdabc9134f3bbe990E+0x104c>

birkenfeld avatar Jul 11 '21 14:07 birkenfeld