toolchain icon indicating copy to clipboard operation
toolchain copied to clipboard

Unexpected code is generated for __attribute__((uncached)) types

Open petrokarashchenko opened this issue 4 years ago • 14 comments

I have next code snippet

typedef volatile struct type1_s {
    uint32_t a1;
    uint8_t a2;
    uint8_t a3;
    uint8_t a4;
    uint8_t a5;
} __attribute__((packed,aligned(1),uncached)) type1_t;

typedef volatile struct {
    uint32_t b1;
    uint32_t b2;
} __attribute__((packed,aligned(1),uncached)) type2_t;

typedef volatile struct type3_s {
    type1_t h1;
    volatile union {
        uint8_t b[24];
        type2_t c1;
    } __attribute__((packed,aligned(1),uncached)) h2;
} __attribute__((packed,aligned(1),uncached)) type3_t;

typedef volatile struct type4_s {
    uint32_t x1;
    uint8_t x2;
    uint16_t x3;
    uint8_t x4;
    uint8_t x5;
    uint8_t x6;
} __attribute__((packed,aligned(1),uncached)) type4_t;

int my_func2(volatile __attribute__((uncached)) uint8_t *data, uint32_t size, int opt);

int my_func1(type3_t *p0, type4_t *p1)
{
    volatile __attribute__((uncached)) uint8_t *a0 = (volatile __attribute__((uncached)) uint8_t *)p0 + sizeof(type1_t);
    volatile __attribute__((uncached)) uint8_t *a1 = (volatile __attribute__((uncached)) uint8_t *)p0 + sizeof(type3_t);
    type2_t *b0 = (type2_t *)a0;
    p1->x6 = my_func2(a1, b0->b2, 65537) ? 0 : 1;
    return 0;
}

I compile it with next options with compiler based on 2020.03 branch:

CFLAGS   = -mcpu=arc600 -mtune=arc600 -mbig-endian -mmul64 -mbarrel-shifter
CFLAGS  += -ffunction-sections -fdata-sections -fno-common -fno-strict-aliasing -c -Os
CFLAGS  += -Wa,-adhlns="[email protected]"

I'm getting next output:

  10               		.section	.text.my_func1,"ax",@progbits
  11               		.align 4
  12               		.global	my_func1
  14               	my_func1:
  15 0000 C0F1      		push_s	blink
  16 0002 C5E1      		push_s	r13
  17 0004 884C      		ldb_s r2,[r0,12]
  18 0006 886D      		ldb_s r3,[r0,13]
  19 0008 7528      		mov_s	r13,r1	;0
  20 000a BA18      		asl_s r2,r2,24
  21 000c 7B6F      		extb_s r3,r3
  22 000e BB10      		asl_s r3,r3,16
  23 0010 7A65      		or_s	r2,r2,r3
  24 0012 886E      		ldb_s r3,[r0,14]
  25 0014 882F      		ldb_s r1,[r0,15]
  26 0016 E020      		add_s r0,r0,32 ;1
  27 0018 7B6F      		extb_s r3,r3
  28 001a BB08      		asl_s r3,r3,8
  29 001c 7B45      		or_s	r3,r3,r2
  30 001e 792F      		extb_s r1,r1
  31 0020 220A 0F80 		mov	r2,65537	;14
  31      0001 0001 
  32 0028 7965      		or_s	r1,r1,r3
  33 002a 2022 0F80 		jl @my_func2
  33      0000 0000 
  34 0032 780B      		tst_s r0,r0
  35 0034 D801      		mov_s	r0,1	;0
  36 0036 78C0      		sub_s.ne	r0,r0,r0
  37 0038 1D09 1022 		stb.di	r0,[r13,9]
  38 003c D800      		mov_s	r0,0	;0
  39 003e C5C1      		pop_s	r13
  40 0040 C0D1      		pop_s	blink
  41 0042 7EE0      		j_s	[blink]

Here I'm referring to second parameter passed to my_func2 call. Seems the b0->b2 access does not bypass cache

petrokarashchenko avatar Mar 26 '20 21:03 petrokarashchenko

The excess code is due to aligned property of accesses in ARC600 processor. Also your structures are byte aligned which means that all your accesses are done via ldb or stb, which doesn't really help. You can quickly see that using -munaligned-access generates better code. Of course loosing .di is a bug, which seems located in the part of the compiler which breaks down the 32-bit access to 8-bit accesses. What you can do, you can try either using only 8-bit members in your structures, and assemble them latter on, or relax the structure align property.

claziss avatar Mar 27 '20 10:03 claziss

@claziss the relaxing of align property does not help. Actually I think for GCC packed means aligned(1). At least when I change all __attribute__((packed,aligned(1),uncached)) to __attribute__((packed,uncached)) I'm getting same code generated.

  10               		.section	.text.my_func1,"ax",@progbits
  11               		.align 4
  12               		.global	my_func1
  14               	my_func1:
  15 0000 C0F1      		push_s	blink
  16 0002 C5E1      		push_s	r13
  17 0004 884C      		ldb_s r2,[r0,12]
  18 0006 886D      		ldb_s r3,[r0,13]
  19 0008 7528      		mov_s	r13,r1	;0
  20 000a BA18      		asl_s r2,r2,24
  21 000c 7B6F      		extb_s r3,r3
  22 000e BB10      		asl_s r3,r3,16
  23 0010 7A65      		or_s	r2,r2,r3
  24 0012 886E      		ldb_s r3,[r0,14]
  25 0014 882F      		ldb_s r1,[r0,15]
  26 0016 E020      		add_s r0,r0,32 ;1
  27 0018 7B6F      		extb_s r3,r3
  28 001a BB08      		asl_s r3,r3,8
  29 001c 7B45      		or_s	r3,r3,r2
  30 001e 792F      		extb_s r1,r1
  31 0020 220A 0F80 		mov	r2,65537	;14
  31      0001 0001 
  32 0028 7965      		or_s	r1,r1,r3
  33 002a 2022 0F80 		jl @my_func2
  33      0000 0000 
  34 0032 780B      		tst_s r0,r0
  35 0034 D801      		mov_s	r0,1	;0
  36 0036 78C0      		sub_s.ne	r0,r0,r0
  37 0038 1D09 1022 		stb.di	r0,[r13,9]
  38 003c D800      		mov_s	r0,0	;0
  39 003e C5C1      		pop_s	r13
  40 0040 C0D1      		pop_s	blink
  41 0042 7EE0      		j_s	[blink]

I will try to experiment with -munaligned-access and feedback if it helps

petrokarashchenko avatar Mar 27 '20 10:03 petrokarashchenko

@claziss -munaligned-access seems to generate better code and it does not drop .di suffix. But seems ARC600 does not support unaligned access. So I will need to wait for a fix with 32 bit value assembling. Also thank you for proposed workaround with 8 bit access and assembling by myself explicitly

petrokarashchenko avatar Mar 27 '20 10:03 petrokarashchenko

Alternatively, you can use -mno-volatile-cache option with your particular function to generate .di accesses whenever a variable is volatile. This can be used as a work-around for the time being.

claziss avatar Apr 01 '20 07:04 claziss

@claziss Thank you. Actually applying -mno-volatile-cache helps and I'm getting my code working as expected. I have few volatile variables that are shared between main thread and ISR and do not require .di access, but getting .di for those is a valuable trade-off for getting other parts of code running as expected. Looking forward to get issue fix in one of next releases, so I can drop -mno-volatile-cache in future.

petrokarashchenko avatar Apr 01 '20 09:04 petrokarashchenko

Unfortunately usage of -mno-volatile-cache brings new issue. As I have mentioned I have few volatile variables that are cached, but with -mno-volatile-cache become uncached. All other cases that I'm using volatile __attribute__((uncached)) attributes I'm using with pointers to specially allocated "cache safe" (cache line size aligned for both address and memory chunk size) regions. So now after volatile variables become uncached, but address and size are not "cache safe". So now when cache sync is issued for a line that includes address of volatile variable I'm loosing a value of that variable.

petrokarashchenko avatar Apr 09 '20 11:04 petrokarashchenko

The workaround can be go thru the code and mark all volatile variables and place into a separate section and align section in linker script

petrokarashchenko avatar Apr 09 '20 11:04 petrokarashchenko

@claziss I'm wondering on how -mno-volatile-cache makes compiler to generate correct code. I was looking in the code and found that it seems to manipulates only with VOLATILE_CACHE_SET and it is only checked for compact_..._operand selection, but that influence inly on size optimisation. I can't find how -mno-volatile-cache works to generate .di instructions for volatile access.

petrokarashchenko avatar Apr 10 '20 21:04 petrokarashchenko

It just check if the variable is volatile, if so, the .di is generated.

claziss avatar Apr 11 '20 09:04 claziss

Some more reference cases. Somehow if the first member of a struct is accessed, then compiler generates correct code, but for all other members the code is incorrect:

#include <stdint.h>

typedef volatile struct {
    uint32_t a0;
    uint32_t a1;
} __attribute__((packed,uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0, p->a1);
}

with options: arceb-elf32-gcc -save-temps -Wall -Wextra -c -mcpu=arc600 -mtune=arc600 -mbig-endian -mmul64 test.c -O0 I'm getting output:

	.global	my_func1
	.type	my_func1, @function
my_func1:
	push_s	blink
	push_s	r13
	st.a	fp,[sp,-4]	;26
	mov	fp,sp	;5
	sub_s sp,sp,4 ;4
	st	r0,[fp,-4]	;26
	ld	r2,[fp,-4]	;23
	ldb.di	r3,[r2]
	extb_s r3,r3
	asl_s r3,r3,24
	ldb.di	r12,[r2,1]
	extb_s r12,r12
	asl_s r12,r12,16
	or	r3,r3,r12
	ldb.di	r12,[r2,2]
	extb_s r12,r12
	asl_s r12,r12,8
	or	r3,r3,r12
	ldb.di	r2,[r2,3]
	extb_s r2,r2
	or	r2,r2,r3
	mov	r13,r2	;5
	ld	r2,[fp,-4]	;23
	ldb_s r3,[r2,4]
	extb_s r3,r3
	asl_s r3,r3,24
	ldb_s r12,[r2,5]
	extb_s r12,r12
	asl_s r12,r12,16
	or	r3,r3,r12
	ldb_s r12,[r2,6]
	extb_s r12,r12
	asl_s r12,r12,8
	or	r3,r3,r12
	ldb_s r2,[r2,7]
	extb_s r2,r2
	or	r2,r2,r3
	mov	r1,r2	;5
	mov	r0,r13	;5
	bl @my_func2;1
	mov	r2,r0	;5
	mov	r0,r2	;5
	mov	sp,fp	;5
	ld.ab	fp,[sp,4]	;23
	ld	blink,[sp,4]	;23
	ld.ab	r13,[sp,8]	;23
	j_s	[blink]
	.size	my_func1, .-my_func1

petrokarashchenko avatar Apr 11 '20 11:04 petrokarashchenko

Also code:

typedef volatile struct {
    uint32_t a0[2];
} __attribute__((packed,uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0[0], p->a0[1]);
}

and code

typedef uint32_t __attribute__((aligned(1))) na_u32;
typedef volatile struct {
    na_u32 a0[2];
} __attribute__((uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0[0], p->a0[1]);
}

Generates same output

petrokarashchenko avatar Apr 11 '20 11:04 petrokarashchenko

@claziss I was able to narrow down the issue to a single line. It is located in gcc/emit-rtl.c file in adjust_address_1 method. packed attribute types member access is expanded into a set of MEM_REF's plus byte offsets. MEM_REF's of first 4 bytes inherit expression of original instructions, but if offset goes above 4, the expression is dropped, hence attributes are dropped as well. volatile access has MEM_VOLATILE_P() that is a memory attribute, but not expression attribute, so it is preserved. I have created https://github.com/foss-for-synopsys-dwc-arc-processors/gcc/pull/85 just to start a discussion and make all my finding visible. Unfortunately my understanding of GCC is not enough to analyse and generate a correct fix. But maybe this will give you some hint if this issue is still planned to be fixed in 2020.09

petrokarashchenko avatar Apr 25 '20 18:04 petrokarashchenko

@claziss is there any chance to get this issue fixed in 2020.09 release? Meanwhile, I've done some communication with the community trying to discuss the point that I found in adjust_address_1 implementation. I've sent a patch to GCC community and made some discussion. I pointed to how currently uncached attribute is processed using

 if (TREE_CODE (addr) == MEM_REF)
    {
      attrs = TYPE_ATTRIBUTES (TREE_TYPE (TREE_OPERAND (addr, 0)));
      if (lookup_attribute ("uncached", attrs))
        return true;
      attrs = TYPE_ATTRIBUTES (TREE_TYPE (TREE_OPERAND (addr, 1)));
      if (lookup_attribute ("uncached", attrs))
        return true;
    }

and got next comment from Richard Sandiford: Just a note - this looks very unsafe since the type of the address operand to the MEM_REF is quite arbitrary.

Hmm, the MEM_EXPR is used for optimization only and may not be the only way to carry information required for correctness. That said, dropping mem_attrs must be always correct.

This means you have to maintain the 'uncached' effect elsewhere.

@claziss @abrodkin can we discuss on how to get a reliable implementation of uncached attribute?

petrokarashchenko avatar Aug 31 '20 08:08 petrokarashchenko

Hi @petrokarashchenko sorry for not solving this in 2020.09 as this one doesn't seem as a low-hanging fruit. Moving to the next release with hope to spend more time on that looking forward.

abrodkin avatar Oct 22 '20 10:10 abrodkin