Unexpected code is generated for __attribute__((uncached)) types
I have the following code snippet:
typedef volatile struct type1_s {
    uint32_t a1;
    uint8_t a2;
    uint8_t a3;
    uint8_t a4;
    uint8_t a5;
} __attribute__((packed,aligned(1),uncached)) type1_t;

typedef volatile struct {
    uint32_t b1;
    uint32_t b2;
} __attribute__((packed,aligned(1),uncached)) type2_t;

typedef volatile struct type3_s {
    type1_t h1;
    volatile union {
        uint8_t b[24];
        type2_t c1;
    } __attribute__((packed,aligned(1),uncached)) h2;
} __attribute__((packed,aligned(1),uncached)) type3_t;

typedef volatile struct type4_s {
    uint32_t x1;
    uint8_t x2;
    uint16_t x3;
    uint8_t x4;
    uint8_t x5;
    uint8_t x6;
} __attribute__((packed,aligned(1),uncached)) type4_t;

int my_func2(volatile __attribute__((uncached)) uint8_t *data, uint32_t size, int opt);

int my_func1(type3_t *p0, type4_t *p1)
{
    volatile __attribute__((uncached)) uint8_t *a0 = (volatile __attribute__((uncached)) uint8_t *)p0 + sizeof(type1_t);
    volatile __attribute__((uncached)) uint8_t *a1 = (volatile __attribute__((uncached)) uint8_t *)p0 + sizeof(type3_t);
    type2_t *b0 = (type2_t *)a0;

    p1->x6 = my_func2(a1, b0->b2, 65537) ? 0 : 1;

    return 0;
}
I compile it with the following options, using a compiler based on the 2020.03 branch:
CFLAGS = -mcpu=arc600 -mtune=arc600 -mbig-endian -mmul64 -mbarrel-shifter
CFLAGS += -ffunction-sections -fdata-sections -fno-common -fno-strict-aliasing -c -Os
CFLAGS += -Wa,-adhlns="[email protected]"
I'm getting the following output:
10 .section .text.my_func1,"ax",@progbits
11 .align 4
12 .global my_func1
14 my_func1:
15 0000 C0F1 push_s blink
16 0002 C5E1 push_s r13
17 0004 884C ldb_s r2,[r0,12]
18 0006 886D ldb_s r3,[r0,13]
19 0008 7528 mov_s r13,r1 ;0
20 000a BA18 asl_s r2,r2,24
21 000c 7B6F extb_s r3,r3
22 000e BB10 asl_s r3,r3,16
23 0010 7A65 or_s r2,r2,r3
24 0012 886E ldb_s r3,[r0,14]
25 0014 882F ldb_s r1,[r0,15]
26 0016 E020 add_s r0,r0,32 ;1
27 0018 7B6F extb_s r3,r3
28 001a BB08 asl_s r3,r3,8
29 001c 7B45 or_s r3,r3,r2
30 001e 792F extb_s r1,r1
31 0020 220A 0F80 mov r2,65537 ;14
31 0001 0001
32 0028 7965 or_s r1,r1,r3
33 002a 2022 0F80 jl @my_func2
33 0000 0000
34 0032 780B tst_s r0,r0
35 0034 D801 mov_s r0,1 ;0
36 0036 78C0 sub_s.ne r0,r0,r0
37 0038 1D09 1022 stb.di r0,[r13,9]
38 003c D800 mov_s r0,0 ;0
39 003e C5C1 pop_s r13
40 0040 C0D1 pop_s blink
41 0042 7EE0 j_s [blink]
Here I'm referring to the second parameter passed to the my_func2 call. It seems the b0->b2 access does not bypass the cache.
The excess code is due to the aligned property of accesses on the ARC600 processor. Also, your structures are byte-aligned, which means that all your accesses are done via ldb or stb, which doesn't really help. You can quickly see that using -munaligned-access generates better code. Of course, losing .di is a bug, which seems to be located in the part of the compiler that breaks the 32-bit access down into 8-bit accesses.
What you can do: either use only 8-bit members in your structures and assemble the wider values later on, or relax the structure alignment property.
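The suggested 8-bit-members workaround can be sketched as follows. This is a hedged sketch, not the thread author's actual code: the UNCACHED macro is a placeholder that would expand to the ARC-specific __attribute__((uncached)) on the target toolchain, and is left empty here so the snippet builds anywhere.

```c
#include <stdint.h>

/* Assumption: on the ARC toolchain this would be defined as
 * __attribute__((uncached)); empty here for portability. */
#ifndef UNCACHED
#define UNCACHED
#endif

/* Assemble a big-endian 32-bit value from four explicit byte loads.
 * Because every access is already 8-bit, the compiler never has to
 * split a 32-bit load and so cannot drop the .di suffix in the process. */
static uint32_t load_be32(volatile UNCACHED const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
           (uint32_t)p[3];
}
```

With this helper, `b0->b2` from the reproducer would be read as `load_be32(a0 + 4)` instead of relying on the compiler's expansion of the packed 32-bit member access.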
@claziss relaxing the alignment property does not help. Actually, I think that for GCC, packed already implies aligned(1). At least when I change all __attribute__((packed,aligned(1),uncached)) to __attribute__((packed,uncached)) I get the same code generated:
10 .section .text.my_func1,"ax",@progbits
11 .align 4
12 .global my_func1
14 my_func1:
15 0000 C0F1 push_s blink
16 0002 C5E1 push_s r13
17 0004 884C ldb_s r2,[r0,12]
18 0006 886D ldb_s r3,[r0,13]
19 0008 7528 mov_s r13,r1 ;0
20 000a BA18 asl_s r2,r2,24
21 000c 7B6F extb_s r3,r3
22 000e BB10 asl_s r3,r3,16
23 0010 7A65 or_s r2,r2,r3
24 0012 886E ldb_s r3,[r0,14]
25 0014 882F ldb_s r1,[r0,15]
26 0016 E020 add_s r0,r0,32 ;1
27 0018 7B6F extb_s r3,r3
28 001a BB08 asl_s r3,r3,8
29 001c 7B45 or_s r3,r3,r2
30 001e 792F extb_s r1,r1
31 0020 220A 0F80 mov r2,65537 ;14
31 0001 0001
32 0028 7965 or_s r1,r1,r3
33 002a 2022 0F80 jl @my_func2
33 0000 0000
34 0032 780B tst_s r0,r0
35 0034 D801 mov_s r0,1 ;0
36 0036 78C0 sub_s.ne r0,r0,r0
37 0038 1D09 1022 stb.di r0,[r13,9]
38 003c D800 mov_s r0,0 ;0
39 003e C5C1 pop_s r13
40 0040 C0D1 pop_s blink
41 0042 7EE0 j_s [blink]
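The observation above, that packed already implies byte alignment so adding aligned(1) changes nothing, can be checked on any host compiler with static assertions (the ARC-specific uncached attribute is left out here since it does not affect layout):

```c
#include <stdint.h>

typedef struct {
    uint32_t a1;
    uint8_t  a2;
} __attribute__((packed)) p_t;

typedef struct {
    uint32_t a1;
    uint8_t  a2;
} __attribute__((packed, aligned(1))) pa_t;

/* packed alone already forces size 5 and alignment 1, so the
 * explicit aligned(1) is redundant and both layouts are identical. */
_Static_assert(sizeof(p_t) == 5 && sizeof(pa_t) == 5, "same size");
_Static_assert(_Alignof(p_t) == 1 && _Alignof(pa_t) == 1, "same alignment");
```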
I will try to experiment with -munaligned-access and report back if it helps.
@claziss -munaligned-access seems to generate better code and it does not drop the .di suffix. But it seems ARC600 does not support unaligned accesses, so I will need to wait for a fix for the 32-bit value assembling. Also, thank you for the proposed workaround of using 8-bit accesses and assembling the values myself explicitly.
Alternatively, you can use the -mno-volatile-cache option for your particular function to generate .di accesses whenever a variable is volatile. This can be used as a workaround for the time being.
@claziss Thank you. Applying -mno-volatile-cache helps, and my code now works as expected. I have a few volatile variables that are shared between the main thread and an ISR and do not require .di access, but getting .di for those is a fair trade-off for getting the rest of the code running as expected.
Looking forward to getting this issue fixed in one of the next releases, so I can drop -mno-volatile-cache in the future.
Unfortunately, the usage of -mno-volatile-cache brings a new issue. As I have mentioned, I have a few volatile variables that should stay cached, but with -mno-volatile-cache they become uncached. In all other cases I use volatile __attribute__((uncached)) with pointers to specially allocated "cache-safe" regions (both the address and the size of the memory chunk are aligned to the cache line size). Now these volatile variables become uncached, but their address and size are not "cache-safe". So when a cache sync is issued for a line that includes the address of such a volatile variable, I lose the value of that variable.
A workaround could be to go through the code, mark all volatile variables, place them into a separate section, and align that section in the linker script.
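The separate-section workaround described above could be sketched like this. Everything here is an assumption for illustration: ".uncached_data" is a hypothetical section name, 32 stands in for the actual cache line size, and the matching ALIGN() must also be applied to that output section in the linker script.

```c
#include <stdint.h>

/* Assumption: replace 32 with the real cache line size of the target. */
#define CACHE_LINE 32

/* Place each volatile variable in its own, cache-line-aligned section so
 * that a cache sync on a neighbouring line can never clobber it. The
 * ".uncached_data" output section must also be aligned (and padded) to
 * CACHE_LINE in the linker script. */
#define UNCACHED_VAR __attribute__((section(".uncached_data"), aligned(CACHE_LINE)))

volatile uint32_t shared_flag UNCACHED_VAR;
volatile uint32_t shared_counter UNCACHED_VAR;
```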
@claziss I'm wondering how -mno-volatile-cache makes the compiler generate correct code. I was looking through the code and found that it seems to manipulate only VOLATILE_CACHE_SET, which is only checked during compact_..._operand selection, but that influences only the size optimisation. I can't find how -mno-volatile-cache works to generate .di instructions for volatile accesses.
It just checks whether the variable is volatile; if so, the .di is generated.
Some more reference cases. Somehow, if the first member of a struct is accessed, the compiler generates correct code, but for all other members the code is incorrect:
#include <stdint.h>

typedef volatile struct {
    uint32_t a0;
    uint32_t a1;
} __attribute__((packed,uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0, p->a1);
}
With the options arceb-elf32-gcc -save-temps -Wall -Wextra -c -mcpu=arc600 -mtune=arc600 -mbig-endian -mmul64 test.c -O0 I'm getting the following output:
.global my_func1
.type my_func1, @function
my_func1:
push_s blink
push_s r13
st.a fp,[sp,-4] ;26
mov fp,sp ;5
sub_s sp,sp,4 ;4
st r0,[fp,-4] ;26
ld r2,[fp,-4] ;23
ldb.di r3,[r2]
extb_s r3,r3
asl_s r3,r3,24
ldb.di r12,[r2,1]
extb_s r12,r12
asl_s r12,r12,16
or r3,r3,r12
ldb.di r12,[r2,2]
extb_s r12,r12
asl_s r12,r12,8
or r3,r3,r12
ldb.di r2,[r2,3]
extb_s r2,r2
or r2,r2,r3
mov r13,r2 ;5
ld r2,[fp,-4] ;23
ldb_s r3,[r2,4]
extb_s r3,r3
asl_s r3,r3,24
ldb_s r12,[r2,5]
extb_s r12,r12
asl_s r12,r12,16
or r3,r3,r12
ldb_s r12,[r2,6]
extb_s r12,r12
asl_s r12,r12,8
or r3,r3,r12
ldb_s r2,[r2,7]
extb_s r2,r2
or r2,r2,r3
mov r1,r2 ;5
mov r0,r13 ;5
bl @my_func2;1
mov r2,r0 ;5
mov r0,r2 ;5
mov sp,fp ;5
ld.ab fp,[sp,4] ;23
ld blink,[sp,4] ;23
ld.ab r13,[sp,8] ;23
j_s [blink]
.size my_func1, .-my_func1
Also, this code:

typedef volatile struct {
    uint32_t a0[2];
} __attribute__((packed,uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0[0], p->a0[1]);
}
and this code:

typedef uint32_t __attribute__((aligned(1))) na_u32;

typedef volatile struct {
    na_u32 a0[2];
} __attribute__((uncached)) pu_st1;

int my_func2(uint32_t p0, uint32_t p1);

int my_func1(pu_st1 *p)
{
    return my_func2(p->a0[0], p->a0[1]);
}

both generate the same output.
@claziss I was able to narrow down the issue to a single place. It is located in the gcc/emit-rtl.c file, in the adjust_address_1 function. A member access on a packed-attribute type is expanded into a set of MEM_REFs plus byte offsets. The MEM_REFs for the first 4 bytes inherit the expression of the original instruction, but once the offset goes above 4 the expression is dropped, and hence the attributes are dropped as well. volatile access has MEM_VOLATILE_P(), which is a memory attribute rather than an expression attribute, so it is preserved. I have created https://github.com/foss-for-synopsys-dwc-arc-processors/gcc/pull/85 just to start a discussion and make all my findings visible.
Unfortunately, my understanding of GCC is not enough to analyse the issue and produce a correct fix, but maybe this will give you some hints if this issue is still planned to be fixed in 2020.09.
@claziss is there any chance of getting this issue fixed in the 2020.09 release?
Meanwhile, I've had some communication with the community, trying to discuss the point I found in the adjust_address_1 implementation. I've sent a patch to the GCC mailing list and had some discussion. I pointed to how the uncached attribute is currently processed, using
if (TREE_CODE (addr) == MEM_REF)
  {
    attrs = TYPE_ATTRIBUTES (TREE_TYPE (TREE_OPERAND (addr, 0)));
    if (lookup_attribute ("uncached", attrs))
      return true;
    attrs = TYPE_ATTRIBUTES (TREE_TYPE (TREE_OPERAND (addr, 1)));
    if (lookup_attribute ("uncached", attrs))
      return true;
  }
and got the following comment from Richard Sandiford: "Just a note - this looks very unsafe since the type of the address operand to the MEM_REF is quite arbitrary."
Hmm, the MEM_EXPR is used for optimization only and may not be the only way to carry information required for correctness. That said, dropping mem_attrs must be always correct.
This means you have to maintain the 'uncached' effect elsewhere.
@claziss @abrodkin can we discuss how to get a reliable implementation of the uncached attribute?
Hi @petrokarashchenko, sorry for not solving this in 2020.09, as this one doesn't seem to be low-hanging fruit. Moving it to the next release, with the hope of spending more time on it.