gnupru
gnupru copied to clipboard
help with program-memory-overflow
Hi @dinuxbg,
this is less a bugreport, more like a support question. We currently port our firmware and hoped GCC would produce more efficient code. It worked before for us, but one PRU-codebase is overflowing IMEM by 33 % while LTO and -Os is enabled. It fits and works when compiling it with CGT.
The project uses c99, no float, but u64-math on the overflowing PRU0. u64-math is also the reason we want to switch to GCC, as the CGT shows some flaws there.
We tried enabling -mabi=ti to get closer to CGT-behaviour (especially the smaller pointers), but the linker complained about undefined references to memcpy and memset (which our code-base does not directly use).
Do you have any hints how to solve or approach this?
I read through the gnupru github issues and large parts of the gcc documentation and even started comparing parts of the assembly from both compilers but found no hints yet.
Thank you so much for the report.
I hacked gnuprumcu
to increase the PRU0 IMEM size, so that I can analyse your codebase.
A few observations:
- In
mabi=ti
mode the C std library is not available. This is a known limitation. -
mul64
should not be necessary. GCC should use highly optimized 64-bit multiplication and not loops. - The
main
is the largest function reported by nm:
$ pru-nm --size-sort --print-size out/pru-core0.elf | grep -w '[Tt]'
I'll take a closer look into the generated code.
Thank you for your quick and positive reply!
You seem to appreciate a challenge :)
Your hack is dirty but helpful. I tried the dedicated gcc-flags for keeping the defective output without success.
I think we have to disable inlining to find the culprit - most FNs are only used once.
BTW: in the meantime i replaced uint64_t
with uint32_t
out of curiosity and the code compiles without the hack:
pru-nm --size-sort --print-size out/pru-core0.elf | grep -w '[Tt]'
20000138 0000001c t add32
20000138 0000001c t add64
20000538 00000034 t div_uV_n4
20000154 0000004c t cal_conv_adc_raw_to_uV
200004ec 0000004c t cal_conv_uV_to_dac_raw
2000056c 00000054 t ads8691_init
20000354 00000054 t send_message.constprop.0.isra.0
200002e0 00000074 t send_status.constprop.0
20000474 00000078 t mul64
200001a0 00000080 t harvester_initialize
20000220 000000c0 t harvest_iv_cv
200003a8 000000cc t dac8562_init
200005c0 00001778 T main
By using -fno-inline
and removing some small FNs due to 300 byte overflow i get:
[very small objects omitted]
20000198 0000000c t calibration_initialize
200004e4 0000000c t get_V_intermediate_raw
200004d8 0000000c t get_V_intermediate_uV
20000504 0000000c t get_state_log_intermediate
2000018c 0000000c t simple_mutex_exit
20000168 00000010 t iep_get_cnt_val
20000138 00000010 t iep_get_tmr_cmp_sts
200004f0 00000014 t get_I_mid_out_nA
20000178 00000014 t simple_mutex_enter
200002a0 00000014 t sub32
200002a0 00000014 t sub64
20000cd8 00000018 t dac8562_init.constprop.0
200001b4 00000018 t mul32
200001b4 00000018 t mul64
200009d8 00000018 t ring_init.constprop.0
20000148 00000020 t iep_clear_evt_cmp
200009f0 00000020 t set_batok_pin.constprop.0
20000ca8 00000030 t sample_dbg_adc
20000304 00000034 t div_uV_n4
20001994 00000044 t get_output_inv_efficiency_n4
20000ba8 00000044 t ring_get.constprop.0
20001bc0 00000044 t sample.constprop.0
200001cc 0000004c t cal_conv_uV_to_dac_raw
200002b4 00000050 t cal_conv_adc_raw_to_uV
20000b58 00000050 t ring_put.constprop.0
20000988 00000050 t sample_iv_harvester
20001624 00000054 t ads8691_init
20000dd0 00000054 t send_message.constprop.0.isra.0
20000bec 0000005c t sample_emu_ADCs
20000c48 00000060 t sample_hrv_ADCs
20001844 00000064 t get_input_efficiency_n8
200015c0 00000064 t sample_adc_harvester
20000d64 0000006c t receive_message.constprop.0
20000cf0 00000074 t send_status.constprop.0
20000220 00000080 t cal_conv_adc_raw_to_nA
20000510 00000080 t harvester_initialize
20000f7c 00000084 t sample_dbg_dac
2000110c 00000094 t harvest_adc_cv
200019d8 000000a0 t converter_calc_out_power
20000590 000000c0 t harvest_iv_cv
20000338 000000c4 t converter_initialize
200003fc 000000dc t converter_update_cap_storage
200018a8 000000ec t converter_calc_inp_power
20000888 00000100 t harvest_iv_mppt_opt
20000650 00000108 t harvest_iv_mppt_voc
20001000 0000010c t handle_kernel_com.constprop.0.isra.0
200012dc 00000120 t harvest_adc_mppt_voc
20000758 00000130 t harvest_iv_mppt_po
200011a0 0000013c t harvest_adc_ivcurve
20000a10 00000148 t converter_update_states_and_output.constprop.0
20001a78 00000148 t sample_emulator.constprop.0
20000e24 00000158 t handle_buffer_swap.constprop.0
20001e1c 000001c0 T main
200013fc 000001c4 t harvest_adc_mppt_po
20001678 000001cc t sample_init.constprop.0
20001c04 00000218 t event_loop.constprop.0
two more questions regarding mabi=ti
:
- do you see a size-benefit or is this approach silly?
- would it be possible to link against TIs libs?
CGT tells me what it uses:
automatic RTS selection: linking in "rtspruv3_le.lib" in
place of index library "libc.a"
BTW: in the meantime i replaced uint64_t with uint32_t out of curiosity and the code compiles without the hack:
FYI, I've been trying for a while to optimize 64-bit arithmetic, but haven't made much progress: https://github.com/dinuxbg/gnupru/issues/32
do you see a size-benefit or is this approach silly?
You loose the C library. And you gain a little bit of DRAM if you use function pointers. So I don't think it would help your case.
would it be possible to link against TIs libs?
Yes. See https://github.com/dinuxbg/gnupru/tree/master/testing/interop for some GCC and CGT interoperability tests.
I found two inefficiencies. I'll start working on them.
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106562
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106564
ok, so this issue #32 seems to be one part of the problem. I only have a vague idea, but guess there is more to it, right? When comparing sizes the u32-modded versions of u64-functions where half or 1/3 the size. There are mostly shifting-Ops inside - arithmetics are mostly done in the dedicated overflow-safe Fns. And another strange observation is that -fno-inline
actually shrinks codesize here (overflow is reduced from 2672 to 2276 bytes).
Allright - your second comment just came in and confirms my theory.
Is there anything i can do to help with this?
Otherwise I would look into dividing our codebase into the two subsystems. But timing-constraints were tough already. Probably GCC won't help us here for now. But we keep this solution in our sight.
Thanks again for your quick and helpful support!
And another strange observation is that -fno-inline actually shrinks codesize here (overflow is reduced from 2672 to 2276 bytes).
Yes, excessive register pressure might cause that. I'll check the GCC insn costs for PRU backend.
Is there anything i can do to help with this?
Unfortunately there is no mitigation from firmware side. But please file more github issues if you notice other missed optimizations.
Condensed test cases like the two bugzilla reports above are appreciated, but full firmware source is the next best thing.
Thanks, Dimitar
With GCC trunk e95e91eccd, code size is down to 9352 bytes (overflow is 1160).
I'm continuing the analysis for other areas to improve.
And with LTO enabled it is even better:
$ pru-size out/pru-core0.elf
text data bss dec hex filename
8720 176 824 9720 25f8 out/pru-core0.elf
I thank you so much for this!
I also have LTO enabled on my current dev-branch and you seem to have decreased the overflow by more than 2 kByte!
text data bss dec hex filename
10784 176 824 11784 2e08 gen_gcc/pru0-shepherd-fw.elf
Adding -fno-inline
should also shave off ~ 400 byte.
Do you know an option to see the these size-statistics for the CGT-binary? i looked through the ti-docs and found nothing so far.
Do you know an option to see the these size-statistics for the CGT-binary? i looked through the ti-docs and found nothing so far.
Both CGT and GNU use ELF file format. There are some small differences in how the two toolchains handle the ELF format, but they mainly affect the process of the final program linking. You should be able to invoke pru-size
and pru-nm
on the CGT-generated binary.
Two suggestions which would save a few more bytes:
- You could remove
max_value
andmin_value
since those do not seem to be used. - It's more efficient to declare
msb_position
as inline function than a callable assembler function. Example:
static inline uint32_t msb_position(uint32_t value)
{
uint32_t ret;
asm volatile (
"lmbd %[datain], %[dataout], 1\n\t"
: [dataout] "=r" (ret)
: [datain] "r" (value)
);
return ret;
}
This unresolved middle-end issue also affects the code quality of this particular firmware program: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81357