Native support for AVR instrinsics
At the moment compiler-builtins isn't useful for AVR, because that platform uses a custom calling convention for intrinsics - supporting AVR here would require:
- Implementing this custom calling convention in LLVM.
- Exposing it in rustc.
- Finally, using it within compiler-builtins.
There are no ongoing plans to do so yet, but over https://github.com/rust-lang/rust/pull/131651 it was pointed out it would be nice to have some kind of message of this issue left, so here we go 🙂
Thanks for filing this. I don't think we are blocked on LLVM having support for the calling convention, instead we should be able to use naked functions to translate the cc like we do for aeabi intrinsics https://github.com/rust-lang/compiler-builtins/blob/cb060052ab7e4bad408c85d44be7e60096e93e38/src/arm.rs#L21-L35.
I'm not positive how this would interact with the clobbers, however.
Oh, sure, that seems like a good idea 🙂
Separately you may be interested in getting AVR asm to stable. This isn't a blocker for this crate but it may be useful in user code https://github.com/rust-lang/rust/issues/93335
Also, do you know if there is a version of these symbols in assembly that don't come from libgcc? Some of these are probably easy enough that we could provide our own assembly implementations, but we need to be careful with licensing.
Also, do you know if there is a version of these symbols in assembly that don't come from libgcc?
I'm aware only of the libgcc implementation, unfortunately.
After https://github.com/rust-lang/rust/pull/131651 merges it should be possible to fix the AVRTiny asm convention, at which point I think we will be okay to start using AVR assembly in this repo. The #[naked] wrapper to convert the ABI and forward to our implementations is probably still the best option.
@Patryk27 do you happen to know which intrinsics we are actually missing? Testing with the below:
#![no_std]
#[unsafe(no_mangle)]
pub fn call_umulhisi3(a: &u16, b: &u16, r: &mut u32) {
*r = *a as u32 * *b as u32;
}
#[unsafe(no_mangle)]
pub fn call_mulhisi3(a: &i16, b: &i16, r: &mut i32) {
*r = *a as i32 * *b as i32;
}
#[unsafe(no_mangle)]
pub fn call_udivmodqi4(a: &u8, b: &u8, r1: &mut u8, r2: &mut u8) {
(*r1, *r2) = (*a / *b, *a % *b);
}
#[unsafe(no_mangle)]
pub fn call_divmodqi4(a: &i8, b: &i8, r1: &mut i8, r2: &mut i8) {
(*r1, *r2) = (*a / *b, *a % *b);
}
#[unsafe(no_mangle)]
pub fn call_udivmodhi4(a: &u16, b: &u16, r1: &mut u16, r2: &mut u16) {
(*r1, *r2) = (*a / *b, *a % *b);
}
#[unsafe(no_mangle)]
pub fn call_divmodhi4(a: &i16, b: &i16, r1: &mut i16, r2: &mut i16) {
(*r1, *r2) = (*a / *b, *a % *b);
}
On the atmega328p target-cpu (unsure whether this makes a difference), the only intrinsics from the list at https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention are the 8- and 16-bit integer division __u?divmod[hq]i4. For the multiplication, LLVM seems to widen the integers to u32 and call the default builtin __mulsi3 which I don't think would have the special ABI. Do you typically see more missing symbols (beyond small int division) when you try linking projects without the AVR libgcc?
Status: was a little bit busy previous week, I'll take a look at those intrinsics and your example in a couple of days 🙂
You're right - it seems that the only Funny Intrinsics™ used at the moment are __udivmodqi4, __divmodqi4, __udivmodhi4, and __divmodhi4.
I have tried to summon __umulhisi3 with:
#[inline(never)]
pub fn fun(a: &u16, b: &u16, x: &mut u16, y: &mut u16) {
(*x, *y) = a.widening_mul(*b);
}
... but even this causes LLVM to go with 16b->32b extension followed by 32b x 32b multiplication (aka __mulsi3).
All of those observations match the intrinsics actually hardcoded into the codegen:
https://github.com/llvm/llvm-project/blob/c979ce7e362908a361cee721d2a6db4bcd65be1f/llvm/lib/Target/AVR/AVRISelLowering.cpp#L201
... which confirms that we'd only have to provide __u?divmod[hq]i4.
Curiously enough, some time ago I thought that ABI conflicts were the reason why f32 is broken on AVR:
https://github.com/rust-lang/compiler-builtins/pull/527
... but since they don't have special ABI, it must be a codegen bug, actually (this particular investigation will follow on https://github.com/Rahix/avr-hal/discussions/641).
Alright, there is another kind of ABI mismatch happening, though!
Context:
- https://github.com/Rahix/avr-hal/discussions/641#discussioncomment-12502991
- https://github.com/Rahix/avr-hal/discussions/641#discussioncomment-12508954
tl;dr intrinsics like those:
https://github.com/rust-lang/compiler-builtins/blob/7bec089672eb5cd83d7902edd59479527bc9d8d1/src/float/cmp.rs#L114
... should (most likely?) return c_int instead of i32, as one some platforms (AVR notably) those are actually (expected to be) 16-bit numbers.
Edit: actually, it's not even c_int - that's how LLVM defines it:
- https://github.com/llvm/llvm-project/blob/1e6ba3cd2fe96be00b6ed6ba28b3d9f9271d784d/compiler-rt/lib/builtins/fp_compare_impl.inc#L22
- https://github.com/llvm/llvm-project/blob/1e6ba3cd2fe96be00b6ed6ba28b3d9f9271d784d/compiler-rt/lib/builtins/comparedf2.c#L58
Should be easy to implement with a regular cfg-guarded type alias.
It seems there's an ABI difference in regards to 32-bit division and remainder as well.
Instead of compiler-builtins' fn divrem(T, T, &mut Option<T>) -> T or even GCC's fn divrem(T, T, bool) -> T, AVR expects fn divrem(T, T) -> (T, T) (even for i32/u32) - it's not written down in GCC's docs, but that's what the codegen says:
https://github.com/llvm/llvm-project/blob/1e6ba3cd2fe96be00b6ed6ba28b3d9f9271d784d/llvm/lib/Target/AVR/AVRISelLowering.cpp#L572
... and that's what also came out when I was running avr-tester's random-math test-suite on compiler-builtins without any #[avr_skip] (i.e. it printed trash for division and remainder, which prompted me to investigate those ops).
Now that I have a deeper understanding of how the pieces come together, I see a light out of the tunnel - I'm preparing a pull request that gets rid of all/most #[avr_skip]s, we'll see how it goes.
Ok, it seems that implementing __udivmodsi4 this way:
#[cfg(target_arch = "avr")]
intrinsics! {
#[maybe_use_optimized_c_shim]
pub extern "C" fn __udivmodsi4(n: u32, d: u32) -> u64 {
let (div, rem) = u32_div_rem(n, d);
((rem as u64) << 32) | (div as u64)
}
}
... yields a working intrinsic!
Unfortunately we can't go with the simpler -> (u32, u32), because that forces a stack allocation instead of passing stuff through registers, but the bit-shifting approach is good enough - LLVM seems to correctly understand it as mostly a regalloc-hinting thingie.
I'm experimenting with __udivmodqi4 and get the same code gen using naked_asm with the gcc implementation.
#[naked]
pub unsafe extern "C" fn __udivmodqi4() {
// r25: remainder
// r24: dividend, quotient
// r22: divisor
// r23: loop counter
core::arch::naked_asm!(
"clr r25", // clear remainder
"ldi r23, 8", // init loop counter
"lsl r24", // shift dividend
// brne target
"rol r25", // shift dividend into remainder
"cp r25, r22", // compare remainder & divisor
"brcs 2", // remainder <= divisor
"sub r25, r22", // restore remainder
// brcs target
"rol r24", // shift dividend (with CARRY)
"dec r23", // decrement loop counter
"brne -14",
"com r24", // complement result (C flag was complemented in loop)
"ret",
);
}
naked_asm errors if you try to use non-integer loop labels, so I had to hard-code the offsets.
I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?
I assume we'd need to implement the calling convention somewhere to allow us to pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }
naked_asmerrors if you try to use non-integer loop labels, so I had to hard-code the offsets.
Integer labels should work here through right? So brne target gets 1: and brcs target gets 2:, the jumps become brcs 2f and brne 1b.
I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?
I assume we'd need to implement the calling convention somewhere to allow us to
pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }
It is possible to add compiler support for this ABI and implement it that way. That probably doesn't have to happen though - it's only needed a couple of places in this crate and likely nowhere else in the ecosystem, so lang/compiler support would be pretty heavy handed.
One option is to "trampoline" ABIs by using naked_asm! to interface between the custom ABI and whatever extern "C" is then call A rust-written function, like we do around here https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/arm.rs#L23-L77. The routine is so short though that I assume the overhead wouldn't be worth it.
So, feel free to submit a PR adding what you have there :) we don't need to add everything at once. edit: you beat me to it
I want to mention that I've known that the assembly for my division functions would blow up into an absolute monstrosity the more levels of functions there are and the smaller the registers are. I oriented around delegating to half size divisions to make a larger division, but LLVM would inline nest things which would tend to quadratic code complexity. The way I would do it for 8 and maybe 16 bit microcontrollers, given enough time, is to use a common set of byte-slice-based functions for addition, multiplication, etc. The same division function can be used for everything from u16 to u128 by transmuting &uX to &[u8] and &mut uX to &mut [u8], then giving these to the common functions. This would give the smallest binary and stack size which is really the priority on these small devices. This is actually what my awint crate does, in fact it has special casing to use u8 digits on AVR to make it more than viable. However, in the case of division specifically, good 'ole binary division is the best approach considering that multiplication has do be done bit by bit anyway. https://github.com/rust-lang/compiler-builtins/blob/9978a8b06b7c1b53a6c503a2bfe7aea9ba6ca98b/compiler-builtins/src/int/specialized_div_rem/binary_long.rs#L54 for example. Probably, the one good optimization to make is that there should be combined lhs += rhs << shift and lhs -= rhs << shift functions for implementing multiplication and division. You can brute-force test smaller cases to make sure they are correct, and rely on the fuzzing framework to check the rest.
For my understanding @Patryk27 @tones111, do you have any idea if more complex intrinsics tend to work on AVR since https://github.com/rust-lang/compiler-builtins/pull/791 should be in recent nightlies? Specifically ops involving i128, f128, and f16 which all tend to make ABI edge cases show up.
The unsigned division intrinsics have merged!
The signed equivalents are more complicated as the semantics differs between languages. The book "Hacker's Delight" (2nd edition, chapter 9) describes three different implementations (truncating, modulus, or floor) and lists some languages that have made differing choices.
From what I can find the Rust reference says
- "integer division rounds towards zero" and "remainder defined with truncating division
- "Rust uses a remainder defined with truncating division. Given
remainder = dividend % divisor, the remainder will have the same sign as the dividend.
Does the intrinsic need to implement Rust's division semantics or something else?
Also confusing is that Rust's Div trait says
"This operation will panic if other == 0 or the division results in overflow", however, it appears to give results for u8::MIN, which I would have expected to overflow. (playground)
Does the intrinsic need to implement Rust's division semantics or something else?
Usually language-specific semantics are implemented in the standard library, the builtins are closer to emulating behavior on real hardware As an example, the division routines are allowed to create UB if the divisor is 0.
The backends pick the actual semantics here, but LLVM and Cranelift almost always match GCC. It should act the same as __divmodsi4 and the other sizes that we have already.
Also confusing is that Rust's Div trait says "This operation will panic if
other == 0or the division results in overflow", however, it appears to give results foru8::MIN, which I would have expected to overflow. (playground)
The linked example is i8, -128/-128 = 1 should be correct. Replacing i8 with u8 does panic with 0/0 as expected.
After staring at __divmodqi4 for far too long I'm struggling to come up with anything sufficiently different from the gcc implementation.
Everything I've seen says to use an unsigned division and modify the sign of the results based on the sign of the inputs. Mathematically the gcc implementation is as straightforward as it gets.
- compute and store the sign bits
- convert to unsigned values
- perform unsigned division
- apply sign bits
At the cost of a few extra instructions I've packed the remainder sign into R0 instead of the T flag. There's no alternative algorithm for negation (implemented in the ALU). I'm unsure how to proceed.
pub unsafe extern "C" fn __divmodqi4() {
// compute signed 8-bit `n / d` and `n % d`.
//
// Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
// Inputs:
// R24: dividend
// R22: divisor
// Outputs:
// R24: quotient (dividend / divisor)
// R25: remainder (dividend % divisor)
// Clobbers:
// R23: loop counter
// R0: sign bits
// T: unused
core::arch::naked_asm!(
// This assembly routine adjusts the inputs to perform an unsigned division.
// The quotient is negative when the dividend and divisor signs differ.
// The remainder has the same sign as the dividend.
"mov R0, R22",
"eor R0, R24", // R0.7 is the quotient sign
"cbr R0, 0", // R0.0 is the remainder sign
"tst R24",
"brpos 1f", // if dividend is negative
"neg R24", // negate to a positive value
"sbr R0, 0", // set remainder sign
"1:",
"sbrc R22, 7", // if divisor is negative
"neg R22", // negate to a positive value
"call __udivmodqi4", // perform unsigned division
// R24 = quotient, R25 = remainder
"sbrc R0, 7", // apply quotient sign
"neg R24",
"sbrc R0, 0", // apply remainder sign
"neg R25",
"ret"
);
}
I'm not a lawyer, but I don't think we have to be 100% different from GCC, it's mostly about "was the code copy-pasted verbatim" etc.; I'd guess that what you propose - assuming the algorithm is alright, haven't checked this one - is good.
@tgross35?
IIUC even beyond avoiding copy+paste you can't take inspiration from the source, and that's the part that matters rather than how different it is. Which is tough when you have read the source, I don't know what the safe way out of this is.
@joshtriplett is much more the expert here than me
It's might also be feasible to put all of this into a new file that is GPL and say that on AVR only you will be using GPL sources (already the case for any users that are linking GCC's runtime libs), which is nice because it would let you take the exact implementation. The downside is of course how the licensing changes propagate everywhere, I'm not sure what that would do to rust-lang/rust.
The approach given is the only way I've ever known to do signed division in the many implementations I've made over the years. I don't think there is any licensing issue as long as you wrote the assembly from scratch. The only special cases are related to the iX::MIN / -1 case and the division-by-zero cases. For iX::MIN / -1, the right answer is iX::MIN which is given by using wrapping operations, which should be automatically given by using wrapping operations in the routine above (unless doing saturating or checked division in which case you have to special case it). For the division-by-zero cases, you probably want to implement the RISC-V behavior from https://github.com/rust-lang/compiler-builtins/issues/485 . https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/m.html .