Performance of `overflowing_add`/`overflowing_sub` in comparison to `adc`/`sbb`
Within a project I'm working on, I noted notably better overall performance (10-20%) using overflowing_add/overflowing_sub instead of adc/sbb. I presume this is due to being able to constrain the over/underflow to only being a single bit, and not any value within the 64-bit range. Obviously, this comes with the trade-off of not having RustCrypto's efforts to ensure the method compiles to a constant-time set of instructions however.
As discussed here https://github.com/RustCrypto/crypto-bigint/issues/418#issuecomment-1858699908, these methods appear to be constant-time, and my own brief Godbolt doesn't seem to raise any concerns at this time: https://godbolt.org/z/3bPchf93x
I'd like to question if RustCrypto should offer an overflowing_add/overflowing_sub to accelerate use-cases where the carry is so constrainable, albeit deferring to the developer to be correct. Within RustCrypto, this would accelerate the Uint addition functions, assuming my casual observations are replicable.
It would incur a maintenance burden onto RustCrypto, and if the current functions end up compiling as variable-time, may provide little/no benefit overall as alternative pure-Rust implementations may lose any efficiency gains compared to the existing adc/sbb functions. I would be open to experimenting with a fork of crypto-bigint which does offer such functions, but before I did so, I wanted to ask if this would be outright rejected or if it had sufficient benefit, would be considered.
Do you have a concrete code example of where carrying_add nee adc/borrowing_sub nee sbb aren't lowering to the relevant hardware intrinsics?
I have a benchmark where a binary GCD function performs ~10% better after the replacement. I'm asking for the repository to be made available now so I can cite it in a couple of places, and will follow up here if so, and otherwise make a MRE.
I'll clarify I still had to specify multiple calls to overflowing_add. My presumption is the compiler is optimizing around the explicit context the top 63-bits of the 128-bit addition is discarded, and only a single bit matters. Obviously, the addition of three u64s (promoted to u128) may have the 66th bit set (((u64::MAX - 1) * 3) >> 65 == 1), so explicitly only regarding the 65th bit (representable as a boolean) is probably the context required for the code generation.
The immediate question can still be raised on if this would be 10% faster (which would involve an example demonstrating, I'd fork the crypto-bigint library and start work on moving over where appropriate to reproduce within this context and demonstrate), if the maintenance burden would be accepted or if that wouldn't be of interest.
I'd be curious to look at the assembly if you have a code example. carrying_add/add and borrowing_sub/sbb are particularly useful on Limb as part of a carry chain, and in the past I've validated they successfully lower to the x86 adc/sbb instructions.
That said, for higher-level functions on Uint something like overflowing_add/overflowing_sub sounds fine, particularly if you can validate it provides better performance.