toolchain
Added mulhu and mulhs CRT routines
Added multiply-high signed/unsigned routines, which return the upper half of the full-width product. These can be used to optimize division by a constant. __smulhu is optimized; the rest are not yet well optimized. They use exactly the same calling convention as the regular multiplication routines. We can optimize these routines in later PRs.
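As a rough illustration of the division-by-constant use case, the sketch below models __smulhu in plain C and uses it to divide a 16-bit value by 10. The names smulhu_model, div10_u16 and check_div10_u16 are hypothetical and are not part of this PR. The magic constant 0xCCCD with an extra shift of 3 is the usual reciprocal for dividing a 16-bit unsigned value by 10; how the compiler would actually emit such a sequence is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of __smulhu: the high 16 bits of the
 * 32-bit product of two unsigned 16-bit operands. */
uint16_t smulhu_model(uint16_t x, uint16_t y) {
    return (uint16_t)(((uint32_t)x * (uint32_t)y) >> 16);
}

/* x / 10 without a divide: 0xCCCD == ceil(2^19 / 10), so
 * (x * 0xCCCD) >> 19 == x / 10 for every 16-bit x.
 * The multiply-high supplies the first 16 bits of the shift. */
uint16_t div10_u16(uint16_t x) {
    return (uint16_t)(smulhu_model(x, 0xCCCDu) >> 3);
}

/* Exhaustively compare against the ordinary division operator. */
void check_div10_u16(void) {
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        assert(div10_u16((uint16_t)x) == (uint16_t)(x / 10u));
    }
}
```

The exhaustive check is cheap at 16 bits, which makes it a convenient way to validate any other constant and shift pair before relying on it.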
__smulhu : HL = ((uint32_t) HL * (uint32_t) BC) >> 16
__imulhu : UHL = ((uint48_t) UHL * (uint48_t) UBC) >> 24
__lmulhu : E:UHL = ((uint64_t) E:UHL * (uint64_t) A:UBC) >> 32
__i48mulhu : UDE:UHL = ((uint96_t) UDE:UHL * (uint96_t) UIY:UBC) >> 48
__llmulhu : BC:UDE:UHL = ((uint128_t) BC:UDE:UHL * (uint128_t) (SP64)) >> 64
__smulhs : HL = ((int32_t) HL * (int32_t) BC) >> 16
__imulhs : UHL = ((int48_t) UHL * (int48_t) UBC) >> 24
__lmulhs : E:UHL = ((int64_t) E:UHL * (int64_t) A:UBC) >> 32
__i48mulhs : UDE:UHL = ((int96_t) UDE:UHL * (int96_t) UIY:UBC) >> 48
__llmulhs : BC:UDE:UHL = ((int128_t) BC:UDE:UHL * (int128_t) (SP64)) >> 64
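The signed and unsigned definitions above differ only by a correction term, which can be checked with a small C model at 16 bits. This is a sketch for reasoning about the semantics, not a description of how the assembly routines are implemented; the function names are hypothetical. Values are passed as raw 16-bit patterns so that the code does not depend on implementation-defined signed conversions or shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the unsigned definition at 16 bits. */
uint16_t mulhu16_model(uint16_t x, uint16_t y) {
    return (uint16_t)(((uint32_t)x * (uint32_t)y) >> 16);
}

/* Signed high half as a 16-bit pattern, derived from the unsigned one:
 * mulhs(x, y) = mulhu(x, y) - (x < 0 ? y : 0) - (y < 0 ? x : 0)  (mod 2^16) */
uint16_t mulhs16_from_mulhu(uint16_t x, uint16_t y) {
    uint16_t h = mulhu16_model(x, y);
    if (x & 0x8000u) h = (uint16_t)(h - y);
    if (y & 0x8000u) h = (uint16_t)(h - x);
    return h;
}

/* Direct model of the signed definition: sign-extend, multiply,
 * and take bits 16..31 of the two's-complement product. */
uint16_t mulhs16_direct(uint16_t x, uint16_t y) {
    int32_t sx = (x & 0x8000u) ? (int32_t)x - 65536 : (int32_t)x;
    int32_t sy = (y & 0x8000u) ? (int32_t)y - 65536 : (int32_t)y;
    uint32_t p = (uint32_t)(sx * sy); /* conversion wraps modulo 2^32 */
    return (uint16_t)(p >> 16);
}

/* Compare both models over a strided sample of operand pairs. */
void check_mulhs16(void) {
    for (uint32_t x = 0; x <= 0xFFFFu; x += 257u) {
        for (uint32_t y = 0; y <= 0xFFFFu; y += 251u) {
            assert(mulhs16_from_mulhu((uint16_t)x, (uint16_t)y) ==
                   mulhs16_direct((uint16_t)x, (uint16_t)y));
        }
    }
}
```

The same identity holds at the wider operand sizes, with the modulus changed to match the result width.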
__smulhu : 32 bytes | 33F + 12R + 9W + 17
__imulhu : 117 bytes | 118F + 39R + 38W + 37
__lmulhu : 1 call to __llmulu
__i48mulhu : 93 bytes | 902F + 246R + 182W + 344
__llmulhu : (disables interrupts to use exx) slightly faster than 2 calls to __llmulu
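For reference, the two widest unsigned cases in the list above can be modelled in portable C as follows. The first function mirrors the idea of computing a 32-bit multiply-high with a single 64-bit multiply. The second builds a 64-bit multiply-high from four 32x32 partial products; that is a generic textbook decomposition chosen for illustration and is an assumption, not the method used by __llmulhu. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit multiply-high using one 64-bit multiplication. */
uint32_t lmulhu_model(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * (uint64_t)y) >> 32);
}

/* 64-bit multiply-high without a 128-bit type, built from
 * four 32x32 -> 64 partial products. */
uint64_t llmulhu_model(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Sum the terms that land in bits 32..63; this cannot overflow
     * because each addend is below 2^32. The carry into bit 64 is
     * what the high half needs. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```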
__bmulhu was not added since it is just mlt bc \ ld a, b (and the 8-bit calling convention is not well defined).