added mulhu and mulhs CRT routines

Open ZERICO2005 opened this issue 3 months ago • 0 comments

Added signed and unsigned multiply-high routines, which return the upper half of the double-width product. These can be used to optimize division by a constant. __smulhu is optimized, but the rest are not yet well optimized; we can optimize them in later PRs. They use exactly the same calling convention as the regular multiplication routines.
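
To illustrate the division-by-constant use case, here is a minimal C sketch of dividing a 16-bit value by 10 with a multiply-high followed by a shift. The names `mulhu16`, `div10_u16`, and `check_div10_u16` are hypothetical and exist only for this example. `mulhu16` models the arithmetic of `__smulhu`, not its register-level interface, and the constant `0xCCCD` is the usual reciprocal for 10, not something taken from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of a 16-bit unsigned multiply-high:
 * the upper 16 bits of the 32-bit product a * b. */
static uint16_t mulhu16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* x / 10 without a division instruction.
 * 0xCCCD == ceil(2^19 / 10), so x / 10 == (x * 0xCCCD) >> 19,
 * which is mulhu16(x, 0xCCCD) followed by a further shift of 3. */
static uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(mulhu16(x, 0xCCCDu) >> 3);
}

/* Exhaustively compare against ordinary division for every 16-bit input. */
static void check_div10_u16(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; ++x) {
        assert(div10_u16((uint16_t)x) == (uint16_t)(x / 10u));
    }
}
```

Other divisors need their own multiplier and shift, and some need an extra correction step, so any new constant should be checked the same way before use.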

__smulhu   :         HL = ((uint32_t)         HL * (uint32_t)      BC) >> 16
__imulhu   :        UHL = ((uint48_t)        UHL * (uint48_t)     UBC) >> 24
__lmulhu   :      E:UHL = ((uint64_t)      E:UHL * (uint64_t)   A:UBC) >> 32
__i48mulhu :    UDE:UHL = ((uint96_t)    UDE:UHL * (uint96_t) UIY:UBC) >> 48
__llmulhu  : BC:UDE:UHL = ((uint128_t)BC:UDE:UHL * (uint128_t) (SP64)) >> 64

__smulhs   :         HL = ((int32_t)          HL * (int32_t)       BC) >> 16
__imulhs   :        UHL = ((int48_t)         UHL * (int48_t)      UBC) >> 24
__lmulhs   :      E:UHL = ((int64_t)       E:UHL * (int64_t)    A:UBC) >> 32
__i48mulhs :    UDE:UHL = ((int96_t)     UDE:UHL * (int96_t)  UIY:UBC) >> 48
__llmulhs  : BC:UDE:UHL = ((int128_t) BC:UDE:UHL * (int128_t)  (SP64)) >> 64
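
As a reference for what the signed variants compute, the sketch below derives the signed 16-bit high half from the unsigned one using a standard two's-complement identity. This is an assumption-laden illustration, not the implementation in this PR: `mulhu16_model` and `mulhs16_model` are hypothetical names, and the result is returned as a raw 16-bit pattern (as it would sit in HL) to avoid implementation-defined signed conversions and shifts in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: upper 16 bits of the unsigned 32-bit product. */
static uint16_t mulhu16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Hypothetical model: upper 16 bits of the signed 32-bit product,
 * returned as a two's-complement bit pattern.
 * Identity (mod 2^16): mulhs(a, b) = mulhu(a, b)
 *                                    - (a < 0 ? b : 0)
 *                                    - (b < 0 ? a : 0) */
static uint16_t mulhs16_model(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)a;
    uint16_t ub = (uint16_t)b;
    uint16_t hi = mulhu16_model(ua, ub);

    if (a < 0) {
        hi = (uint16_t)(hi - ub);   /* wraps modulo 2^16 */
    }
    if (b < 0) {
        hi = (uint16_t)(hi - ua);   /* wraps modulo 2^16 */
    }
    return hi;
}
```

For example, `mulhs16_model(-1, 1)` yields `0xFFFF`, the high half of the 32-bit value -1, whereas the unsigned high half of the same bit patterns is `0x0000`.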

__smulhu   :  32 bytes |  33F +  12R +   9W +  17
__imulhu   : 117 bytes | 118F +  39R +  38W +  37
__lmulhu   : 1 call to __llmulu
__i48mulhu :  93 bytes | 902F + 246R + 182W + 344
__llmulhu  : (disables interrupts to use exx) slightly faster than 2 calls to __llmulu

__bmulhu was not added since it is just mlt bc \ ld a, b (and the 8-bit calling convention is not well defined).

Sep 30 '25 03:09 ZERICO2005