Descent3

Use compiler intrinsic byteswapping functions

Open · GravisZro opened this issue 10 months ago · 0 comments

I'm not trying to be stubborn, but the current C++ byteswap code compiles to a byte-by-byte copy loop, which is a lot slower than using the compiler intrinsic functions.

This converts the INTEL_* and MOTOROLA_* macros (for little-endian and big-endian data, respectively) to use the proper compiler intrinsic functions.
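
A minimal sketch of what such macros could look like, assuming GCC/Clang (`__builtin_bswap16`, `__builtin_bswap32`) or MSVC (`_byteswap_ushort`, `_byteswap_ulong`). The issue only names the INTEL_* and MOTOROLA_* prefixes, so INTEL_SHORT, INTEL_INT, MOTOROLA_SHORT, MOTOROLA_INT, BSWAP16, BSWAP32 and HOST_IS_BIG_ENDIAN are hypothetical names, and the endianness check assumes any compiler without `__BYTE_ORDER__` (e.g. MSVC) targets a little-endian host. INTEL_* is then a no-op on little-endian hosts and MOTOROLA_* is a single swap.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ushort, _byteswap_ulong
#define BSWAP16(x) _byteswap_ushort(x)
#define BSWAP32(x) _byteswap_ulong(x)
#else
// GCC/Clang builtins, typically lowered to a single rol/bswap instruction.
#define BSWAP16(x) __builtin_bswap16(x)
#define BSWAP32(x) __builtin_bswap32(x)
#endif

// Host byte order: GCC and Clang predefine __BYTE_ORDER__; anything else
// (e.g. MSVC) is assumed to be little endian.
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define HOST_IS_BIG_ENDIAN 1
#else
#define HOST_IS_BIG_ENDIAN 0
#endif

// Hypothetical macro names. INTEL_* converts little-endian data to host order,
// MOTOROLA_* converts big-endian data to host order (each is its own inverse).
#if HOST_IS_BIG_ENDIAN
#define INTEL_SHORT(x) BSWAP16(static_cast<std::uint16_t>(x))
#define INTEL_INT(x) BSWAP32(static_cast<std::uint32_t>(x))
#define MOTOROLA_SHORT(x) static_cast<std::uint16_t>(x)
#define MOTOROLA_INT(x) static_cast<std::uint32_t>(x)
#else
#define INTEL_SHORT(x) static_cast<std::uint16_t>(x)
#define INTEL_INT(x) static_cast<std::uint32_t>(x)
#define MOTOROLA_SHORT(x) BSWAP16(static_cast<std::uint16_t>(x))
#define MOTOROLA_INT(x) BSWAP32(static_cast<std::uint32_t>(x))
#endif

// Usage: decode the value 0x1234 stored little endian (e.g. in a data file).
inline void example_usage() {
  const unsigned char file_bytes[2] = {0x34, 0x12};
  std::uint16_t raw;
  std::memcpy(&raw, file_bytes, sizeof raw);
  assert(INTEL_SHORT(raw) == 0x1234);  // correct on either host byte order
}
```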

List of macro aliases:

From POSIX, for network byte-order conversions:

  • host to network (16-bit) : htons()
  • host to network (32-bit) : htonl()
  • network to host (16-bit) : ntohs()
  • network to host (32-bit) : ntohl()
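
The POSIX functions are declared in <arpa/inet.h>. A minimal sketch, assuming a POSIX system, of using them to write and read a small packet header; write_header, read_header and the 6-byte layout are hypothetical. Network byte order is big endian, so the most significant byte lands first in the buffer whatever the host byte order.

```cpp
#include <arpa/inet.h>  // htons, htonl, ntohs, ntohl
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 6-byte header: a 16-bit port followed by a 32-bit value,
// both stored in network (big-endian) byte order.
inline void write_header(unsigned char out[6], std::uint16_t port, std::uint32_t value) {
  const std::uint16_t net_port = htons(port);
  const std::uint32_t net_value = htonl(value);
  std::memcpy(out, &net_port, sizeof net_port);
  std::memcpy(out + 2, &net_value, sizeof net_value);
}

inline void read_header(const unsigned char in[6], std::uint16_t& port, std::uint32_t& value) {
  std::uint16_t net_port;
  std::uint32_t net_value;
  std::memcpy(&net_port, in, sizeof net_port);
  std::memcpy(&net_value, in + 2, sizeof net_value);
  port = ntohs(net_port);
  value = ntohl(net_value);
}

// Usage: the most significant byte comes first in the buffer on any host.
inline void example_usage() {
  unsigned char buf[6];
  write_header(buf, 0x1234, 0xAABBCCDDu);
  assert(buf[0] == 0x12 && buf[1] == 0x34);
  assert(buf[2] == 0xAA && buf[5] == 0xDD);
  std::uint16_t port = 0;
  std::uint32_t value = 0;
  read_header(buf, port, value);
  assert(port == 0x1234 && value == 0xAABBCCDDu);
}
```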

From GNU, for endianness conversions:

  • host to little endian (16-bit) : htole16()
  • host to little endian (32-bit) : htole32()
  • host to little endian (64-bit) : htole64()
  • little endian to host (16-bit) : le16toh()
  • little endian to host (32-bit) : le32toh()
  • little endian to host (64-bit) : le64toh()
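
Assuming glibc (Linux), these are declared in <endian.h>; other C libraries may not provide that header. A minimal sketch of storing and loading a 64-bit value in little-endian (Intel) order; store_le64 and load_le64 are hypothetical helpers.

```cpp
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  // exposes htole*/le*toh in glibc's <endian.h>
#endif
#include <endian.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: write v with its least significant byte first.
inline void store_le64(unsigned char out[8], std::uint64_t v) {
  const std::uint64_t le = htole64(v);
  std::memcpy(out, &le, sizeof le);
}

// Hypothetical helper: read a little-endian 64-bit value into host order.
inline std::uint64_t load_le64(const unsigned char in[8]) {
  std::uint64_t le;
  std::memcpy(&le, in, sizeof le);
  return le64toh(le);
}

// Usage: the byte layout is the same on little- and big-endian hosts.
inline void example_usage() {
  unsigned char buf[8];
  store_le64(buf, 0x0102030405060708ull);
  assert(buf[0] == 0x08 && buf[7] == 0x01);
  assert(load_le64(buf) == 0x0102030405060708ull);
}
```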

Comparison to existing code

You can see the assembly output here:

C++ byteswap implementation

```asm
unsigned short D3::byteswap<unsigned short>(unsigned short):
        push    rbp
        mov     rbp, rsp
        mov     eax, edi
        mov     WORD PTR [rbp-20], ax
        mov     QWORD PTR [rbp-8], 0
        jmp     .L12
.L13:
        mov     eax, 1
        sub     rax, QWORD PTR [rbp-8]
        lea     rdx, [rbp-20]
        add     rax, rdx
        lea     rcx, [rbp-10]
        mov     rdx, QWORD PTR [rbp-8]
        add     rdx, rcx
        movzx   eax, BYTE PTR [rax]
        mov     BYTE PTR [rdx], al
        add     QWORD PTR [rbp-8], 1
.L12:
        cmp     QWORD PTR [rbp-8], 1
        jbe     .L13
        movzx   eax, WORD PTR [rbp-10]
        pop     rbp
        ret
```
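
Both listings appear to be unoptimized output (frame-pointer setup, arguments spilled to the stack). The loop at .L13 copies one byte per iteration, from offset 1 - i of the argument to offset i of the result, for i = 0 and 1. A hypothetical C++ reconstruction of that pattern, inferred from the assembly only and not taken from the Descent3 source:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {
// Hypothetical generic byteswap: reverse the object's bytes one at a time.
// Only meaningful for trivially copyable integer types.
template <typename T>
T byteswap_loop(T value) {
  T result{};
  const unsigned char* src = reinterpret_cast<const unsigned char*>(&value);
  unsigned char* dst = reinterpret_cast<unsigned char*>(&result);
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    dst[i] = src[sizeof(T) - 1 - i];  // one load and one store per byte
  }
  return result;
}
}  // namespace sketch

// Usage: reversing the bytes swaps them regardless of host byte order.
inline void example_usage() {
  assert(sketch::byteswap_loop<std::uint16_t>(0x1234) == 0x3412);
  assert(sketch::byteswap_loop<std::uint32_t>(0x11223344u) == 0x44332211u);
}
```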

Compiler intrinsic byteswap implementation

```asm
__bswap_16:
        push    rbp
        mov     rbp, rsp
        mov     eax, edi
        mov     WORD PTR [rbp-4], ax
        movzx   eax, WORD PTR [rbp-4]
        rol     ax, 8
        pop     rbp
        ret
```
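
The single `rol ax, 8` reflects the fact that swapping the two bytes of a 16-bit value is a rotation by 8 bits; `__bswap_16` appears to be glibc's inline wrapper around the GCC builtin. The sketch below, assuming GCC, Clang or MSVC, checks that equivalence and prefers C++23 `std::byteswap` when the standard library provides it. swap16_rotate and swap16_intrinsic are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__has_include)
#if __has_include(<bit>)
#include <bit>  // std::byteswap (C++23), when available
#endif
#endif
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ushort
#endif

// Byte swap written as a rotation by 8 bits (the `rol ax, 8` above).
constexpr std::uint16_t swap16_rotate(std::uint16_t v) {
  return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Intrinsic-based swap: the standard library when available, else a compiler builtin.
inline std::uint16_t swap16_intrinsic(std::uint16_t v) {
#if defined(__cpp_lib_byteswap)
  return std::byteswap(v);
#elif defined(_MSC_VER)
  return _byteswap_ushort(v);
#else
  return __builtin_bswap16(v);
#endif
}

// Usage: both forms agree for every 16-bit input.
inline void example_usage() {
  for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
    const auto x = static_cast<std::uint16_t>(v);
    assert(swap16_rotate(x) == swap16_intrinsic(x));
  }
}
```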

GravisZro · Apr 23 '24 23:04