Use aligned loads in get_partial_safe
In get_partial_safe, it's possible to use aligned loads by declaring the buffer as MaybeUninit<State> instead of a u8 array, then casting its pointer for std::ptr::copy and zeroing the rest of the buffer.
https://github.com/ogxd/gxhash/blob/b2b9d24eb35a48a2a18b1498f48693e523533200/src/gxhash/platform/x86_128.rs#L35-L39
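
A minimal sketch of the idea, not the crate's actual code. `State` here is a hypothetical 16-byte, 16-byte-aligned stand-in for the real vector type, and `get_partial_aligned` is an illustrative name. It also starts from `MaybeUninit::zeroed()` rather than zeroing the tail after the copy, which is a small variation on what is described above:

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Hypothetical stand-in for the crate's `State` (a 128-bit vector):
/// 16 bytes, 16-byte aligned, every bit pattern valid.
#[derive(Clone, Copy)]
#[repr(C, align(16))]
struct State([u8; 16]);

/// Copies `len` bytes from `data` into zeroed storage that has the
/// alignment of `State`, then returns it by value.
///
/// # Safety
/// `data` must be valid for reads of `len` bytes and `len` must not
/// exceed `size_of::<State>()`.
unsafe fn get_partial_aligned(data: *const u8, len: usize) -> State {
    debug_assert!(len <= size_of::<State>());
    // Storage is zero-filled up front, so the tail needs no extra pass.
    let mut buffer = MaybeUninit::<State>::zeroed();
    unsafe {
        // Overwrite only the first `len` bytes; the rest stays zero.
        ptr::copy_nonoverlapping(data, buffer.as_mut_ptr() as *mut u8, len);
        // All 16 bytes are initialized at this point.
        buffer.assume_init()
    }
}
```

With the real vector type, the value returned by `assume_init()` is already the vector, so no separate unaligned load from a byte buffer would be needed. Whether this is measurably faster is not established here.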
From what I can see, [0i8; VECTOR_SIZE] is already allocated on the stack rather than the heap (it is a fixed-size local array whose length is a constant), so MaybeUninit<State> may not be faster.
About to close this one unless someone has a snippet to propose.
I tried using a struct which contains only the byte array and is marked as #[repr(align(16))]. I have not tested the performance yet, but this should still allocate on the stack and force 16-byte alignment.
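
Roughly what that looks like, as an untested-for-performance sketch. `AlignedBuffer`, `copy_to_aligned` and `load_partial` are illustrative names, and `VECTOR_SIZE` is assumed to be 16 as for a 128-bit vector:

```rust
/// Assumed value; stands in for the crate's VECTOR_SIZE.
const VECTOR_SIZE: usize = 16;

/// Byte buffer forced to 16-byte alignment. Still a plain local value,
/// so nothing is heap allocated.
#[repr(align(16))]
struct AlignedBuffer([u8; VECTOR_SIZE]);

/// Copies `data` (at most VECTOR_SIZE bytes) into a zeroed, aligned buffer.
fn copy_to_aligned(data: &[u8]) -> AlignedBuffer {
    assert!(data.len() <= VECTOR_SIZE);
    let mut buffer = AlignedBuffer([0u8; VECTOR_SIZE]);
    buffer.0[..data.len()].copy_from_slice(data);
    buffer
}

/// On x86_64 the buffer can then be read with the aligned load intrinsic.
#[cfg(target_arch = "x86_64")]
fn load_partial(data: &[u8]) -> core::arch::x86_64::__m128i {
    use core::arch::x86_64::{__m128i, _mm_load_si128};
    let buffer = copy_to_aligned(data);
    // SAFETY: the buffer is 16 bytes, 16-byte aligned and fully
    // initialized; SSE2 is part of the x86_64 baseline.
    unsafe { _mm_load_si128(buffer.0.as_ptr() as *const __m128i) }
}
```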
Closing this, as the proposed solution neither provides significant performance gains nor simplifies the code. Feel free to open another issue if you have something to suggest.