Use unaligned.h loads for qb4w scalar ksum loads
Summary
Update the qb4w-family scalar GEMM kernels to load ksums through the unaligned load helpers in unaligned.h. The previous aligned loads were likely an oversight in the initial scalar implementation, as the ksums are only guaranteed to be 16-bit aligned.
The scalar kernels should only be used as a fallback, so this change should have minimal to no impact on ARM or x86 targets. In theory, we could pad out the packed weights slightly to guarantee 32-bit alignment of the ksums, but it looks like the scalar kernel already uses unaligned loads for non-multiple-of-4 NR values. I'm happy to do a deeper analysis if desired.
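For reviewers less familiar with the issue, here is a minimal sketch of why a plain pointer cast is a problem at a 16-bit-aligned address and what an unaligned load does instead. The helper names `load_f32_unaligned` and `roundtrip_at_offset` are hypothetical and are not the unaligned.h API, and the ksum is assumed to be a 32-bit float purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper (NOT the unaligned.h API): read a 32-bit float from an
// address that may only be 2-byte aligned. Dereferencing `(const float*) p`
// at such an address is undefined behavior in C; memcpy is valid for any
// alignment, and compilers typically lower it to a single load where the
// target allows it.
static inline float load_f32_unaligned(const void* address) {
  assert(address != NULL);
  float value;
  memcpy(&value, address, sizeof(value));
  return value;
}

// Hypothetical demo: store a ksum-like value at an arbitrary byte offset in
// a buffer (assumption: the ksum is a 32-bit float), then read it back
// through the unaligned helper.
static inline float roundtrip_at_offset(float ksum, size_t offset) {
  unsigned char buffer[16] = {0};
  assert(offset + sizeof(ksum) <= sizeof(buffer));
  memcpy(buffer + offset, &ksum, sizeof(ksum));
  return load_f32_unaligned(buffer + offset);
}
```

Calling `roundtrip_at_offset(ksum, 2)` places the value at an offset of 2 bytes from the start of the buffer and reads it back without relying on 4-byte alignment of the address.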
Test Plan
I built and ran the tests locally on an M4 Mac with CMake. There are two failures, but I confirmed that both also occur on the parent commit, so they are pre-existing.
The following tests FAILED:
291 - f32-vgelu-test (Failed)
433 - subgraph-fp16-test (Failed)
@fbarchard Here's the unaligned load fix.