add AVX512 support
Keep the original SSE2 & AVX2 optimized edition, add AVX512 support.
AVX512 (x86-64-v4) implies AVX2 (x86-64-v3) and SSE4 (x86-64-v2), so you can keep existing AVX2/SSE4 code as is (keep #define NP2_USE_AVX2 1) and only add new code for AVX512 if it's faster or smaller than existing AVX2/SSE4 code.
AVX512 (
x86-64-v4) implies AVX2 (x86-64-v3) and SSE4 (x86-64-v2), so you can keep existing AVX2/SSE4 code as is (keep#define NP2_USE_AVX2 1) and only add new code for AVX512 if it's faster or smaller than existing AVX2/SSE4 code.
Agreed.
It seems process 8 pixels at a time is unsafe for some DPI / scaling: https://github.com/zufuliu/notepad4/blob/fa7f72499a8f825af5856567ba4a1aab6aea9d12/scintilla/include/GraphicUtils.h#L62-L63
consider a system with 144 dpi (icon base size is 24x24 in HD build), 125% scaling got 30x30/8 => 112.5.
It seems process 8 pixels at a time is unsafe for some DPI / scaling:
https://github.com/zufuliu/notepad4/blob/fa7f72499a8f825af5856567ba4a1aab6aea9d12/scintilla/include/GraphicUtils.h#L62-L63
consider a system with 144 dpi (icon base size is 24x24 in HD build), 125% scaling got 30x30/8 => 112.5.
是的,您的分析是正确的。在处理位图时,特别是在高DPI和缩放的情况下,需要特别注意像素处理的边界问题。 具体来说:
- 在144 DPI + 125%缩放的情况下:
- 24x24的图标实际会显示为30x30像素
- 30/8 = 3.75,这意味着最后一组像素只有6个(30 - 38 = 6)
- 这可能导致内存访问越界或需要额外的边界检查
但实际上,处理4个像素和处理8个像素在边界情况下的本质是一样的:
- 两者都会遇到最后一组不完整的情况
- 都需要进行边界检查
- 都需要特殊处理最后一组像素
主要区别在于:
- 处理8个像素时,最后一组有6个像素
- 处理4个像素时,最后一组只有2个像素
- 从实现角度看,处理6个像素可能比处理2个像素更容易
性能考虑:
- 处理8个像素可能比处理4个像素更高效
- 减少了循环次数
- 减少了边界检查的次数
结论:
- 只要实现时做好边界检查
- 处理8个像素也是可行的
- 甚至可能比处理4个像素更高效
所以只要正确处理边界情况,处理8个像素并不会比处理4个像素更不安全。这更多是一个实现细节的问题,而不是根本性的安全问题。
For 32bpp bitmap, no padding bytes on each row, so the code just treat the image has bmp.bmHeight * bmp.bmWidth pixels and process them as if they are on same row (pixel location doesn't affect final result).
For 144 DPI + 125%, 30x30/4 = 255, so no special handling needed on last block when processed 4 pixels at a time. AVX512 code will need to handling potential 4 pixels (load/save) on last block (no need to handle scaling other than 100%, 125%, 150%, 175%, 200%).
AVX512 version RGBAImage::BGRAFromRGBA() and BitmapAlphaBlend() are merged (with minor fixes, https://github.com/zufuliu/notepad4/blob/main/scintilla/test/GraphicTest2.cpp can be changed to test the code), they are disabled:
- process 8 pixels at a time may left 4 pixels unhandled.
- speed may not much faster than AVX2, it can be measured (using the largest toolbar image Toolbar48.bmp) at caller site inside
CreateBars(): https://github.com/zufuliu/notepad4/blob/4a1d6f3833a6302b5459667b97a1a3eab55022a3/src/Notepad4.cpp#L1918-L1922
The other unused code will not be merged.