notepad4 icon indicating copy to clipboard operation
notepad4 copied to clipboard

add AVX512 support

Open missdeer opened this issue 8 months ago • 5 comments

Keep the original SSE2 & AVX2 optimized edition, add AVX512 support.

missdeer avatar Apr 17 '25 06:04 missdeer

AVX512 (x86-64-v4) implies AVX2 (x86-64-v3) and SSE4 (x86-64-v2), so you can keep existing AVX2/SSE4 code as is (keep #define NP2_USE_AVX2 1) and only add new code for AVX512 if it's faster or smaller than existing AVX2/SSE4 code.

zufuliu avatar Apr 18 '25 11:04 zufuliu

AVX512 (x86-64-v4) implies AVX2 (x86-64-v3) and SSE4 (x86-64-v2), so you can keep existing AVX2/SSE4 code as is (keep #define NP2_USE_AVX2 1) and only add new code for AVX512 if it's faster or smaller than existing AVX2/SSE4 code.

Agreed.

missdeer avatar Apr 19 '25 04:04 missdeer

It seems process 8 pixels at a time is unsafe for some DPI / scaling: https://github.com/zufuliu/notepad4/blob/fa7f72499a8f825af5856567ba4a1aab6aea9d12/scintilla/include/GraphicUtils.h#L62-L63

consider a system with 144 dpi (icon base size is 24x24 in HD build), 125% scaling got 30x30/8 => 112.5.

zufuliu avatar Apr 19 '25 22:04 zufuliu

It seems process 8 pixels at a time is unsafe for some DPI / scaling:

https://github.com/zufuliu/notepad4/blob/fa7f72499a8f825af5856567ba4a1aab6aea9d12/scintilla/include/GraphicUtils.h#L62-L63

consider a system with 144 dpi (icon base size is 24x24 in HD build), 125% scaling got 30x30/8 => 112.5.

是的,您的分析是正确的。在处理位图时,特别是在高DPI和缩放的情况下,需要特别注意像素处理的边界问题。 具体来说:

  • 在144 DPI + 125%缩放的情况下:
  • 24x24的图标实际会显示为30x30像素
  • 30/8 = 3.75,这意味着最后一组像素只有6个(30 - 38 = 6)
  • 这可能导致内存访问越界或需要额外的边界检查

但实际上,处理4个像素和处理8个像素在边界情况下的本质是一样的:

  • 两者都会遇到最后一组不完整的情况
  • 都需要进行边界检查
  • 都需要特殊处理最后一组像素

主要区别在于:

  • 处理8个像素时,最后一组有6个像素
  • 处理4个像素时,最后一组只有2个像素
  • 从实现角度看,处理6个像素可能比处理2个像素更容易

性能考虑:

  • 处理8个像素可能比处理4个像素更高效
  • 减少了循环次数
  • 减少了边界检查的次数

结论:

  • 只要实现时做好边界检查
  • 处理8个像素也是可行的
  • 甚至可能比处理4个像素更高效

所以只要正确处理边界情况,处理8个像素并不会比处理4个像素更不安全。这更多是一个实现细节的问题,而不是根本性的安全问题。

missdeer avatar Apr 20 '25 02:04 missdeer

For 32bpp bitmap, no padding bytes on each row, so the code just treat the image has bmp.bmHeight * bmp.bmWidth pixels and process them as if they are on same row (pixel location doesn't affect final result).

For 144 DPI + 125%, 30x30/4 = 255, so no special handling needed on last block when processed 4 pixels at a time. AVX512 code will need to handling potential 4 pixels (load/save) on last block (no need to handle scaling other than 100%, 125%, 150%, 175%, 200%).

zufuliu avatar Apr 20 '25 03:04 zufuliu

AVX512 version RGBAImage::BGRAFromRGBA() and BitmapAlphaBlend() are merged (with minor fixes, https://github.com/zufuliu/notepad4/blob/main/scintilla/test/GraphicTest2.cpp can be changed to test the code), they are disabled:

  1. process 8 pixels at a time may left 4 pixels unhandled.
  2. speed may not much faster than AVX2, it can be measured (using the largest toolbar image Toolbar48.bmp) at caller site inside CreateBars(): https://github.com/zufuliu/notepad4/blob/4a1d6f3833a6302b5459667b97a1a3eab55022a3/src/Notepad4.cpp#L1918-L1922

The other unused code will not be merged.

zufuliu avatar Aug 09 '25 11:08 zufuliu