AVX512 support?

Open missdeer opened this issue 9 months ago • 2 comments

The AVX2 edition is super fast. Is it possible to have an AVX512 edition?

missdeer avatar Mar 12 '25 04:03 missdeer

AVX512 is uncommon in desktop CPUs (especially Intel CPUs since 12th Gen).

zufuliu avatar Mar 16 '25 11:03 zufuliu

https://github.com/zufuliu/notepad4/pull/1006

missdeer avatar Apr 19 '25 09:04 missdeer

A build configuration for AVX512 was added by commit 01a16c5677325c098bf0a837a0738b98b9878ef9. It's tentative and will be replaced with AVX10.X (e.g. AVX10.2) in the future, when the latter becomes popular in desktop CPUs.

zufuliu avatar Jul 27 '25 09:07 zufuliu

Some code (line endings detection, brace matching, etc.) has been ported to AVX512 (guarded by NP2_USE_AVX512); the remaining parts (UTF-8 validation and image processing) will be added later. I don't have access to a computer that supports AVX512, so I'm not sure whether this AVX512 code is faster than its AVX2 counterpart.

zufuliu avatar Aug 03 '25 07:08 zufuliu

I have a Windows 11 box that supports AVX512. I would like to help test the new feature if you can provide efficient test cases.

missdeer avatar Aug 04 '25 05:08 missdeer

Just test big files (hundreds of MB, with word wrap off and LF / CR+LF line endings) on your system.

  1. enable console output, uncomment line above AttachConsole() inside wWinMain() in Notepad4.cpp: https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/src/Notepad4.cpp#L490

  2. EOL detection, uncomment code at end of EditLoadFile(): https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/src/Edit.cpp#L1173-L1180

  3. brace matching, uncomment code for case Message::BraceMatch: https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/scintilla/src/Editor.cxx#L8219-L8227

  4. insert string, uncomment code inside CellBuffer::InsertString() or CellBuffer::BasicInsertString():
     https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/scintilla/src/CellBuffer.cxx#L446-L449
     https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/scintilla/src/CellBuffer.cxx#L827
     https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/scintilla/src/CellBuffer.cxx#L1213-L1215
     For BasicInsertString(), AVX512 is currently disabled because its stack usage (compiled with cl /utf-8 /W4 /c /EHsc /std:c++20 /O2 /GS- /GR- /Gv /FAcs /DNDEBUG /DUNICODE /DNOMINMAX /I../include /I../lexlib /arch:AVX512 CellBuffer.cxx) is 200 bytes larger than with AVX2: https://github.com/zufuliu/notepad4/blob/07874e0c9c6624ad15ae14ac664415aa4d6998ee/scintilla/src/CellBuffer.cxx#L844
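
To get both LF and CR+LF variants of the same big file, something like the following could be used. It is only a sketch: the helper names and the file names in the usage note are made up for illustration, and it assumes a byte-oriented encoding such as UTF-8 or GBK, where the bytes 0x0D and 0x0A only ever appear as line endings (it would not work for UTF-16).

```python
import re

def convert_eol(data: bytes, eol: bytes) -> bytes:
    """Replace every CR+LF, lone CR, or lone LF in data with eol."""
    # CR+LF is listed first so it is treated as one line ending, not two.
    return re.sub(rb'\r\n|\r|\n', eol, data)

def write_variants(src: str, lf_path: str, crlf_path: str) -> None:
    """Write an LF copy and a CR+LF copy of src (whole file is read into memory)."""
    with open(src, 'rb') as fd:
        data = fd.read()
    with open(lf_path, 'wb') as fd:
        fd.write(convert_eol(data, b'\n'))
    with open(crlf_path, 'wb') as fd:
        fd.write(convert_eol(data, b'\r\n'))
```

For example, `write_variants('big.json', 'big-lf.json', 'big-crlf.json')` would produce one file per line ending style to open in each build.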

zufuliu avatar Aug 04 '25 10:08 zufuliu

Here is the script I used to make big files:

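import json

# Any large JSON file works as input; this path is specific to the author's machine.
path = r'D:\Program Files\Microsoft Visual Studio\Packages\_Instances\953ef33b\catalog.json'
with open(path, encoding='utf-8') as fd:
	doc = fd.read()
catalog = json.loads(doc)

# Copy of the original document. Files opened in text mode on Windows
# are written with CR+LF line endings.
with open('catalog.json', 'w', encoding='utf-8') as fd:
	fd.write(doc)

# Tab-indented, pretty-printed version.
pretty = json.dumps(catalog, ensure_ascii=False, indent='\t')
pretty = '\n' + pretty + '\n'
with open('pretty.json', 'w', encoding='utf-8') as fd:
	fd.write(pretty)

# Original document followed by the pretty-printed version.
with open('catalog2.json', 'w', encoding='utf-8') as fd:
	fd.write(doc)
	fd.write(pretty)

# 30 copies of the catalog, wrapped in extra bracket pairs
# (presumably for the brace matching test).
big = {}
for i in range(30):
	big[i] = catalog
pretty = json.dumps(big, ensure_ascii=False, indent='\t')
pretty = '\n[\n(\n<\n' + pretty + '\n>\n)\n]\n'
with open('big.json', 'w', encoding='utf-8') as fd:
	fd.write(pretty)

# GBK-encoded copy; binary mode skips newline translation, so line endings stay LF.
pretty = pretty.encode('gbk', 'backslashreplace')
with open('big-gbk.json', 'wb') as fd:
	fd.write(pretty)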
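
To cross-check the "CR+LF: ..., LF: ..., CR: ..." line that Notepad4 prints for a generated file, a plain scalar count can serve as a reference. This is a minimal sketch and not the editor's own detection code; `count_eols` and `report` are hypothetical names.

```python
def count_eols(data: bytes) -> tuple[int, int, int]:
    """Return (CR+LF, lone LF, lone CR) counts for data."""
    crlf = data.count(b'\r\n')
    # Every CR+LF pair contributes one CR and one LF, so subtract the pairs.
    lf = data.count(b'\n') - crlf
    cr = data.count(b'\r') - crlf
    return crlf, lf, cr

def report(path: str) -> str:
    """Format the counts for a file in the same style as the console output."""
    with open(path, 'rb') as fd:
        crlf, lf, cr = count_eols(fd.read())
    return f'CR+LF: {crlf}, LF: {lf}, CR: {cr}'
```

Calling `print(report('big.json'))` after running the script above should give a line that can be compared directly with the editor's output for the same file.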

zufuliu avatar Aug 04 '25 10:08 zufuliu

Got it. I will try to run these test cases soon.

missdeer avatar Aug 05 '25 02:08 missdeer

simple quick test result:

>notepad4-avx2.exe big.json

C:\Users\HP\Downloads\notepad4-test>
H:\Projects\notepad2\src\Notepad4.cpp:495 wWinMain
Notepad4 EOL time: 37.976700
CR+LF: 17837319, LF: 0, CR: 0
BasicInsertString avx2=1, cache=256, perLine=1023, duration=58.045100
InsertString duration=145.876400
BraceMatch 2 / -1, 16606350 / 677007975, 28.856600
BraceMatch 5 / 23182016, 23181522 / 677007975, 0.955200
BraceMatch 8 / 677007966, 36735725 / 677007975, 16.262400
BraceMatch 11 / 677007963, 38494926 / 677007975, 37.358900
BraceMatch 20 / 22566941, 44255015 / 677007975, 2.368700
BraceMatch 674 / 2436, 65692067 / 677007975, 0.000600
BraceMatch 680 / 814, 78084145 / 677007975, 0.000500
BraceMatch 2454 / 22543705, 154572928 / 677007975, 3.739700
BraceMatch 2460 / 4534, 156856227 / 677007975, 0.001000
BraceMatch 22543724 / 22543933, 170396159 / 677007975, 0.000400
BraceMatch 22543952 / 22566937, 173210459 / 677007975, 0.006800
BraceMatch 22543970 / 22544144, 175255517 / 677007975, 0.000600

======================================================================
notepad4-avx2.exe big-gbk.json

H:\Projects\notepad2\src\Notepad4.cpp:495 wWinMain
Notepad4 EOL time: 26.452200
CR+LF: 0, LF: 17837319, CR: 0
BasicInsertString avx2=1, cache=256, perLine=1023, duration=50.733500
InsertString duration=130.058000
BraceMatch 1 / -1, 37170555 / 663802026, 37.276200
BraceMatch 3 / 52713878, 52298785 / 663802026, 2.221700
BraceMatch 5 / 663802020, 59846479 / 663802026, 19.668100
BraceMatch 7 / 663802018, 80585195 / 663802026, 42.052400
BraceMatch 5 / 663802020, 85119126 / 663802026, 31.121200
BraceMatch 1 / -1, 95948811 / 663802026, 39.473900
BraceMatch 15 / 22126739, 115442847 / 663802026, 4.701800
BraceMatch 633 / 91, 165515441 / 663802026, 0.000600
BraceMatch 649 / 2364, 176147389 / 663802026, 0.000700
BraceMatch 1032 / 1164, 207105042 / 663802026, 0.000700
BraceMatch 22103559 / 22103764, 246998259 / 663802026, 0.000500
BraceMatch 2381 / 22103541, 251947793 / 663802026, 4.444000

======================================================================
notepad4-avx512.exe big.json

H:\Projects\notepad2\src\Notepad4.cpp:495 wWinMain
Notepad4 EOL time: 40.889300
CR+LF: 17837319, LF: 0, CR: 0
BasicInsertString avx2=1, cache=256, perLine=1023, duration=61.784000
InsertString duration=161.582300
BraceMatch 2 / -1, 16432709 / 677007975, 31.505500
BraceMatch 5 / 24935957, 24794339 / 677007975, 2.519400
BraceMatch 8 / 677007966, 36954773 / 677007975, 14.782100
BraceMatch 11 / 677007963, 38701202 / 677007975, 41.382900
BraceMatch 20 / 22566941, 45498942 / 677007975, 2.217700
BraceMatch 674 / 2436, 83889867 / 677007975, 0.000800
BraceMatch 8 / 677007966, 677007975 / 677007975, 14.773800
BraceMatch 11 / 677007963, 677007975 / 677007975, 69.885200
BraceMatch 20 / 22566941, 677007975 / 677007975, 2.742900
BraceMatch 2454 / 22543705, 677007975 / 677007975, 2.213800
BraceMatch 2460 / 4534, 677007975 / 677007975, 0.001000
BraceMatch 22543952 / 22566937, 677007975 / 677007975, 0.003200
BraceMatch 22543970 / 22544144, 677007975 / 677007975, 0.000500

======================================================================
notepad4-avx512.exe big-gbk.json

H:\Projects\notepad2\src\Notepad4.cpp:495 wWinMain
Notepad4 EOL time: 20.167400
CR+LF: 0, LF: 17837319, CR: 0
BasicInsertString avx2=1, cache=256, perLine=1023, duration=49.942100
InsertString duration=141.289800
BraceMatch 1 / -1, 16133616 / 663802026, 34.732700
BraceMatch 3 / 25321651, 24560536 / 663802026, 1.218400
BraceMatch 5 / 663802020, 36304327 / 663802026, 15.959800
BraceMatch 7 / 663802018, 38896663 / 663802026, 35.508500
BraceMatch 15 / 22126739, 46926987 / 663802026, 2.065700
BraceMatch 649 / 2364, 89402873 / 663802026, 0.000800
BraceMatch 2381 / 22103541, 109116522 / 663802026, 2.249700
BraceMatch 2381 / 22103541, 109116522 / 663802026, 3.724500
BraceMatch 2386 / 4395, 114974590 / 663802026, 0.001100
BraceMatch 22103559 / 22103764, 121131402 / 663802026, 0.000500
BraceMatch 22103782 / 22126736, 125064168 / 663802026, 0.002100
BraceMatch 3 / 141220810, 140834221 / 663802026, 12.278700
BraceMatch 1 / -1, 141898139 / 663802026, 50.197100

CPU info:

C:\Users\HP>wmic cpu list full


AddressWidth=64
Architecture=9
Availability=3
Caption=AMD64 Family 25 Model 117 Stepping 2
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2646
CurrentVoltage=12
DataWidth=64
Description=AMD64 Family 25 Model 117 Stepping 2
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=100
Family=107
InstallDate=
L2CacheSize=8192
L2CacheSpeed=
LastErrorCode=
Level=25
LoadPercentage=10
Manufacturer=AuthenticAMD
MaxClockSpeed=3301
Name=AMD Ryzen 7 8840HS w/ Radeon 780M Graphics
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=178BFBFF00A70F52
ProcessorType=3
Revision=29954
Role=CPU
SocketDesignation=FP8
Status=OK
StatusInfo=3
Stepping=2
SystemCreationClassName=Win32_ComputerSystem
SystemName=DESKTOP-CRIAF8P
UniqueId=
UpgradeMethod=6
Version=Model 5, Stepping 2
VoltageCaps=

Test with Notepad4 25.07 r5738 (08d0b0f0) with MSVC 19.44.35214.0

missdeer avatar Aug 09 '25 03:08 missdeer

> Test with Notepad4 25.07 r5738 (08d0b0f) with MSVC 19.44.35214.0

It seems the About dialog was not rebuilt as expected; the commit should be c774a0ba45cfc5759f8d881b7ef84a9e2641e0dd:

commit c774a0ba45cfc5759f8d881b7ef84a9e2641e0dd (HEAD -> main, origin/main, origin/HEAD)
Author: zufuliu <[email protected]>
Date:   Fri Aug 8 21:32:53 2025 +0800

    Fix wrong pixel count for `RGBAImage::BGRAFromRGBA()`.

missdeer avatar Aug 09 '25 03:08 missdeer

> simple quick test result:

Thanks. So it looks like AVX512 is not obviously faster than AVX2.
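
One rough way to put numbers on this is to compute avx512/avx2 ratios from the timings posted above. The sketch below only uses figures copied from the console output in this thread; each is a single run, so noise may be significant, and the names `TIMINGS` and `ratios` are just for illustration.

```python
# (avx2, avx512) timings copied from the console output in this thread.
# The thread does not state the unit, so only ratios are computed.
# BasicInsertString logs avx2=1 in both builds, because AVX512 is
# disabled for that function.
TIMINGS = {
    ('big.json', 'EOL time'): (37.976700, 40.889300),
    ('big.json', 'BasicInsertString'): (58.045100, 61.784000),
    ('big.json', 'InsertString'): (145.876400, 161.582300),
    ('big-gbk.json', 'EOL time'): (26.452200, 20.167400),
    ('big-gbk.json', 'BasicInsertString'): (50.733500, 49.942100),
    ('big-gbk.json', 'InsertString'): (130.058000, 141.289800),
}

def ratios(timings):
    """Return avx512/avx2 per entry; below 1.0 means the AVX512 build was faster."""
    return {key: avx512 / avx2 for key, (avx2, avx512) in timings.items()}

for (name, metric), r in ratios(TIMINGS).items():
    print(f'{name:14}{metric:20}avx512/avx2 = {r:.2f}')
```

Repeating each measurement several times and comparing medians would give a more reliable picture than these single runs.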

> Seems that the about dialog is not rebuilt

It is rebuilt only after Dialogs.cpp or its includes change. CI has a step to update the revision: https://github.com/zufuliu/notepad4/blob/8e4af459957c4918334aad081b462633b80fd937/.github/workflows/main.yml#L16-L18

zufuliu avatar Aug 09 '25 06:08 zufuliu

> Thanks, so it looks AVX512 is not obviously faster than AVX2.

My computer is a laptop, and AVX-512 instructions may cause the CPU to throttle, so the performance boost might not be as high as expected.

missdeer avatar Aug 09 '25 08:08 missdeer

> and the AVX-512 instruction set may cause the CPU to throttle, so its performance boost might not be as high as expected.

I don't know more about this; maybe it will be faster on server CPUs (which only have P cores?).

I abandoned the AVX512 version of UTF-8 validation:

  1. The code doesn't compile with just -march=x86-64-v4 (Clang) due to the use of _mm512_permutex2var_epi8, which requires AVX512-VBMI (not part of x86-64-v4).
  2. The code from https://github.com/lemire/fastvalidate-utf-8 is complex, and the code from https://github.com/zwegner/faster-utf8-validator doesn't pass the tests (see https://github.com/zwegner/faster-utf8-validator/issues/6).
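
For checking any candidate SIMD validator against known-good answers, Python's strict UTF-8 decoder can act as a simple oracle. The following is only a sketch under that assumption; the case list and the names `is_valid_utf8`, `CASES`, and `failures` are illustrative and are not taken from either repository's test suite. Each case is also shifted by an ASCII prefix so that the interesting bytes land on or across a 64-byte block boundary.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Reference answer: Python's strict decoder rejects overlongs and surrogates."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

CASES = [
    (b'hello', True),
    ('中文'.encode('utf-8'), True),
    (b'\xf0\x9f\x98\x80', True),   # U+1F600, a valid 4-byte sequence
    (b'\xc0\x80', False),          # overlong encoding of U+0000
    (b'\xed\xa0\x80', False),      # encoded UTF-16 surrogate U+D800
    (b'\xf4\x90\x80\x80', False),  # code point above U+10FFFF
    (b'\xe4\xb8', False),          # truncated 3-byte sequence
    (b'\x80', False),              # stray continuation byte
]

def failures(validator, offsets=(0, 61, 62, 63, 64)):
    """Return (offset, data) pairs where validator disagrees with the expected result."""
    bad = []
    for offset in offsets:
        for data, expected in CASES:
            # An ASCII prefix does not change validity, only alignment.
            padded = b'a' * offset + data
            if validator(padded) != expected:
                bad.append((offset, data))
    return bad
```

A validator exposed to Python (for example through ctypes) could be passed to `failures()`; an empty list means it agrees with the oracle on these cases.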

zufuliu avatar Aug 17 '25 01:08 zufuliu

> 2. The code from https://github.com/lemire/fastvalidate-utf-8 is complex, the code from https://github.com/zwegner/faster-utf8-validator doesn't pass the tests (see [AVX512 version form wip branch doesn't pass the tests zwegner/faster-utf8-validator#6](https://github.com/zwegner/faster-utf8-validator/issues/6)).

How about this library: https://github.com/simdutf/

missdeer avatar Aug 17 '25 11:08 missdeer

> How about this library: https://github.com/simdutf/

It is hard to isolate only the AVX512 parts from its C++ template code (the same goes for https://github.com/simdutf/is_utf8).

zufuliu avatar Aug 17 '25 11:08 zufuliu

Closing this for now; AVX512 UTF-8 validation could be implemented later.

zufuliu avatar Sep 13 '25 07:09 zufuliu