Consider replacing src/util/simdutf8check.h

Open lemire opened this issue 2 years ago • 13 comments

The code in src/util/simdutf8check.h is suboptimal. It could be replaced by either of the following:

  1. simdutf which provides full support for Unicode (transcoding, validation, and so forth).
  2. is_utf8 which provides just the check for UTF-8 validity.

Both of these libraries have runtime dispatching, support for NEON, AVX-512 and so forth, with extensive testing.

Switching to simdutf would improve the performance.

lemire avatar Dec 30 '22 21:12 lemire

Thanks for the advice. We will benchmark it later.

murphyatwork avatar Dec 30 '22 21:12 murphyatwork

I got a compile error with simdutf, @lemire:

  • Error: Error: no such instruction: `vpcompressb %zmm0,%zmm1{%k6}'
  • GCC 10.3.0

CPU:

➤ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7

murphyatwork avatar Jan 05 '23 02:01 murphyatwork

And it works on GCC 11.3.

murphyatwork avatar Jan 05 '23 02:01 murphyatwork

@mofeiatwork

If it is a compile-time error, then it is unrelated to your CPU (i.e., lscpu is not relevant) because we don't compile for the host target. It is also not related to the compiler.

Does your assembler match your compiler? If you mix a recent compiler with an incompatible/old assembler (e.g., when using a recent compiler with an old Linux distribution), you may get errors at build time because the compiler produces instructions that the assembler does not recognize: you should update your assembler to match your compiler (e.g., upgrade binutils to version 2.30 or better under Linux) or use an older compiler matching the capabilities of your assembler.

Unrecognized instructions at compile time are usually the symptom of an unsupported mix of tools. E.g., if your compiler knows about some fancy instructions that your assembler does not yet know about, you will get trouble.

Does that help?

lemire avatar Jan 05 '23 13:01 lemire

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

github-actions[bot] avatar Jul 10 '23 11:07 github-actions[bot]

Note that simdutf is part of Node.js which is built and used on millions of systems.

If you cannot build simdutf, the most likely cause is a bad toolchain.

lemire avatar Jul 24 '23 13:07 lemire

@mofeiatwork We saw a similar error on our CentOS 7 toolchain, while the build works on our Ubuntu 22 toolchains. Maybe it's time to upgrade our CentOS 7 toolchain to GCC 11?

kevincai avatar Aug 19 '23 10:08 kevincai

It is almost certainly the case that you are mixing an old code generator (assembler) with a more recent compiler. Linux distributions upgrade the two in sync, but you can build systems that have an old code generator and a very recent compiler. Check your gas/nasm version.

The problem often is that the compiler does not know which code generator it is paired with, so it asks the code generator to produce instructions that the code generator does not know about.

Quoting from the simdutf documentation:

AVX-512 support require a processor with AVX512-VBMI2 (Ice Lake or better) and a recent compiler (GCC 8 or better, Visual Studio 2019 or better, LLVM clang 6 or better). You need a correspondingly recent assembler such as gas (2.30+) or nasm (2.14+): recent compilers usually come with recent assemblers. If you mix a recent compiler with an incompatible/old assembler (e.g., when using a recent compiler with an old Linux distribution), you may get errors at build time because the compiler produces instructions that the assembler does not recognize: you should update your assembler to match your compiler (e.g., upgrade binutils to version 2.30 or better under Linux) or use an older compiler matching the capabilities of your assembler.

lemire avatar Aug 20 '23 17:08 lemire

That could be it. We are using a CentOS 7 image with the default binutils and a manually compiled GCC 10. The build fails with GCC 10 + default binutils (which includes the as assembler), but works after upgrading to GCC 11 + default binutils.

I still can't understand the cause.

kevincai avatar Aug 21 '23 02:08 kevincai

We plan to upgrade GCC in the toolchain from 10.3 to 11.3: #29552

kevincai avatar Aug 21 '23 03:08 kevincai

It appears that the default as in CentOS 7 is too old (version 2.27); we need to upgrade binutils to fix the assembler error.
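For anyone checking for the same mismatch, here is a minimal sketch. It assumes GNU sort with -V is available and that gcc -print-prog-name=as reports the assembler GCC will invoke. The version_ge helper is a made-up name for illustration; the 2.30 threshold is the gas minimum quoted from the simdutf documentation earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch: find the assembler GCC uses and compare its version with 2.30.

# version_ge A B: succeeds when version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Ask GCC which `as` it runs; fall back to whatever `as` is on PATH.
as_bin="$(gcc -print-prog-name=as 2>/dev/null || command -v as)"

# The first line of `as --version` contains the version; keep the first x.y match.
as_ver="$("$as_bin" --version 2>/dev/null | head -n 1 | grep -oE '[0-9]+\.[0-9]+' | head -n 1)"

if [ -z "$as_ver" ]; then
  echo "could not determine the assembler version"
elif version_ge "$as_ver" 2.30; then
  echo "assembler $as_ver meets the 2.30 minimum"
else
  echo "assembler $as_ver is older than 2.30; consider upgrading binutils"
fi
```

With the CentOS 7 default assembler reported above (2.27), this would take the last branch. It only compares version numbers; it does not try to assemble any AVX-512 instruction.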

kevincai avatar Aug 21 '23 09:08 kevincai

With the change in #29813, the simdutf project builds successfully with no errors.

$ docker run -it --rm -v `pwd`/simdutf:/root/simdutf starrocks/dev-env-centos7:latest bash
[root@c0315012778a ~]# cd /root/simdutf/
[root@c0315012778a simdutf]# cmake -B build
-- The CXX compiler identification is GNU 10.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rh/gcc-toolset-10/root/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- No build type selected, default to Release
-- Found Python3: /usr/bin/python3.6 (found version "3.6.8") found components: Interpreter 
-- Python found, we are going to amalgamate.py.
-- The tests are enabled.
-- Performing Test Iconv_IS_BUILT_IN
-- Performing Test Iconv_IS_BUILT_IN - Success
-- Found Iconv: built in to C library  
-- Iconv was found!
-- Iconv is part of the C library.
-- looking for static C++ library in /opt/rh/gcc-toolset-10/root/usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0
-- libstdc++.a not found
-- The benchmarks can be disabled by setting SIMDUTF_BENCHMARKS, e.g., -D SIMDUTF_BENCHMARKS=OFF.
-- Found the following ICU libraries:
--   uc (required)
-- Failed to find all ICU components (missing: ICU_INCLUDE_DIR ICU_LIBRARY) 
-- We rely on the system's ICU. It was not found!
-- Iconv was found!
-- Iconv is part of the C library.
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Compiling using the C++ standard:11
-- Configuring done
-- Generating done
-- Build files have been written to: /root/simdutf/build
[root@c0315012778a simdutf]# cd build
[root@c0315012778a build]# make -j 10
[  1%] Building CXX object src/CMakeFiles/simdutf.dir/simdutf.cpp.o
[  1%] Linking CXX static library libsimdutf.a
[  1%] Built target simdutf
[  2%] Generating simdutf.cpp, simdutf.h, amalgamation_demo.cpp, README.md
[  3%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/encode_utf8.cpp.o
[  4%] Building CXX object tools/CMakeFiles/sutf.dir/sutf.cpp.o
[  5%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/encode_utf16.cpp.o
[  5%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/encode_utf32.cpp.o
[  6%] Building CXX object benchmarks/CMakeFiles/stream.dir/stream.cpp.o
[  7%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/encode_latin1.cpp.o
[  7%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf8_to_latin1.cpp.o
[  7%] Building CXX object benchmarks/CMakeFiles/alignment.dir/alignment.cpp.o
[  8%] Building CXX object benchmarks/CMakeFiles/threaded.dir/threaded.cpp.o
[  9%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf16_to_latin1.cpp.o
[ 10%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf32_to_latin1.cpp.o
[ 10%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf8.cpp.o
[ 11%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf16.cpp.o
[ 11%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_utf32.cpp.o
SCRIPTPATH=/root/simdutf/singleheader PROJECTPATH=/root/simdutf
We are about to amalgamate all simdutf files into one source file.
See https://www.sqlite.org/amalgamation.html and https://en.wikipedia.org/wiki/Single_Compilation_Unit for rationale.
[ 12%] Building CXX object tests/reference/CMakeFiles/simdutf_tests_reference.dir/validate_latin1.cpp.o
timestamp is 2023-08-11 14:01:05 -0400
Creating /root/simdutf/build/singleheader/simdutf.h


Creating /root/simdutf/build/singleheader/simdutf.cpp
[ 12%] Linking CXX static library libsimdutf_tests_reference.a
[ 12%] Built target simdutf_tests_reference
[ 13%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/test.cpp.o
[ 14%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/transcode_test_base.cpp.o
[ 14%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/random_int.cpp.o
[ 15%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/random_utf8.cpp.o
[ 15%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/random_utf16.cpp.o
Done with all files generation.
Files have been written to directory: /root/simdutf/build/singleheader/
Done with all files generation.

Giving final instructions:




While in the singleheader directory under a linux or macOS system with an install toolchain, try:





c++ -o amalgamation_demo amalgamation_demo.cpp -std=c++17 && ./amalgamation_demo


[ 15%] Built target singleheader-files
[ 16%] Building CXX object tests/helpers/CMakeFiles/simdutf_tests_helpers.dir/random_utf32.cpp.o
[ 17%] Building CXX object singleheader/CMakeFiles/amalgamation_demo.dir/amalgamation_demo.cpp.o
[ 17%] Linking CXX executable stream
[ 18%] Linking CXX executable threaded
[ 18%] Built target stream
[ 19%] Linking CXX executable alignment
[ 19%] Built target threaded
[ 19%] Built target alignment
[ 19%] Linking CXX static library libsimdutf_tests_helpers.a
[ 19%] Built target simdutf_tests_helpers
[ 19%] Building CXX object tests/CMakeFiles/random_fuzzer.dir/random_fuzzer.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/special_tests.dir/special_tests.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/validate_ascii_basic_tests.dir/validate_ascii_basic_tests.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/validate_ascii_with_errors_tests.dir/validate_ascii_with_errors_tests.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/bele_tests.dir/bele_tests.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/validate_utf8_basic_tests.dir/validate_utf8_basic_tests.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/select_implementation.dir/select_implementation.cpp.o
[ 21%] Building CXX object tests/CMakeFiles/validate_utf8_brute_force_tests.dir/validate_utf8_brute_force_tests.cpp.o
[ 21%] Linking CXX executable sutf
[ 22%] Linking CXX executable select_implementation
[ 23%] Linking CXX executable validate_ascii_basic_tests
[ 23%] Built target sutf
[ 24%] Building CXX object tests/CMakeFiles/validate_utf8_puzzler_tests.dir/validate_utf8_puzzler_tests.cpp.o
[ 25%] Linking CXX executable validate_utf8_basic_tests
[ 25%] Built target select_implementation
[ 26%] Building CXX object tests/CMakeFiles/validate_utf8_with_errors_tests.dir/validate_utf8_with_errors_tests.cpp.o
[ 26%] Built target validate_ascii_basic_tests
[ 26%] Built target validate_utf8_basic_tests
[ 27%] Building CXX object tests/CMakeFiles/validate_utf16le_basic_tests.dir/validate_utf16le_basic_tests.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/validate_utf16be_basic_tests.dir/validate_utf16be_basic_tests.cpp.o
[ 28%] Linking CXX executable validate_utf8_brute_force_tests
[ 29%] Linking CXX executable validate_ascii_with_errors_tests
[ 29%] Built target validate_utf8_brute_force_tests
[ 30%] Building CXX object tests/CMakeFiles/validate_utf16le_with_errors_tests.dir/validate_utf16le_with_errors_tests.cpp.o
[ 30%] Built target validate_ascii_with_errors_tests
[ 31%] Building CXX object tests/CMakeFiles/validate_utf16be_with_errors_tests.dir/validate_utf16be_with_errors_tests.cpp.o
[ 32%] Linking CXX executable bele_tests
[ 33%] Linking CXX executable random_fuzzer
[ 33%] Built target bele_tests
[ 33%] Building CXX object tests/CMakeFiles/validate_utf32_basic_tests.dir/validate_utf32_basic_tests.cpp.o
[ 33%] Built target random_fuzzer
[ 33%] Linking CXX executable special_tests
[ 33%] Building CXX object tests/CMakeFiles/validate_utf32_with_errors_tests.dir/validate_utf32_with_errors_tests.cpp.o
[ 33%] Built target special_tests
[ 34%] Building CXX object tests/CMakeFiles/convert_latin1_to_utf8_tests.dir/convert_latin1_to_utf8_tests.cpp.o
[ 34%] Linking CXX executable validate_utf8_puzzler_tests
[ 34%] Built target validate_utf8_puzzler_tests
[ 35%] Building CXX object tests/CMakeFiles/convert_latin1_to_utf16le_tests.dir/convert_latin1_to_utf16le_tests.cpp.o
[ 36%] Linking CXX executable validate_utf8_with_errors_tests
[ 36%] Built target validate_utf8_with_errors_tests
[ 36%] Building CXX object tests/CMakeFiles/convert_latin1_to_utf16be_tests.dir/convert_latin1_to_utf16be_tests.cpp.o
[ 36%] Linking CXX executable validate_utf16be_basic_tests
[ 36%] Built target validate_utf16be_basic_tests
[ 36%] Linking CXX executable convert_latin1_to_utf8_tests
[ 37%] Building CXX object tests/CMakeFiles/convert_latin1_to_utf32_tests.dir/convert_latin1_to_utf32_tests.cpp.o
[ 37%] Linking CXX executable validate_utf16le_basic_tests
[ 38%] Linking CXX executable validate_utf32_basic_tests
[ 38%] Built target convert_latin1_to_utf8_tests
[ 38%] Building CXX object tests/CMakeFiles/convert_utf8_to_latin1_tests.dir/convert_utf8_to_latin1_tests.cpp.o
[ 38%] Built target validate_utf32_basic_tests
[ 38%] Building CXX object tests/CMakeFiles/convert_utf8_to_latin1_with_errors_tests.dir/convert_utf8_to_latin1_with_errors_tests.cpp.o
[ 38%] Linking CXX executable validate_utf16be_with_errors_tests
[ 39%] Linking CXX executable validate_utf16le_with_errors_tests
[ 40%] Linking CXX executable validate_utf32_with_errors_tests
[ 40%] Built target validate_utf16be_with_errors_tests
[ 41%] Building CXX object tests/CMakeFiles/convert_valid_utf8_to_latin1_tests.dir/convert_valid_utf8_to_latin1_tests.cpp.o
[ 41%] Built target validate_utf16le_with_errors_tests
[ 41%] Built target validate_utf32_with_errors_tests
[ 41%] Linking CXX executable convert_latin1_to_utf16le_tests
[ 41%] Building CXX object tests/CMakeFiles/convert_valid_utf8_to_utf16le_tests.dir/convert_valid_utf8_to_utf16le_tests.cpp.o
[ 41%] Building CXX object tests/CMakeFiles/convert_valid_utf8_to_utf16be_tests.dir/convert_valid_utf8_to_utf16be_tests.cpp.o
[ 41%] Built target validate_utf16le_basic_tests
[ 41%] Building CXX object tests/CMakeFiles/convert_valid_utf8_to_utf32_tests.dir/convert_valid_utf8_to_utf32_tests.cpp.o
[ 41%] Built target convert_latin1_to_utf16le_tests
[ 42%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf16le_tests.dir/convert_utf8_to_utf16le_tests.cpp.o
[ 43%] Linking CXX executable convert_latin1_to_utf16be_tests
[ 43%] Built target convert_latin1_to_utf16be_tests
[ 43%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf16be_tests.dir/convert_utf8_to_utf16be_tests.cpp.o
[ 43%] Linking CXX executable convert_latin1_to_utf32_tests
[ 43%] Built target convert_latin1_to_utf32_tests
[ 44%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf16le_with_errors_tests.dir/convert_utf8_to_utf16le_with_errors_tests.cpp.o
[ 45%] Linking CXX executable convert_utf8_to_latin1_tests
[ 46%] Linking CXX executable convert_valid_utf8_to_latin1_tests
[ 46%] Built target convert_utf8_to_latin1_tests
[ 47%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf16be_with_errors_tests.dir/convert_utf8_to_utf16be_with_errors_tests.cpp.o
[ 47%] Built target convert_valid_utf8_to_latin1_tests
[ 47%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf32_tests.dir/convert_utf8_to_utf32_tests.cpp.o
[ 48%] Linking CXX executable convert_utf8_to_latin1_with_errors_tests
[ 49%] Linking CXX executable convert_valid_utf8_to_utf16be_tests
[ 49%] Built target convert_utf8_to_latin1_with_errors_tests
[ 49%] Building CXX object tests/CMakeFiles/convert_utf8_to_utf32_with_errors_tests.dir/convert_utf8_to_utf32_with_errors_tests.cpp.o
[ 49%] Built target convert_valid_utf8_to_utf16be_tests
[ 50%] Building CXX object tests/CMakeFiles/convert_utf16le_to_latin1_tests.dir/convert_utf16le_to_latin1_tests.cpp.o
[ 51%] Linking CXX executable convert_valid_utf8_to_utf32_tests
[ 52%] Linking CXX executable convert_valid_utf8_to_utf16le_tests
[ 52%] Built target convert_valid_utf8_to_utf32_tests
[ 52%] Built target convert_valid_utf8_to_utf16le_tests
[ 53%] Building CXX object tests/CMakeFiles/convert_utf16be_to_latin1_tests.dir/convert_utf16be_to_latin1_tests.cpp.o
[ 53%] Building CXX object tests/CMakeFiles/convert_utf16le_to_latin1_tests_with_errors.dir/convert_utf16le_to_latin1_tests_with_errors.cpp.o
[ 53%] Linking CXX executable convert_utf8_to_utf16le_tests
[ 54%] Linking CXX executable convert_utf8_to_utf16be_tests
[ 54%] Built target convert_utf8_to_utf16le_tests
[ 54%] Built target convert_utf8_to_utf16be_tests
[ 54%] Building CXX object tests/CMakeFiles/convert_utf16be_to_latin1_tests_with_errors.dir/convert_utf16be_to_latin1_tests_with_errors.cpp.o
[ 55%] Building CXX object tests/CMakeFiles/convert_valid_utf16le_to_latin1_tests.dir/convert_valid_utf16le_to_latin1_tests.cpp.o
[ 56%] Linking CXX executable convert_utf8_to_utf32_tests
[ 56%] Built target convert_utf8_to_utf32_tests
[ 56%] Building CXX object tests/CMakeFiles/convert_valid_utf16be_to_latin1_tests.dir/convert_valid_utf16be_to_latin1_tests.cpp.o
[ 57%] Linking CXX executable convert_utf8_to_utf16le_with_errors_tests
[ 57%] Built target convert_utf8_to_utf16le_with_errors_tests
[ 58%] Building CXX object tests/CMakeFiles/convert_utf16le_to_utf8_tests.dir/convert_utf16le_to_utf8_tests.cpp.o
[ 59%] Linking CXX executable convert_utf16le_to_latin1_tests
[ 60%] Linking CXX executable convert_utf16le_to_latin1_tests_with_errors
[ 60%] Built target convert_utf16le_to_latin1_tests
[ 61%] Building CXX object tests/CMakeFiles/convert_utf16be_to_utf8_tests.dir/convert_utf16be_to_utf8_tests.cpp.o
[ 62%] Linking CXX executable convert_utf16be_to_latin1_tests
[ 63%] Linking CXX executable convert_utf8_to_utf32_with_errors_tests
[ 63%] Linking CXX executable convert_utf8_to_utf16be_with_errors_tests
[ 63%] Linking CXX executable convert_valid_utf16le_to_latin1_tests
[ 63%] Built target convert_utf16le_to_latin1_tests_with_errors
[ 64%] Building CXX object tests/CMakeFiles/convert_utf16le_to_utf8_with_errors_tests.dir/convert_utf16le_to_utf8_with_errors_tests.cpp.o
[ 64%] Built target convert_utf16be_to_latin1_tests
[ 65%] Building CXX object tests/CMakeFiles/convert_utf16be_to_utf8_with_errors_tests.dir/convert_utf16be_to_utf8_with_errors_tests.cpp.o
[ 65%] Built target convert_utf8_to_utf32_with_errors_tests
[ 65%] Built target convert_utf8_to_utf16be_with_errors_tests
[ 66%] Building CXX object tests/CMakeFiles/convert_utf32_to_latin1_tests.dir/convert_utf32_to_latin1_tests.cpp.o
[ 66%] Built target convert_valid_utf16le_to_latin1_tests
[ 67%] Linking CXX executable convert_utf16be_to_latin1_tests_with_errors
[ 67%] Building CXX object tests/CMakeFiles/convert_valid_utf32_to_latin1_tests.dir/convert_valid_utf32_to_latin1_tests.cpp.o
[ 67%] Building CXX object tests/CMakeFiles/convert_utf32_to_latin1_with_errors_tests.dir/convert_utf32_to_latin1_with_errors_tests.cpp.o
[ 67%] Built target convert_utf16be_to_latin1_tests_with_errors
[ 68%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf8_tests.dir/convert_utf32_to_utf8_tests.cpp.o
[ 69%] Linking CXX executable convert_valid_utf16be_to_latin1_tests
[ 69%] Built target convert_valid_utf16be_to_latin1_tests
[ 70%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf8_with_errors_tests.dir/convert_utf32_to_utf8_with_errors_tests.cpp.o
[ 71%] Linking CXX executable convert_valid_utf32_to_latin1_tests
[ 72%] Linking CXX executable convert_utf32_to_latin1_with_errors_tests
[ 72%] Built target convert_valid_utf32_to_latin1_tests
[ 73%] Linking CXX executable convert_utf32_to_latin1_tests
[ 74%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf16le_tests.dir/convert_utf32_to_utf16le_tests.cpp.o
[ 74%] Built target convert_utf32_to_latin1_with_errors_tests
[ 74%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf16be_tests.dir/convert_utf32_to_utf16be_tests.cpp.o
[ 74%] Built target convert_utf32_to_latin1_tests
[ 75%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf16le_with_errors_tests.dir/convert_utf32_to_utf16le_with_errors_tests.cpp.o
[ 75%] Linking CXX executable convert_utf16le_to_utf8_with_errors_tests
[ 75%] Linking CXX executable convert_utf16le_to_utf8_tests
[ 75%] Linking CXX executable convert_utf32_to_utf8_tests
[ 75%] Built target convert_utf16le_to_utf8_with_errors_tests
[ 75%] Building CXX object tests/CMakeFiles/convert_utf32_to_utf16be_with_errors_tests.dir/convert_utf32_to_utf16be_with_errors_tests.cpp.o
[ 75%] Built target convert_utf16le_to_utf8_tests
[ 75%] Building CXX object tests/CMakeFiles/convert_valid_utf16le_to_utf8_tests.dir/convert_valid_utf16le_to_utf8_tests.cpp.o
[ 75%] Built target convert_utf32_to_utf8_tests
[ 75%] Linking CXX executable convert_utf16be_to_utf8_with_errors_tests
[ 76%] Building CXX object tests/CMakeFiles/convert_valid_utf16be_to_utf8_tests.dir/convert_valid_utf16be_to_utf8_tests.cpp.o
[ 76%] Built target convert_utf16be_to_utf8_with_errors_tests
[ 76%] Linking CXX executable convert_utf16be_to_utf8_tests
[ 77%] Building CXX object tests/CMakeFiles/convert_valid_utf32_to_utf8_tests.dir/convert_valid_utf32_to_utf8_tests.cpp.o
[ 77%] Built target convert_utf16be_to_utf8_tests
[ 78%] Building CXX object tests/CMakeFiles/convert_valid_utf32_to_utf16le_tests.dir/convert_valid_utf32_to_utf16le_tests.cpp.o
[ 79%] Linking CXX executable convert_utf32_to_utf8_with_errors_tests
[ 79%] Built target convert_utf32_to_utf8_with_errors_tests
[ 80%] Building CXX object tests/CMakeFiles/convert_valid_utf32_to_utf16be_tests.dir/convert_valid_utf32_to_utf16be_tests.cpp.o
[ 80%] Linking CXX executable convert_utf32_to_utf16le_tests
[ 81%] Linking CXX executable convert_utf32_to_utf16be_tests
[ 81%] Linking CXX executable convert_utf32_to_utf16le_with_errors_tests
[ 81%] Built target convert_utf32_to_utf16le_tests
[ 81%] Building CXX object tests/CMakeFiles/convert_utf16le_to_utf32_tests.dir/convert_utf16le_to_utf32_tests.cpp.o
[ 81%] Built target convert_utf32_to_utf16be_tests
[ 81%] Built target convert_utf32_to_utf16le_with_errors_tests
[ 81%] Building CXX object tests/CMakeFiles/convert_utf16be_to_utf32_tests.dir/convert_utf16be_to_utf32_tests.cpp.o
[ 81%] Building CXX object tests/CMakeFiles/convert_utf16le_to_utf32_with_errors_tests.dir/convert_utf16le_to_utf32_with_errors_tests.cpp.o
[ 82%] Linking CXX executable convert_utf32_to_utf16be_with_errors_tests
[ 82%] Linking CXX executable convert_valid_utf32_to_utf8_tests
[ 82%] Linking CXX executable convert_valid_utf32_to_utf16le_tests
[ 82%] Built target convert_utf32_to_utf16be_with_errors_tests
[ 83%] Building CXX object tests/CMakeFiles/convert_utf16be_to_utf32_with_errors_tests.dir/convert_utf16be_to_utf32_with_errors_tests.cpp.o
[ 83%] Built target convert_valid_utf32_to_utf8_tests
[ 84%] Building CXX object tests/CMakeFiles/convert_valid_utf16le_to_utf32_tests.dir/convert_valid_utf16le_to_utf32_tests.cpp.o
[ 84%] Built target convert_valid_utf32_to_utf16le_tests
[ 85%] Building CXX object tests/CMakeFiles/convert_valid_utf16be_to_utf32_tests.dir/convert_valid_utf16be_to_utf32_tests.cpp.o
[ 86%] Linking CXX executable convert_valid_utf16le_to_utf8_tests
[ 86%] Linking CXX executable convert_valid_utf16be_to_utf8_tests
[ 86%] Built target convert_valid_utf16le_to_utf8_tests
[ 87%] Building CXX object tests/CMakeFiles/count_utf8.dir/count_utf8.cpp.o
[ 87%] Linking CXX executable convert_valid_utf32_to_utf16be_tests
[ 87%] Built target convert_valid_utf16be_to_utf8_tests
[ 88%] Building CXX object tests/CMakeFiles/count_utf16le.dir/count_utf16le.cpp.o
[ 88%] Built target convert_valid_utf32_to_utf16be_tests
[ 89%] Building CXX object tests/CMakeFiles/count_utf16be.dir/count_utf16be.cpp.o
[ 90%] Linking CXX executable convert_utf16le_to_utf32_with_errors_tests
[ 90%] Built target convert_utf16le_to_utf32_with_errors_tests
[ 90%] Building CXX object tests/CMakeFiles/detect_encodings_tests.dir/detect_encodings_tests.cpp.o
[ 91%] Linking CXX executable convert_utf16le_to_utf32_tests
[ 91%] Built target convert_utf16le_to_utf32_tests
[ 92%] Building CXX object tests/CMakeFiles/basic_fuzzer.dir/basic_fuzzer.cpp.o
[ 93%] Linking CXX executable count_utf8
[ 94%] Linking CXX executable convert_utf16be_to_utf32_tests
[ 94%] Built target count_utf8
[ 95%] Building CXX object benchmarks/src/CMakeFiles/simdutf_benchmarks_benchmark.dir/benchmark_base.cpp.o
[ 95%] Linking CXX executable count_utf16le
[ 96%] Linking CXX executable convert_valid_utf16le_to_utf32_tests
[ 96%] Linking CXX executable convert_utf16be_to_utf32_with_errors_tests
[ 96%] Built target convert_utf16be_to_utf32_tests
[ 96%] Building CXX object benchmarks/src/CMakeFiles/simdutf_benchmarks_benchmark.dir/cmdline.cpp.o
[ 96%] Linking CXX executable convert_valid_utf16be_to_utf32_tests
[ 96%] Built target count_utf16le
[ 96%] Built target convert_valid_utf16le_to_utf32_tests
[ 97%] Building CXX object benchmarks/src/CMakeFiles/simdutf_benchmarks_benchmark.dir/benchmark.cpp.o
[ 97%] Built target convert_utf16be_to_utf32_with_errors_tests
[ 97%] Built target convert_valid_utf16be_to_utf32_tests
[ 97%] Linking CXX executable count_utf16be
[ 97%] Built target count_utf16be
[ 98%] Linking CXX executable detect_encodings_tests
[ 98%] Built target detect_encodings_tests
[ 99%] Linking CXX executable basic_fuzzer
[ 99%] Built target basic_fuzzer
[ 99%] Linking CXX executable amalgamation_demo
[ 99%] Built target amalgamation_demo
[ 99%] Linking CXX static library libsimdutf_benchmarks_benchmark.a
[ 99%] Built target simdutf_benchmarks_benchmark
[ 99%] Building CXX object benchmarks/CMakeFiles/benchmark.dir/benchmark.cpp.o
[100%] Linking CXX executable benchmark
[100%] Built target benchmark
[root@c0315012778a build]# 

kevincai avatar Aug 29 '23 14:08 kevincai

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

github-actions[bot] avatar Feb 26 '24 11:02 github-actions[bot]

Note that simdutf is used by Node.js and Bun, and various other systems, in production... and it has been used in production for several years.

We also have fast base64 decoding functions (they are what Node.js uses to decode base64) as part of simdutf now!!!

lemire avatar Jun 18 '24 12:06 lemire

validate_utf8+haswell, input size: 4979846, iterations: 100, dataset: all-in1.txt
   8.765 GB/s (0.8 %)    7.307 Gc/s     1.20 byte/char 
validate_utf8_sr+haswell, input size: 4979846, iterations: 100, dataset: all-in1.txt
   1.069 GB/s (1.3 %)    0.891 Gc/s     1.20 byte/char 
count_utf8+haswell, input size: 4979846, iterations: 1000, dataset: all-in1.txt
  25.277 GB/s (1.2 %)   21.073 Gc/s     1.20 byte/char 
count_utf8_sr+haswell, input size: 4979846, iterations: 1000, dataset: all-in1.txt
   7.643 GB/s (0.4 %)    6.372 Gc/s     1.20 byte/char 

That is roughly a 3x to 8x speedup compared to the current implementation in SR.
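As a quick sanity check on that range, the ratios can be computed from the GB/s figures in the output above. This is only a sketch: speedup is a hypothetical helper name, and it assumes the _sr rows are the current SR implementation.

```shell
#!/usr/bin/env bash
# speedup NEW OLD: print NEW/OLD rounded to one decimal place.
speedup() {
  awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", new / old }'
}

# Throughput values (GB/s) copied from the benchmark output.
echo "validate_utf8: $(speedup 8.765 1.069)x"   # validate_utf8 vs validate_utf8_sr
echo "count_utf8:    $(speedup 25.277 7.643)x"  # count_utf8 vs count_utf8_sr
```

Validation accounts for the upper end of the range and counting for the lower end.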

kevincai avatar Jun 18 '24 22:06 kevincai

It's around 7x~8x times boost compared to current implementation in SR.

Interesting.

lemire avatar Jun 18 '24 22:06 lemire

@lemire does it match your expectation?

kevincai avatar Jun 18 '24 22:06 kevincai

@kevincai Indeed. It is within my expectations... meaning that I expect that your results are correct. Yet these are good results.

lemire avatar Jun 18 '24 22:06 lemire