fluent-bit cmake: fix UNICODE-escaped characters on aarch64

Build with -fsigned-char on aarch64 to resolve issue: https://github.com/fluent/fluent-bit/issues/8521

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

May 22 '24 01:05 RamaMalladiAWS

Please sort the DCO failure, it cannot be merged without that.

May 22 '24 10:05 patrick-stephens

Please sort the DCO failure, it cannot be merged without that.

Done @patrick-stephens . Thanks

May 22 '24 23:05 RamaMalladiAWS

Let me check on this.

May 23 '24 06:05 cosmo0920

This patch makes to be able to process emoji on aarch64 machine with the posted repro in #8521.

% cat utf8.log
🌈
% bin/fluent-bit -i tail -p read_from_head=true -p path=utf8.log -o file -p file=output.log --verbose 
^C
% cat output.log
tail.0: [1716448524.105461223, {"log":"🌈"}]
% uname -m
aarch64

May 23 '24 07:05 cosmo0920

Please let me know if any further action required to merge this PR. I do see some sporadic unittest failures for [run-macos-unit-tests] which don't seem to be related to this code change. Thank you.

May 23 '24 11:05 RamaMalladiAWS

are there any other potential side effects of this change ?

I could be wrong, but it seems to me the problem is in another side...

May 23 '24 18:05 edsiper

are there any other potential side effects of this change ?

I could be wrong, but it seems to me the problem is in another side...

Some of the details are posted on Arm website: unsigned-char-and-signed-char. So, we have 2 solutions: Fix the char usage or the compiler flag -fsigned-char can be used.

For example, the PR https://github.com/fluent/fluent-bit/pull/3522 tries char usage fix.

May 23 '24 18:05 RamaMalladiAWS

From the ARM website, this could be caused by compatibility against the old ARM instructions:

The ANSI C standard specifies a range for both signed (at least -127 to +127) and unsigned (at least 0 to 255) chars. Simple chars are not specifically defined and it is compiler dependent whether they are signed or unsigned. Although the ARM architecture has the LDRSB instruction, that loads a signed byte into a 32-bit register with sign extension, the earliest versions of the architecture did not. It made sense at the time for the compiler to treat simple chars as unsigned, whereas on the x86 simple chars are, by default, treated as signed.

Also, it is mentioned as a workaround:

One workaround for users of GCC is to use the -fsigned-char command line switch or --signed-chars for RVCT, that forces all chars to become signed, but a better practice is to write portable code by declaring char variables appropriately. Unsigned char must be used for accessing memory as a block of bytes or for small unsigned integers. Signed char must be used for small signed integers and simple char must be used only for ASCII characters and strings. In fact, on an ARM core, it is usually better to use ints rather than chars, even for small values, for performance reasons. You can read more on this in Optimizing Code to Run on ARM Processors.

ref: https://developer.arm.com/documentation/den0013/d/Porting/Miscellaneous-C-porting-issues/unsigned-char-and-signed-char

However, my two cents, handling as signed char should align the handlings of char type like as x86 processors. Or, maybe we have to introduce an option to turn on this signed char by users?

May 27 '24 07:05 cosmo0920

@cosmo0920 , thank you for the review. @niedbalski , @fujimotos and @celalettin1286 , can you please review this PR?

I would like to merge this PR or make any additional changes required. Thanks again.

May 30 '24 16:05 RamaMalladiAWS

@cosmo0920 , thank you for the review. @niedbalski , @fujimotos and @celalettin1286 , can you please review this PR?

I would like to merge this PR or make any additional changes required. Thanks again.

Checking again... Thanks

Jun 05 '24 11:06 RamaMalladiAWS

How about adding a new option only for Linux on arm architectures?

diff --git a/CMakeLists.txt b/CMakeLists.txt
index b3e7a2585..09807f595 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -30,6 +30,10 @@ endif()
 if(CMAKE_SYSTEM_NAME MATCHES "Linux")
   set(FLB_SYSTEM_LINUX On)
   add_definitions(-DFLB_SYSTEM_LINUX)
+  if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm64|aarch64)")
+    set(FLB_LINUX_ON_ARM On)
+    add_definitions(-DFLB_LINUX_ON_ARM)
+  endif()
 endif()
 
 # Update CFLAGS
@@ -146,6 +150,9 @@ option(FLB_WINDOWS_DEFAULTS    "Build with predefined Windows settings"     Yes)
 option(FLB_WASM                "Build with WASM runtime support"            Yes)
 option(FLB_WAMRC               "Build with WASM AOT compiler executable"    No)
 option(FLB_WASM_STACK_PROTECT  "Build with WASM runtime with strong stack protector flags" No)
+if (FLB_LINUX_ON_ARM)
+  option(FLB_PREFER_SIGNED_CHAR  "Build with signed char (Linux on ARM only)"     No)
+endif()
 
 # Native Metrics Support (cmetrics)
 option(FLB_METRICS             "Enable metrics support"       Yes)
@@ -405,6 +412,14 @@ if (FLB_SYSTEM_LINUX)
   include(cmake/s390x.cmake)
 endif ()
 
+# Enable signed char support on Linux ARM if specified
+if (FLB_LINUX_ON_ARM)
+  if (FLB_PREFER_SIGNED_CHAR)
+    set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsigned-char")
+    message(STATUS "Enabling signed char")
+  endif()
+endif()
+
 # Extract Git commit information for debug output.
 # Note that this is only set when cmake is run, the intent here is to use in CI for verification of releases so is acce
ptable.
 # For a better solution see https://jonathanhamberg.com/post/cmake-embedding-git-hash/ but this is simple and easy.

This could be reasonable for the most cases and the problem point on ARM is: LDRSB instruction is not always existing in ARM processors. So, at this moment, providing only for ARM option would be reasonable to add, I guess.

Thanks @cosmo0920. So, option FLB_PREFER_SIGNED_CHAR set to Yes resolves the issue on arm64/ aarch64.

Are there any other issues blocking merge of this PR? Thank you.

Jun 11 '24 12:06 RamaMalladiAWS

@cosmo0920, I validated the change you suggested in https://github.com/fluent/fluent-bit/pull/8851#pullrequestreview-2110095229 and pushed https://github.com/fluent/fluent-bit/pull/8851/commits/253254ff9b178ab2c5c89d1185ab461ca3a2ee94.

Jun 11 '24 15:06 RamaMalladiAWS

Thanks @cosmo0920 for approving the changes. I see all check passed too. So, we would need 1 more review approval for merging this PR? Can you help me with the same? Thanks again.

Jun 12 '24 12:06 RamaMalladiAWS

@edsiper I think this is down to your preference here. I'm not sure if we should just have the flag regardless but only set it on ARM just for future usage on other platforms?

Jun 14 '24 13:06 patrick-stephens

The problematic behavior is only for ARM and ARM64 currently. As far as I know, RISC-V 64bit is another platform to decide unsigned char is the default behavior of char type.

This unsigned char by default behavior is for historical reasons. Because the earlier ARM processor does not have LDRSB instruction in their processors.

Unsigned char must be used for accessing memory as a block of bytes or for small unsigned integers. Signed char must be used for small signed integers and simple char must be used only for ASCII characters and strings.

ref: https://developer.arm.com/documentation/den0013/d/Porting/Miscellaneous-C-porting-issues/unsigned-char-and-signed-char

In fact, on an ARM core, it is usually better to use ints rather than chars, even for small values, for performance reasons.

If we use char as for the small regions to load/store values, we need to adopt these usages. However, without specifying the -fsigned-char, the emoji set of Unicode should be broken in ARM/ARM64 environments. Also, macOS Apple Silicon uses signed char when just specifying char in our source code.

nitpick: Another signed-char case is already existing that is for x390s: https://github.com/fluent/fluent-bit/blob/5ce9c43bcafa60dba38088fb76d67b9a0619525c/cmake/s390x.cmake#L3

Jun 16 '24 15:06 cosmo0920

Shouldn't we handle all these build flags automatically ? I mean, why to pass an option for the user ? I think this is a very low level thing.

How other projects are handling this ?

Jun 25 '24 02:06 edsiper

Shouldn't we handle all these build flags automatically ? I mean, why to pass an option for the user ? I think this is a very low level thing.

How other projects are handling this ?

Yes, we could pass this option -fsigned-char automatically on Arm processor platform discovery: (CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm64|aarch64)"). That was the first change proposed in this PR. Then we added an extra check for FLB_PREFER_SIGNED_CHAR giving the user an option to use -fsigned-char in the build.

Some of the other projects that I checked handle this by using -fsigned-char option in their builds for arm64|aarch64.

Jun 25 '24 12:06 RamaMalladiAWS

@edsiper, it would be great if this PR can be reviewed and approved. Let me know if further changes required. Thanks.

Jun 27 '24 19:06 RamaMalladiAWS

This is breaking emoji's in our logs for us, so we're keen to get this merged too.

Aug 15 '24 07:08 mm-zacharydavison

This is breaking Chinese characters in our logs for us, we are waiting eagle for it to get merged.

Aug 19 '24 02:08 ensean

This issue needs to be resolved ASAP. we really need support for unicode characters and emojis, regardless of whether it's x86 or ARM architecture.

Aug 19 '24 08:08 aleck31

I was unaware of this compiler dependent aspect but unless I'm missing something most if not all of us work under the expectation that char means signed char so I wonder if we shouldn't make this platform agnostic amongst supported compilers (gcc and clang?)

Side note : according to msdn, msvc already defaults char to be signed and user need to opt-in to make it unsigned so that one doesn't seem to be addressed and kind of makes me think that regardless of the platform we should adopt this to ensure that we have consistent behavior and expectations.

@patrick-stephens and @edsiper, could you please chime in?

I think it's a general recommendation to never rely on size or signed-ness of primitives in code, we should be explicit. Older compilers/standards particularly just pick an "optimal" type down to the underlying compiler/OS as to what the optimum is based on.

msvc may make it's own decisions but I'd say that's not the norm - plus they have other funny ideas if you want to start messing with default calling conventions too. We should be explicit either with a type definition or a compiler flag to force the behaviour we are expecting.

Aug 20 '24 12:08 patrick-stephens

From Apple documentation, macOS apple silicon (arm64 macOS) also selects signed char: https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms#Handle-data-types-and-data-alignment-properly

I found that there is still a linking issue for using signed char opt-in in RISC-V. So, we're only able to use opt-in feature of signed char in aarch64. However, we shouldn't forcibly use signed char for other non-x86 platforms.

Aug 21 '24 09:08 cosmo0920

My point is that given the assumptions made in our codebase it would make sense to make signed-char opt-out by default en every platform that allows it.

Aug 21 '24 12:08 leonardo-albertovich

@ensean would you please re-phrase what your previous message means ? is it a breaking change for Chinese support or you mean it also fixes a problem with Chinese characters ?

Aug 27 '24 14:08 edsiper

@ensean would you please re-phrase what your previous message means ? is it a breaking change for Chinese support or you mean it also fixes a problem with Chinese characters ?

Sorry for the confusion I make.

My case is that we have Chinese characters in the logs, when our App runs on x86 servers, the Chinese logs can be handled by fluent-bit correctly. In order to save cost, we are planning to change to arm based instances(Graviton on aws). During the PoC, everything works but we found that the Chinese logs can not be handled by fluent-bit correctly, for example 好雨知时节 is escaped as \u597d\u96e8\u77e5\u65f6\u8282, which is preventing us to adopt the arm based servers.

As more and more cloud providers are offering arm based servers for better price performance, I think there will be more fluent-bit users encounter this unicode-escaped issue. So I think it's necessary to get this PR merged.

Aug 27 '24 16:08 ensean

@ensean thanks for describing the current problem. With the proposed changes in this PR does the problem go away ?

Sep 05 '24 22:09 edsiper

@edsiper Yep, confirmed that this PR can solve my problem.

Sep 06 '24 02:09 ensean

@RamaMalladiAWS Hi, we added aarch64 CI task to confirm internal test cases should be exit normally on aarch64. So, with this workflow, we can use and confirm this PR's patch the following modification:

diff --git a/.github/workflows/unit-tests.yaml b/.github/workflows/unit-tests.yaml
index 0fa50f118..49593a696 100644
--- a/.github/workflows/unit-tests.yaml
+++ b/.github/workflows/unit-tests.yaml
@@ -125,7 +125,7 @@ jobs:
         config:
           - name: "Aarch64 actuated testing"
             flb_option: "-DFLB_WITHOUT_flb-it-network=1 -DFLB_WITHOUT_flb-it-fstore=1"
-            omit_option: "-DFLB_WITHOUT_flb-it-utils=1 -DFLB_WITHOUT_flb-it-pack=1"
+            omit_option: ""
             global_option: "-DFLB_BACKTRACE=Off -DFLB_SHARED_LIB=Off -DFLB_DEBUG=On -DFLB_ALL=On -DFLB_EXAMPLES=Off"
             unit_test_option: "-DFLB_TESTS_INTERNAL=On"
             compiler: gcc
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9e42d4faf..d1d6b33b5 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -30,6 +30,10 @@ endif()
 if(CMAKE_SYSTEM_NAME MATCHES "Linux")
   set(FLB_SYSTEM_LINUX On)
   add_definitions(-DFLB_SYSTEM_LINUX)
+  if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm64|aarch64)")
+    set(FLB_LINUX_ON_AARCH64 On)
+    add_definitions(-DFLB_LINUX_ON_AARCH64)
+  endif()
 endif()
 
 # Update CFLAGS
@@ -301,6 +305,12 @@ if (FLB_SYSTEM_LINUX)
   include(cmake/s390x.cmake)
 endif ()
 
+# Enable signed char support on Linux AARCH64 if specified
+if (FLB_LINUX_ON_AACH64)
+  set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsigned-char")
+  message(STATUS "Enabling signed char")
+endif()
+
 # Extract Git commit information for debug output.
 # Note that this is only set when cmake is run, the intent here is to use in CI for verification of releases so is acceptable.
 # For a better solution see https://jonathanhamberg.com/post/cmake-embedding-git-hash/ but this is simple and easy.

And could you rebase off the current master? This could help to confirm that characters which are having 2 or more bytes should be normally handled in aarch64 Linux.

Sep 06 '24 02:09 cosmo0920

@RamaMalladiAWS Hi, we added aarch64 CI task to confirm internal test cases should be exit normally on aarch64. So, with this workflow, we can use and confirm this PR's patch the following modification:

@cosmo0920 , I rebased my changes to the latest master and pushed. Would you want me to change the PR itself to the code you have here: https://github.com/fluent/fluent-bit/pull/8851#issuecomment-2333088177? Thanks

Sep 11 '24 17:09 RamaMalladiAWS

fluent-bit fluent-bit copied to clipboard

cmake: fix UNICODE-escaped characters on aarch64

fluent-bit
fluent-bit copied to clipboard