
simdutf: simdutf_connector: in_tail: Implement UTF-16LE/UTF-16BE encoder

Open cosmo0920 opened this issue 1 year ago • 4 comments

On Windows, many programs emit UTF-16LE, because "Unicode" on Windows conventionally means UTF-16LE with a BOM (byte order mark). UTF-16LE/UTF-16BE also differ from UTF-8 in many ways. in_tail is the main place where non-UTF-8 encodings need to be processed, so I added unit tests to the in_tail plugin that convert UTF-16LE/UTF-16BE to UTF-8, covering the C, J, and subdivision flags test cases. As a first step, we need to handle the UTF-16LE and UTF-16BE encodings.
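To make the difference between the encodings concrete, here is a minimal scalar sketch of BOM detection and UTF-16 to UTF-8 conversion, including surrogate pairs such as the emoji in the test files. It is not the code in this PR, which delegates conversion to simdutf. The names `Bom`, `detect_bom`, `append_utf8`, and `utf16_to_utf8` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encodings this sketch recognizes from a byte order mark (UTF-32 is ignored).
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
inline Bom detect_bom(const unsigned char *p, std::size_t n) {
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// Append one Unicode scalar value to `out` as 1 to 4 UTF-8 bytes.
inline void append_utf8(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 bytes (with any BOM already skipped) to UTF-8, appending to `out`.
// Returns false on an odd byte count or an unpaired surrogate; `out` may then
// hold a partial result.
inline bool utf16_to_utf8(const unsigned char *p, std::size_t n,
                          bool big_endian, std::string &out) {
    if (n % 2 != 0) return false;
    // Read the 16-bit code unit that starts at byte offset i.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        return big_endian ? (std::uint32_t(p[i]) << 8) | p[i + 1]
                          : (std::uint32_t(p[i + 1]) << 8) | p[i];
    };
    for (std::size_t i = 0; i < n; i += 2) {
        std::uint32_t u = unit(i);
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 4 > n) return false;             // no room for the low half
            std::uint32_t lo = unit(i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF) return false;
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;                                  // consumed two code units
        } else if (u >= 0xDC00 && u <= 0xDFFF) {     // stray low surrogate
            return false;
        }
        append_utf8(out, u);
    }
    return true;
}
```

A caller would first run `detect_bom` on the start of the file, skip the two BOM bytes for UTF-16, and pass the remaining bytes to `utf16_to_utf8` with `big_endian` set according to the detected BOM.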

Note that the simdutf library is written in C++, so we also provide an option (FLB_UNICODE_ENCODER) to turn this feature on or off.

Closes https://github.com/fluent/fluent-bit/issues/9321


Enter [N/A] in the box, if an item is not applicable to your change.

Testing

Before we can approve your change, please submit the following in a comment:

  • [x] Example configuration file for the change
[SERVICE]
   flush           1
   log_level       trace

[INPUT]
   Name              tail
   Path              <path/to/non-UTF-8_encoded_file.log>
   Read_from_Head    True
   Unicode.Encoding  auto

[OUTPUT]
   Name  stdout
   Match *
  • [x] Debug log output from testing the change
Fluent Bit v3.2.3
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _           _____  _____ 
|  ___| |                | |   | ___ (_) |         |____ |/ __  \
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`' / /'
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \  / /  
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /./ /___
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)_____/


[2024/12/19 14:42:52] [ info] Configuration:
[2024/12/19 14:42:52] [ info]  flush time     | 1.000000 seconds
[2024/12/19 14:42:52] [ info]  grace          | 5 seconds
[2024/12/19 14:42:52] [ info]  daemon         | 0
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  inputs:
[2024/12/19 14:42:52] [ info]      tail
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  filters:
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  outputs:
[2024/12/19 14:42:52] [ info]      stdout.0
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  collectors:
[2024/12/19 14:42:52] [ info] [fluent bit] version=3.2.3, commit=de5ee981a2, pid=1225646
[2024/12/19 14:42:52] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/12/19 14:42:52] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/12/19 14:42:52] [ info] [simd    ] SSE2
[2024/12/19 14:42:52] [ info] [cmetrics] version=0.9.9
[2024/12/19 14:42:52] [ info] [ctraces ] version=0.5.7
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] initializing
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/12/19 14:42:52] [debug] [tail:tail.0] created event channels: read=25 write=26
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] adjusted buf_max_size to 4001
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] adjusted buf_chunk_size to 4001
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inotify watch fd=31
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170377 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log, inode 43170377
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log'
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170323 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [ info] [output:stdout:stdout.0] worker #0 started
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log, inode 43170323
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log'
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170324 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log, inode 43170324
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log'
[2024/12/19 14:42:52] [debug] [stdout:stdout.0] created event channels: read=35 write=36
[2024/12/19 14:42:52] [ info] [sp] stream processor started
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=123, records=1, input=tail.0
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=109, records=1, input=tail.0
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=196, records=1, input=tail.0
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] [static files] processed 290b
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170377 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170377 watch_fd=1 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170323 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170323 watch_fd=2 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170324 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170324 watch_fd=3 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [task 0x6177b10] created (id=0)
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [debug] [task] created task=0x6177b10 id=0 OK
[2024/12/19 14:42:52] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] sqlerrorlog: [[1734586972.362419405, {}], {"log"=>"🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁷󠁬󠁳󠁿"}]
[1] sqlerrorlog: [[1734586972.388064603, {}], {"log"=>"用汉字在 Fluent Bit 中处理日志,就像是一个梦一样😀"}]
[2] sqlerrorlog: [[1734586972.389956708, {}], {"log"=>"にほんごテストログふぁいる。文字エンコーディングをUnicodeにできる!?☕😀⚪⚫🔴🔵🟠🟡🟢🟣🟤🇺🇸🇯🇵"}]
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [debug] [out flush] cb_destroy coro_id=0
[2024/12/19 14:42:52] [trace] [coro] destroy coroutine=0x6177db0 data=0x6177dd0
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2024/12/19 14:42:52] [debug] [task] destroy task=0x6177b10 (task_id=0)
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
^C[2024/12/19 14:42:53] [engine] caught signal (SIGINT)
[2024/12/19 14:42:53] [trace] [engine] flush enqueued data
[2024/12/19 14:42:53] [ warn] [engine] service will shutdown in max 5 seconds
[2024/12/19 14:42:53] [ info] [input] pausing tail.0
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [ info] [engine] service has stopped (0 pending tasks)
[2024/12/19 14:42:53] [ info] [input] pausing tail.0
[2024/12/19 14:42:53] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170377 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:53] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170377 watch_fd=1
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170323 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:53] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170323 watch_fd=2
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170324 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
  • [x] Attached Valgrind output that shows no leaks or memory corruption was found
==1225646== 
==1225646== HEAP SUMMARY:
==1225646==     in use at exit: 0 bytes in 0 blocks
==1225646==   total heap usage: 3,463 allocs, 3,463 frees, 1,050,521 bytes allocated
==1225646== 
==1225646== All heap blocks were freed -- no leaks are possible
==1225646== 
==1225646== For lists of detected and suppressed errors, rerun with: -s
==1225646== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [ ] Run local packaging test showing all targets (including any new ones) build.
  • [ ] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [x] Documentation required for this feature

https://github.com/fluent/fluent-bit-docs/pull/1471

Backporting

  • [ ] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

cosmo0920 avatar Oct 07 '24 06:10 cosmo0920

Ah so this means we compile a C++ library as well - I think @pwhelan was looking at something similar for other support.

I was looking at it for generic runtime protobuf support.

pwhelan avatar Oct 10 '24 12:10 pwhelan

@cosmo0920 for normal use cases, are there any performance gains ?

edsiper avatar Oct 20 '24 19:10 edsiper

@cosmo0920 for normal use cases, are there any performance gains ?

Unfortunately, in normal use cases Fluent Bit almost always handles ingested strings that are already UTF-8 encoded. However, for ASCII-only UTF-16LE/BE encoded bytes, simdutf provides a short-circuit that detects ASCII-only bytes even in non-vectorized environments (which they call the scalar environment):

  while (pos < len) {
    // check of the next 16 bytes are ascii.
    uint64_t next_pos = pos + 16;
    if (next_pos <= len) { // if it is safe to read 16 more bytes, check that they are ascii
      uint64_t v1;
      std::memcpy(&v1, data + pos, sizeof(uint64_t));
      uint64_t v2;
      std::memcpy(&v2, data + pos + sizeof(uint64_t), sizeof(uint64_t));
      uint64_t v{v1 | v2};
      if ((v & 0x8080808080808080) == 0) {
        pos = next_pos;
        continue;
      }
    }

https://github.com/fluent/fluent-bit/blob/7f7dcbf8a0f106294d4c2d0683f1d8c69a865bd7/lib/simdutf-amalgamation-5.5.0/src/simdutf/simdutf.cpp#L5178

In addition, simdutf uses AVX-512 and AVX2 instructions to improve performance inside its implementations.
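The excerpt quoted above is cut off mid-loop, so here is a small standalone sketch of the same idea: OR two 64-bit words together and test the high bit of every byte with the mask `0x8080808080808080`. This is an illustration, not simdutf's code. The helper names `block16_is_ascii` and `ascii_prefix_length` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when none of the 16 bytes starting at `data` has its high bit set,
// i.e. every byte is 7-bit ASCII. memcpy avoids unaligned-load problems.
inline bool block16_is_ascii(const unsigned char *data) {
    std::uint64_t v1, v2;
    std::memcpy(&v1, data, sizeof v1);
    std::memcpy(&v2, data + sizeof v1, sizeof v2);
    return ((v1 | v2) & 0x8080808080808080ULL) == 0;
}

// Length of the leading run of ASCII bytes in `data`.
// Skips 16 bytes at a time while a full block is ASCII, then finishes
// byte by byte to find the exact position of the first non-ASCII byte.
inline std::size_t ascii_prefix_length(const unsigned char *data, std::size_t len) {
    std::size_t pos = 0;
    while (pos + 16 <= len && block16_is_ascii(data + pos)) {
        pos += 16;
    }
    while (pos < len && data[pos] < 0x80) {
        ++pos;
    }
    return pos;
}
```

Because the mask looks at each 8-bit byte, a check like this pays off when long stretches of the input have every byte below 0x80; the fast path touches each 16-byte block only once.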

cosmo0920 avatar Oct 21 '24 08:10 cosmo0920

Need to merge #9751 to resolve packaging issues on Github runners.

patrick-stephens avatar Dec 19 '24 11:12 patrick-stephens

It seems that this PR is frozen?

tguenneguez avatar Jan 14 '25 18:01 tguenneguez

It seems that this PR is frozen?

It's just that the merge has been postponed.

cosmo0920 avatar Jan 15 '25 01:01 cosmo0920

Do you have any visibility on which agent version will include this change?

tguenneguez avatar Jan 15 '25 08:01 tguenneguez

Do you have any visibility on which agent version will include this change?

master is currently targeting the 4.0 release: https://github.com/fluent/fluent-bit/wiki/Fluent-Bit-Roadmap

patrick-stephens avatar Jan 15 '25 10:01 patrick-stephens

Do you plan to implement it in inputs/head ?

tguenneguez avatar Mar 03 '25 15:03 tguenneguez

@cosmo0920 is this ready to go ?

edsiper avatar Mar 18 '25 18:03 edsiper

Yes, it's ready to go. I've rebased off master recently.

cosmo0920 avatar Mar 22 '25 02:03 cosmo0920