fluentd icon indicating copy to clipboard operation
fluentd copied to clipboard

in_tail: encoding option does not work

Open pierreact opened this issue 9 months ago • 8 comments

Describe the bug

When trying a regex parse, I couldn't match the plus-minus unicode character (±) I ensured my config file, my log file and my config (encoding UTF-8) all pointed towards utf-8.

All those failed to match. ±, \uc2b1, \xc2\xb1 In despair, I'm matching "unknown character" for now (\uFFFD).

In fluentd docker image, I enabled -vv to get some more insight and discovered the following:

2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]

With fluent-package:

# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor

Shall we have something like -Eutf-8:utf-8 defined instead or maybe an easy wait to override it?

Thank you.

To Reproduce

Run a regex matching unicode.

Expected behavior

Should be handling utf-8 by default but this could be subject to taste. Otherwise, document it (Have I missed it?) and provide an easy way to override it.

Your Environment

- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0

Your Configuration

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Your Error Log

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

Additional context

Here's a log line:

Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±

Here's my source:

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Here's the error I get:

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

If I replace the expression with this, it goes through and is properly split:

expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/

\uFFFD is "unrecognized character".

My fluentd.conf is using UTF-8 My log file is using UTF-8 My config, as you can see above is UTF-8 The only thing I see that is not UTF-8 is the ruby argument defining encoding.

So even when simplifying things a lot I don't see a proper handling of unicode.

I tried ±, \±, \uc2b1, \xc2\xb1

pierreact avatar Mar 11 '25 09:03 pierreact

@pierreact Thanks for your report! We will check this soon.

daipom avatar Mar 11 '25 09:03 daipom

@pierreact We found that, at least as a Ruby specification, Encoding::CompatibilityError occurs if the target data and the regular expression encoding do not match.

In Fluentd, if we specify a regular expression that contains non-ASCII characters in the parser, this can be a problem. But it looks like it should work fine since you have specified utf-8 for the encoding of in_tail.

We will investigate the correct use of regular expressions containing non-ASCII characters in the parser.

def test_match(target_data, regexp_string)
  puts "=== test_match ===" 
  puts "target_data: #{target_data} (#{target_data.encoding})"
  puts "regexp_string: #{regexp_string} (#{regexp_string.encoding})"
  result = /#{regexp_string}/.match(target_data)
  puts "Success!"
  p result
rescue => e
  puts "Failed!"
  p e
end

test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (UTF-8)
Success!
#<MatchData "う">
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (UTF-8)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (ASCII-8BIT)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)>
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (ASCII-8BIT)
Success!
#<MatchData "\xE3\x81\x86">

daipom avatar Mar 12 '25 02:03 daipom

The hex code of ± should have U+00B1 as UTF8.

Image

Ref. https://www.unicode.org/charts/PDF/U0080.pdf

However, the error log indicates \uFFFD\uFFFD. 🤔

Image

When I debug the value retrived by IO#readpartial, https://github.com/fluent/fluentd/blob/61d7a8efcce8553a1da45f0968a4570d7bb85d62/lib/fluent/plugin/in_tail.rb#L1214

it shows "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\xC2\xB1INFO\xC2\xB1ABCD\xC2\xB1EFGH\xC2\xB1MYTAG\xC2\xB1HERE IS SOME MESSAGE\xC2\xB1\n" and it is correct.

Image

At convert(rbuf), it returns "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD\n" https://github.com/fluent/fluentd/blob/61d7a8efcce8553a1da45f0968a4570d7bb85d62/lib/fluent/plugin/in_tail.rb#L1081

Watson1978 avatar Mar 28 '25 05:03 Watson1978

I think you should set from_encoding UTF-8 at in_tail configuration.

When I add from_encoding UTF-8, your issue has been solved on my environment.

<source>
  @type tail

  path "#{File.expand_path '~/tmp/access*.log'}"
  pos_file "#{File.expand_path '~/tmp/fluentd/access.log.pos'}"

  tag sometag
  from_encoding UTF-8 ## add
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

<match **>
  @type stdout
</match>

Watson1978 avatar Mar 28 '25 06:03 Watson1978

@pierreact Please add from_encoding UTF-8 in your config.

Feel free to re-open if you have anything. Thanks.

Watson1978 avatar Mar 28 '25 06:03 Watson1978

Thanks @pierreact, @Watson1978 !

Looks like this is a general issue of Fluentd:

  • parser_regexp: cannot use non-ascii characters for regexp without setting encoding

(There seems to be a similar problem with regular expressions in Fluentd, not just parser_regexp...)

For in_tail, as @Watson1978 says, we need to set both encoding and from_encoding to avoid this issue, although it is not intended for in_tail.

A more common solution is to specify the encoding to utf-8 before passing the data to the parser, as follows: (using fluent-plugin-record-modifier) (To set the encoding before the parser, use filter_parser instead of parsing in the input plugin)

<source>
  @type tail
  path /path/to/file
  tag test
  read_from_head
  <parse>
    @type none
  </parse>
</source>

<filter test.**>
  @type record_modifier
  char_encoding utf-8
</filter>

<filter test.**>
  @type parser
  key_name message
  <parse>
    @type regexp
    expression /(regexp including non-ascii)/
  </parse>
</filter>

<match test.**>
  @type stdout
</match>

Note: There may be a problem with Fluentd's specification that assumes utf-8 when using non-ascii characters in regular expressions.

daipom avatar Mar 28 '25 07:03 daipom

Reopen. This can be a bug in in_tail.

https://docs.fluentd.org/input/tail#encoding-from_encoding

If encoding is specified, in_tail changes string to encoding.

Looks like this specification is broken since

  • #1409

daipom avatar Jun 10 '25 04:06 daipom

I have confirmed this is a bug in in_tail. Since Fluentd v0.14.12 (#1409), encoding option does not work correctly.

Reproduce

<source>
  @type tail
  tag test
  path /test/fluentd/case/in_tail_encode/test.log
  read_from_head
  encoding utf-8
  <parse>
    @type regexp
    expression /^あ(?<name>.*)お$/
  </parse>
</source>

<match test.**>
  @type stdout
</match>

/test/fluentd/case/in_tail_encode/test.log (UTF-8)

あいうえお

Result

...
2025-06-10 14:06:27 +0900 [info]: #0 starting fluentd worker pid=199409 ppid=199389 worker=0
2025-06-10 14:06:27 +0900 [info]: #0 following tail of /test/fluentd/case/in_tail_encode/test.log
2025-06-10 14:06:27 +0900 [warn]: #0 pattern not matched: "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"
...

Expected

2025-06-10 14:15:06 +0900 [info]: #0 starting fluentd worker pid=202451 ppid=202431 worker=0
2025-06-10 14:15:06 +0900 [info]: #0 following tail of /test/fluentd/case/in_tail_encode/test.log
2025-06-10 14:15:06.132750877 +0900 test: {"name":"いうえ"}

Note

The bug is in v0.14.12. It is too old, making it difficult to compare its behavior strictly. I have compared by only checking out the in_tail.rb.

$ git checkout v0.14.11 -- lib/fluent/plugin/in_tail.rb
$ bundle exec fluentd -c test.conf

Then, I can get the expected result.

daipom avatar Jun 10 '25 05:06 daipom