in_tail: encoding option does not work
Describe the bug
When trying a regex parse, I couldn't match the plus-minus unicode character (±) I ensured my config file, my log file and my config (encoding UTF-8) all pointed towards utf-8.
All those failed to match. ±, \uc2b1, \xc2\xb1 In despair, I'm matching "unknown character" for now (\uFFFD).
In fluentd docker image, I enabled -vv to get some more insight and discovered the following:
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main: cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]
With fluent-package:
# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor
Shall we have something like -Eutf-8:utf-8 defined instead or maybe an easy wait to override it?
Thank you.
To Reproduce
Run a regex matching unicode.
Expected behavior
Should be handling utf-8 by default but this could be subject to taste. Otherwise, document it (Have I missed it?) and provide an easy way to override it.
Your Environment
- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0
Your Configuration
# posfile is disabled on purpose here.
<source>
@type tail
path /var/log/somelogs/*.log
tag sometag
encoding UTF-8
<parse>
@type regexp
expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/
time_format %s.%N
time_key timestamp_unix_and_us
time_type string
keep_time_key true
</parse>
</source>
Your Error Log
2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"
Additional context
Here's a log line:
Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±
Here's my source:
# posfile is disabled on purpose here.
<source>
@type tail
path /var/log/somelogs/*.log
tag sometag
encoding UTF-8
<parse>
@type regexp
expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/
time_format %s.%N
time_key timestamp_unix_and_us
time_type string
keep_time_key true
</parse>
</source>
Here's the error I get:
2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"
If I replace the expression with this, it goes through and is properly split:
expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/
\uFFFD is "unrecognized character".
My fluentd.conf is using UTF-8 My log file is using UTF-8 My config, as you can see above is UTF-8 The only thing I see that is not UTF-8 is the ruby argument defining encoding.
So even when simplifying things a lot I don't see a proper handling of unicode.
I tried ±, \±, \uc2b1, \xc2\xb1
@pierreact Thanks for your report! We will check this soon.
@pierreact We found that, at least as a Ruby specification, Encoding::CompatibilityError occurs if the target data and the regular expression encoding do not match.
In Fluentd, if we specify a regular expression that contains non-ASCII characters in the parser, this can be a problem. But it looks like it should work fine since you have specified utf-8 for the encoding of in_tail.
We will investigate the correct use of regular expressions containing non-ASCII characters in the parser.
def test_match(target_data, regexp_string)
puts "=== test_match ==="
puts "target_data: #{target_data} (#{target_data.encoding})"
puts "regexp_string: #{regexp_string} (#{regexp_string.encoding})"
result = /#{regexp_string}/.match(target_data)
puts "Success!"
p result
rescue => e
puts "Failed!"
p e
end
test_match(
"あいうえお".force_encoding(Encoding::UTF_8),
"う".force_encoding(Encoding::UTF_8)
)
test_match(
"あいうえお".force_encoding(Encoding::ASCII_8BIT),
"う".force_encoding(Encoding::UTF_8)
)
test_match(
"あいうえお".force_encoding(Encoding::UTF_8),
"う".force_encoding(Encoding::ASCII_8BIT)
)
test_match(
"あいうえお".force_encoding(Encoding::ASCII_8BIT),
"う".force_encoding(Encoding::ASCII_8BIT)
)
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (UTF-8)
Success!
#<MatchData "う">
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (UTF-8)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (ASCII-8BIT)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)>
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (ASCII-8BIT)
Success!
#<MatchData "\xE3\x81\x86">
The hex code of ± should have U+00B1 as UTF8.
Ref. https://www.unicode.org/charts/PDF/U0080.pdf
However, the error log indicates \uFFFD\uFFFD. 🤔
When I debug the value retrived by IO#readpartial,
https://github.com/fluent/fluentd/blob/61d7a8efcce8553a1da45f0968a4570d7bb85d62/lib/fluent/plugin/in_tail.rb#L1214
it shows "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\xC2\xB1INFO\xC2\xB1ABCD\xC2\xB1EFGH\xC2\xB1MYTAG\xC2\xB1HERE IS SOME MESSAGE\xC2\xB1\n" and it is correct.
At convert(rbuf), it returns "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD\n"
https://github.com/fluent/fluentd/blob/61d7a8efcce8553a1da45f0968a4570d7bb85d62/lib/fluent/plugin/in_tail.rb#L1081
I think you should set from_encoding UTF-8 at in_tail configuration.
When I add from_encoding UTF-8, your issue has been solved on my environment.
<source>
@type tail
path "#{File.expand_path '~/tmp/access*.log'}"
pos_file "#{File.expand_path '~/tmp/fluentd/access.log.pos'}"
tag sometag
from_encoding UTF-8 ## add
encoding UTF-8
<parse>
@type regexp
expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/
time_format %s.%N
time_key timestamp_unix_and_us
time_type string
keep_time_key true
</parse>
</source>
<match **>
@type stdout
</match>
@pierreact Please add from_encoding UTF-8 in your config.
Feel free to re-open if you have anything. Thanks.
Thanks @pierreact, @Watson1978 !
Looks like this is a general issue of Fluentd:
- parser_regexp: cannot use non-ascii characters for regexp without setting encoding
(There seems to be a similar problem with regular expressions in Fluentd, not just parser_regexp...)
For in_tail, as @Watson1978 says, we need to set both encoding and from_encoding to avoid this issue, although it is not intended for in_tail.
A more common solution is to specify the encoding to utf-8 before passing the data to the parser, as follows:
(using fluent-plugin-record-modifier)
(To set the encoding before the parser, use filter_parser instead of parsing in the input plugin)
<source>
@type tail
path /path/to/file
tag test
read_from_head
<parse>
@type none
</parse>
</source>
<filter test.**>
@type record_modifier
char_encoding utf-8
</filter>
<filter test.**>
@type parser
key_name message
<parse>
@type regexp
expression /(regexp including non-ascii)/
</parse>
</filter>
<match test.**>
@type stdout
</match>
Note: There may be a problem with Fluentd's specification that assumes utf-8 when using non-ascii characters in regular expressions.
Reopen.
This can be a bug in in_tail.
https://docs.fluentd.org/input/tail#encoding-from_encoding
If encoding is specified, in_tail changes string to encoding.
Looks like this specification is broken since
- #1409
I have confirmed this is a bug in in_tail.
Since Fluentd v0.14.12 (#1409), encoding option does not work correctly.
Reproduce
<source>
@type tail
tag test
path /test/fluentd/case/in_tail_encode/test.log
read_from_head
encoding utf-8
<parse>
@type regexp
expression /^あ(?<name>.*)お$/
</parse>
</source>
<match test.**>
@type stdout
</match>
/test/fluentd/case/in_tail_encode/test.log (UTF-8)
あいうえお
Result
...
2025-06-10 14:06:27 +0900 [info]: #0 starting fluentd worker pid=199409 ppid=199389 worker=0
2025-06-10 14:06:27 +0900 [info]: #0 following tail of /test/fluentd/case/in_tail_encode/test.log
2025-06-10 14:06:27 +0900 [warn]: #0 pattern not matched: "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"
...
Expected
2025-06-10 14:15:06 +0900 [info]: #0 starting fluentd worker pid=202451 ppid=202431 worker=0
2025-06-10 14:15:06 +0900 [info]: #0 following tail of /test/fluentd/case/in_tail_encode/test.log
2025-06-10 14:15:06.132750877 +0900 test: {"name":"いうえ"}
Note
The bug is in v0.14.12.
It is too old, making it difficult to compare its behavior strictly.
I have compared by only checking out the in_tail.rb.
$ git checkout v0.14.11 -- lib/fluent/plugin/in_tail.rb
$ bundle exec fluentd -c test.conf
Then, I can get the expected result.