fluent-bit
fluent-bit copied to clipboard
Filter nest/lift is broken on Windows in 2.1.10 and 2.2.0 =>logs corrupted
Bug Report
Describe the bug After upgrading from Fluent Bit 2.1.9 to 2.1.10 the nest/lift filter is no longer working and corrupts the logs
My case are Windows pods in k8s
Fluent Bit reads the log files and adds Kubernetes information. Then it nest/lift the Kubernetes fields
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
DNS_Retries 10
DNS_Wait_Time 30
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
Labels On
Annotations Off
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes
Add_prefix _kubernetes.
The result in cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.10 would look something like this:
{"time"=>"2023-11-06T14:04:33.8847011Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://art.staging.firmglobal.net/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"12622", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"23015ddcfebe4f7e9fc2d180318e0adb", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:04:33.863+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Host={art.staging.firmglobal.net}", "_kubernetes.pod_name�identity3-api-7c5c56447f-hb6rv�|_kubernetes.namespace_n"=>97, 109=>101, "horizons"=>"_kubernetes.pod_id�$09cc3c99-a4ac-4f54-a884-02bf196b5069�<_k", 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97}
As you can see there are some weird characters in the log after doing nest/lift. And also some weird 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97.
All this works fine when I revert the container image to cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.9
This is how the output looks in 2.1.9
{"time"=>"2023-11-06T14:34:58.7061335Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://10.241.74.235/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"34615", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"b3f7e7854ae041ce9bb3baaeeeade543", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:34:58.684+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Connection={close}, Accept={*/*}, Host={10.241.74.235:80}", "_kubernetes.pod_name"=>"identity3-api-7c5c56447f-jkhhf", "_kubernetes.namespace_name"=>"horizons", "_kubernetes.pod_id"=>"b56eef39-ac81-4efc-959f-bea7dd3b0193", "_kubernetes.labels"=>{"app"=>"identity3-api", "app.confirmit.com/template"=>"iis-app", "app.confirmit.com/template-version"=>"11.2.0", "app.kubernetes.io/instance"=>"horizons", "app.kubernetes.io/managed-by"=>"Helm", "app.kubernetes.io/name"=>"identity3-api", "app.kubernetes.io/version"=>"2.24.1753", "helm.sh/chart"=>"identity3-api-2.24.1753", "pod-template-hash"=>"7c5c56447f", "version"=>"2.24.1753"}, "_kubernetes.host"=>"akswinu7f0000t2", "_kubernetes.container_name"=>"identity3-api", "_kubernetes.docker_id"=>"0a4d2404db4070675163eac1bf4cc7062d82ffd54c71fadbc14045436c5cee24", "_kubernetes.container_hash"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api@sha256:0b6349dab35536371cc88385bacc33cfb1138ef9ddc99729bc6efb2a71870f9f", "_kubernetes.container_image"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api:2.24.1753"}
For Linux containers nest/lift works as before after the upgrade. But I need it working for for Windows pods also
Let me know if you need more information to reproduce the issue
anyone? :-)
Confirmed also broken on version 2.2.0. Last working version is 2.1.9
@cosmo0920 @nokute78 Hi. Just trying to tag some of you that contributed to the 2.1.10 release. Could you please take a look at this issue or tag someone that could know more about it. At the moment we are stuck with version 2.1.9 on the Windows nodes in our Kubernetes clusters. Both version 2.10 and 2.2.0 have this issue.
Let me know if I can provider more information to help out
@cosmo0920 I don't have build environment on windows, but I tested it. v2.1.10 also didn't work using following configuration.
[SERVICE]
Log_Level debug
[INPUT]
Name dummy
Dummy {"map_data":{"key":"value", "id":123}}
samples 1
[FILTER]
Name nest
Match *
Operation lift
Nested_under map_data
Add_prefix _map_data.
[OUTPUT]
name stdout
match *
It will fail cause of decoder error and new chunk size is different.
chunk io was updated from v2.1.10. Is it a cause of this issue ?
v2.1.10:
/11/23 09:40:09] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:40:09] [debug] [input chunk] update output instances with new chunk size diff=61, records=1, input=dummy.0
[2023/11/23 09:40:10] [debug] [task] created task=0000027358FB7D50 id=0 OK
[2023/11/23 09:40:10] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/11/23 09:40:10] [error] [output:stdout:stdout.0] Log event decoder error : -12
v2.1.9
[2023/11/23 09:38:41] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:38:41] [debug] [input chunk] update output instances with new chunk size diff=60, records=1, input=dummy.0
[2023/11/23 09:38:42] [debug] [task] created task=00000140C6B53EE0 id=0 OK
[0] dummy.0: [[1700699921.774481500, {}], {"_map_data.key"=>"value", "_map_data.id"=>123}]
@nokute78 Thanks for looking into this and providing a repro
Any idea on how we can get it fixed?
+1 facing this issue as well. Could you please fix? Thank you!
+1 Thank you!
Could someone pick this up, please? 🙏
Note: Removing Add_prefix is working on v2.1.10.
Hmm.
[SERVICE]
Log_Level debug
[INPUT]
Name dummy
Dummy {"map_data":{"key":"value", "id":123}}
samples 1
[FILTER]
Name nest
Match *
Operation lift
Nested_under map_data
# Add_prefix _map_data.
[OUTPUT]
name stdout
match *
I tested out_file to get raw msgpack output. v2.1.10 is wrong str8 for "_map_data.key" that length is 13.
| Version | Raw | Note |
|---|---|---|
| v2.1.9 | ad 5f 6d 61 70 5f 64 61 74 61 2e 6b 65 79 | ad means fixstr and length is 13(0xd) |
| v2.1.10 | d9 22 5f 6d 61 70 5f 64 61 74 61 2e 6b | d9 22 means str8 and length is 34 (0x22) |
Output in hex is v2.1.9:
00000000 dd 00 00 00 02 dd 00 00 00 02 d7 00 00 00 00 02 |................|
00000010 00 61 b0 98 80 df 00 00 00 02 ad 5f 6d 61 70 5f |.a........._map_|
00000020 64 61 74 61 2e 6b 65 79 a5 76 61 6c 75 65 ac 5f |data.key.value._|
00000030 6d 61 70 5f 64 61 74 61 2e 69 64 7b |map_data.id{|
0000003c
v2.1.10:
00000000 dd 00 00 00 02 dd 00 00 00 02 d7 00 00 00 00 02 |................|
00000010 00 41 32 a0 80 df 00 00 00 02 d9 22 5f 6d 61 70 |.A2........"_map|
00000020 5f 64 61 74 61 2e 6b 65 79 a5 76 61 6c 75 65 ba |_data.key.value.|
00000030 5f 6d 61 70 5f 64 61 74 61 2e 69 64 7b |_map_data.id{|
0000003d
[SERVICE]
Log_Level debug
[INPUT]
Name dummy
Dummy {"map_data":{"key":"value", "id":123}}
samples 1
Start_time_sec 1
[FILTER]
Name nest
Match *
Operation lift
Nested_under map_data
Add_prefix _map_data.
[OUTPUT]
Name file
Format msgpack
@leonardo-albertovich Could you check this issue ? This issue was happended from v2.1.10 only on Windows.
We modified log_event_encoder issue for Windows. I think it relates this issue. https://github.com/fluent/fluent-bit/pull/7971 https://fluentbit.io/announcements/v2.1.10/
The issue is that Add_prefix of filter_nest doesn't work.
helper_pack_string_add_prefix may not work that using flb_log_event_encoder.
https://github.com/fluent/fluent-bit/blob/v2.1.10/plugins/filter_nest/nest.c#L196
static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
struct filter_nest_ctx *ctx,
const char *str,
int len)
{
flb_log_event_encoder_append_body_values(
log_encoder,
FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}
The flb_log_event_encoder patch is to fix following flb_log_event_encoder_append_body_values and string issue.
It is similar to helper_pack_string_add_prefix.
https://github.com/fluent/fluent-bit/pull/7736#issuecomment-1712654775
assigned to @leonardo-albertovich
+1
+1 Thank you
@leonardo-albertovich @edsiper Are there any ETA on a fix for this issue? We are still blocked from upgrading fluent-bit on our Windows nodes
Anyone that can pick up this one?
@leonardo-albertovich I think this issue is caused by a type of flb_log_event_encoder_size_t.
It is a size_t* on Windows x64.
Following function adds flb_log_event_encoder_size_t. ctx->prefix_len + len
static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
struct filter_nest_ctx *ctx,
const char *str,
int len)
{
flb_log_event_encoder_append_body_values(
log_encoder,
FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}
ctx->prefix_len is 10 and len is 3 when fluent-bit uses following configuration.
https://github.com/fluent/fluent-bit/issues/8137#issuecomment-1848790674
However ctx->prefix_len + len is 34.
This is caused by 10 + sizeof(size_t) * 3 since flb_log_event_encoder_size_t is a pointer of size_t.
sizeof(size_t) is 8 on Windows x64.
@leonardo-albertovich
Following code is a test code for this issue.
It will output Len: size=34
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <stdint.h>
typedef size_t* flb_log_event_encoder_size_t;
#define FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE 1
#define FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE 2
#define FLB_LOG_EVENT_STRING_LENGTH_VALUE(length) \
(int) FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE, \
(flb_log_event_encoder_size_t) length
#define FLB_LOG_EVENT_STRING_BODY_VALUE(buffer, length) \
(int) FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE, \
(char *) buffer, \
(flb_log_event_encoder_size_t) length
void Test(int loop_cnt, ...)
{
va_list arg;
int type1;
flb_log_event_encoder_size_t size;
char* str;
int i;
va_start(arg, loop_cnt);
for (i = 0; i < loop_cnt; i++) {
printf("i=%d\n", i);
type1 = va_arg(arg, int);
if (type1 != FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE && type1 != FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
printf("%d:type is not length or body\n", i);
}
if (type1 == FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE) {
size = va_arg(arg, flb_log_event_encoder_size_t);
printf("Len: size=%ld\n", size);
}
else if (type1 == FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
str = va_arg(arg, char*);
printf("Body: str=%s\n", str);
size = va_arg(arg, flb_log_event_encoder_size_t);
printf("Body: size=%ld\n", size);
}
}
va_end(arg);
}
int main(void) {
int cnt = 3;
size_t prefix_len = 10;
size_t len = 3;
Test(cnt,
FLB_LOG_EVENT_STRING_LENGTH_VALUE(prefix_len+len),
FLB_LOG_EVENT_STRING_BODY_VALUE("_map_data.", prefix_len),
FLB_LOG_EVENT_STRING_BODY_VALUE("key", len)
);
return 0;
}
I don't have build environment for Windows but I sent a patch https://github.com/fluent/fluent-bit/pull/8454 Could you try it ?
@nokute78 Is there a GitHub build for that so that I can pull a container image with the patch for testing?
@janlonsetteig Sorry I don't know.
@patrick-stephens Any ideas to get a following patched container image for windows ? https://github.com/fluent/fluent-bit/pull/8454 ?
We do not currently build Windows images for PRs (they take an absolute age and Windows runners cost more as well). Potentially you may be able to tweak the PR to include a CI update to build them though.
I'll check if I can trigger a build manually as well, there are some additional workflows for that but not used them for a Windows container before...
@nokute78 if we have a branch on this repo (the main OSS one) then we can trigger a build of all images via: https://github.com/fluent/fluent-bit/actions/workflows/build-branch-containers.yaml
These will then be build and uploaded with a custom tag for people to pull from.
Or you can trigger the same workflow on your side @nokute78? It'll be a public image under ghcr.io/
@patrick-stephens Thank you for information. I tried to run action https://github.com/nokute78/fluent-bit/actions/runs/8012956051
@patrick-stephens Action was failed because of OpenSSL. Could you check it ? https://github.com/nokute78/fluent-bit/actions/runs/8012956051/job/21889201105
Hi!! Recent changes could fix this OpenSSL breakage: https://github.com/fluent/fluent-bit/commit/9652b0dc6fd9e59ef87c3bdc59c89da9abed995b
Could you rebase your branch, @nokute78 ?
@cosmo0920 Thank you for information.
I rebased and run the action. https://github.com/nokute78/fluent-bit/pkgs/container/fluent-bit%2Ftest%2Ffix_8137
@nokute78 I'm running ghcr.io/nokute78/fluent-bit/test/fix_8137:windows-2019-fix_8137 in one of our clusters now. It seems to fix the issue. I see logs working again now with nest/lift
@nokute78 Any progress in verifying and releasing the fix?