fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

Filter nest/lift is broken on Windows in 2.1.10 and 2.2.0 =>logs corrupted

Open janlonsetteig opened this issue 2 years ago • 29 comments

Bug Report

Describe the bug After upgrading from Fluent Bit 2.1.9 to 2.1.10 the nest/lift filter is no longer working and corrupts the logs

My case are Windows pods in k8s

Fluent Bit reads the log files and adds Kubernetes information. Then it nest/lift the Kubernetes fields

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        DNS_Retries         10
        DNS_Wait_Time       30
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              On
        Annotations         Off
    [FILTER]
        Name                nest
        Match               kube.*
        Operation           lift
        Nested_under        kubernetes
        Add_prefix          _kubernetes.

The result in cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.10 would look something like this:

{"time"=>"2023-11-06T14:04:33.8847011Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://art.staging.firmglobal.net/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"12622", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"23015ddcfebe4f7e9fc2d180318e0adb", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:04:33.863+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Host={art.staging.firmglobal.net}", "_kubernetes.pod_name�identity3-api-7c5c56447f-hb6rv�|_kubernetes.namespace_n"=>97, 109=>101, "horizons"=>"_kubernetes.pod_id�$09cc3c99-a4ac-4f54-a884-02bf196b5069�<_k", 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97}

As you can see there are some weird characters in the log after doing nest/lift. And also some weird 117=>98, 101=>114, 110=>101, 116=>101, 115=>46, 108=>97.

All this works fine when I revert the container image to cr.fluentbit.io/fluent/fluent-bit:windows-2019-2.1.9

This is how the output looks in 2.1.9

{"time"=>"2023-11-06T14:34:58.7061335Z", "stream"=>"stdout", "_p"=>"F", "RequestHttpVerb"=>"GET", "RequestUri"=>"http://10.241.74.235/identity/.well-known/openid-configuration", "ResponseStatusCode"=>"200", "TimeTakenMicros"=>"34615", "Service"=>"Confirmit.Identity", "ServiceVersion"=>"2.24.1753", "CorrelationId"=>"b3f7e7854ae041ce9bb3baaeeeade543", "InitiatingService"=>"Confirmit.Identity", "EventTime"=>"2023-11-06T14:34:58.684+00", "Severity"=>"INFO", "Logger"=>"Confirmit.Logging.Web.AccessLogger", "RequestHeaders"=>"Connection={close}, Accept={*/*}, Host={10.241.74.235:80}", "_kubernetes.pod_name"=>"identity3-api-7c5c56447f-jkhhf", "_kubernetes.namespace_name"=>"horizons", "_kubernetes.pod_id"=>"b56eef39-ac81-4efc-959f-bea7dd3b0193", "_kubernetes.labels"=>{"app"=>"identity3-api", "app.confirmit.com/template"=>"iis-app", "app.confirmit.com/template-version"=>"11.2.0", "app.kubernetes.io/instance"=>"horizons", "app.kubernetes.io/managed-by"=>"Helm", "app.kubernetes.io/name"=>"identity3-api", "app.kubernetes.io/version"=>"2.24.1753", "helm.sh/chart"=>"identity3-api-2.24.1753", "pod-template-hash"=>"7c5c56447f", "version"=>"2.24.1753"}, "_kubernetes.host"=>"akswinu7f0000t2", "_kubernetes.container_name"=>"identity3-api", "_kubernetes.docker_id"=>"0a4d2404db4070675163eac1bf4cc7062d82ffd54c71fadbc14045436c5cee24", "_kubernetes.container_hash"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api@sha256:0b6349dab35536371cc88385bacc33cfb1138ef9ddc99729bc6efb2a71870f9f", "_kubernetes.container_image"=>"confirmithorizonsdev.azurecr.io/confirmit/identity3-api:2.24.1753"}

For Linux containers nest/lift works as before after the upgrade. But I need it working for for Windows pods also

Let me know if you need more information to reproduce the issue

janlonsetteig avatar Nov 06 '23 14:11 janlonsetteig

anyone? :-)

janlonsetteig avatar Nov 07 '23 14:11 janlonsetteig

Confirmed also broken on version 2.2.0. Last working version is 2.1.9

janlonsetteig avatar Nov 20 '23 11:11 janlonsetteig

@cosmo0920 @nokute78 Hi. Just trying to tag some of you that contributed to the 2.1.10 release. Could you please take a look at this issue or tag someone that could know more about it. At the moment we are stuck with version 2.1.9 on the Windows nodes in our Kubernetes clusters. Both version 2.10 and 2.2.0 have this issue.

Let me know if I can provider more information to help out

janlonsetteig avatar Nov 20 '23 12:11 janlonsetteig

@cosmo0920 I don't have build environment on windows, but I tested it. v2.1.10 also didn't work using following configuration.

[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
    Add_prefix _map_data.

[OUTPUT]
    name stdout
    match *

It will fail cause of decoder error and new chunk size is different. chunk io was updated from v2.1.10. Is it a cause of this issue ?

v2.1.10:

/11/23 09:40:09] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:40:09] [debug] [input chunk] update output instances with new chunk size diff=61, records=1, input=dummy.0
[2023/11/23 09:40:10] [debug] [task] created task=0000027358FB7D50 id=0 OK
[2023/11/23 09:40:10] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/11/23 09:40:10] [error] [output:stdout:stdout.0] Log event decoder error : -12

v2.1.9

[2023/11/23 09:38:41] [debug] [filter:nest:nest.0] Lift : Outer map size is 1, will be 2, lifting 1 record(s)
[2023/11/23 09:38:41] [debug] [input chunk] update output instances with new chunk size diff=60, records=1, input=dummy.0
[2023/11/23 09:38:42] [debug] [task] created task=00000140C6B53EE0 id=0 OK
[0] dummy.0: [[1700699921.774481500, {}], {"_map_data.key"=>"value", "_map_data.id"=>123}]

nokute78 avatar Nov 23 '23 01:11 nokute78

@nokute78 Thanks for looking into this and providing a repro

Any idea on how we can get it fixed?

janlonsetteig avatar Nov 24 '23 12:11 janlonsetteig

+1 facing this issue as well. Could you please fix? Thank you!

anasalsalah avatar Dec 01 '23 17:12 anasalsalah

+1 Thank you!

jfaale avatar Dec 08 '23 11:12 jfaale

Could someone pick this up, please? 🙏

maksnester avatar Dec 08 '23 13:12 maksnester

Note: Removing Add_prefix is working on v2.1.10. Hmm.

[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
#    Add_prefix _map_data.

[OUTPUT]
    name stdout
    match *

nokute78 avatar Dec 09 '23 03:12 nokute78

I tested out_file to get raw msgpack output. v2.1.10 is wrong str8 for "_map_data.key" that length is 13.

Version Raw Note
v2.1.9 ad 5f 6d 61 70 5f 64 61 74 61 2e 6b 65 79 ad means fixstr and length is 13(0xd)
v2.1.10 d9 22 5f 6d 61 70 5f 64 61 74 61 2e 6b d9 22 means str8 and length is 34 (0x22)

Output in hex is v2.1.9:

00000000  dd 00 00 00 02 dd 00 00  00 02 d7 00 00 00 00 02  |................|
00000010  00 61 b0 98 80 df 00 00  00 02 ad 5f 6d 61 70 5f  |.a........._map_|
00000020  64 61 74 61 2e 6b 65 79  a5 76 61 6c 75 65 ac 5f  |data.key.value._|
00000030  6d 61 70 5f 64 61 74 61  2e 69 64 7b              |map_data.id{|
0000003c

v2.1.10:

00000000  dd 00 00 00 02 dd 00 00  00 02 d7 00 00 00 00 02  |................|
00000010  00 41 32 a0 80 df 00 00  00 02 d9 22 5f 6d 61 70  |.A2........"_map|
00000020  5f 64 61 74 61 2e 6b 65  79 a5 76 61 6c 75 65 ba  |_data.key.value.|
00000030  5f 6d 61 70 5f 64 61 74  61 2e 69 64 7b           |_map_data.id{|
0000003d
[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Dummy {"map_data":{"key":"value", "id":123}}
    samples 1
    Start_time_sec 1

[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under map_data
    Add_prefix _map_data.

[OUTPUT]
    Name file
    Format msgpack

nokute78 avatar Dec 10 '23 00:12 nokute78

@leonardo-albertovich Could you check this issue ? This issue was happended from v2.1.10 only on Windows.

We modified log_event_encoder issue for Windows. I think it relates this issue. https://github.com/fluent/fluent-bit/pull/7971 https://fluentbit.io/announcements/v2.1.10/

The issue is that Add_prefix of filter_nest doesn't work. helper_pack_string_add_prefix may not work that using flb_log_event_encoder. https://github.com/fluent/fluent-bit/blob/v2.1.10/plugins/filter_nest/nest.c#L196

static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
        struct filter_nest_ctx *ctx,
        const char *str,
        int len)
{
    flb_log_event_encoder_append_body_values(
        log_encoder,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}

The flb_log_event_encoder patch is to fix following flb_log_event_encoder_append_body_values and string issue. It is similar to helper_pack_string_add_prefix. https://github.com/fluent/fluent-bit/pull/7736#issuecomment-1712654775

nokute78 avatar Dec 10 '23 02:12 nokute78

assigned to @leonardo-albertovich

edsiper avatar Dec 11 '23 02:12 edsiper

+1

ThomasRikardsen avatar Dec 11 '23 11:12 ThomasRikardsen

+1 Thank you

TSead avatar Dec 18 '23 20:12 TSead

@leonardo-albertovich @edsiper Are there any ETA on a fix for this issue? We are still blocked from upgrading fluent-bit on our Windows nodes

janlonsetteig avatar Jan 05 '24 08:01 janlonsetteig

Anyone that can pick up this one?

janlonsetteig avatar Jan 23 '24 15:01 janlonsetteig

@leonardo-albertovich I think this issue is caused by a type of flb_log_event_encoder_size_t. It is a size_t* on Windows x64.

Following function adds flb_log_event_encoder_size_t. ctx->prefix_len + len

static void helper_pack_string_add_prefix(struct flb_log_event_encoder *log_encoder,
        struct filter_nest_ctx *ctx,
        const char *str,
        int len)
{
    flb_log_event_encoder_append_body_values(
        log_encoder,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(ctx->prefix_len + len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(ctx->prefix, ctx->prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE(str, len));
}

ctx->prefix_len is 10 and len is 3 when fluent-bit uses following configuration. https://github.com/fluent/fluent-bit/issues/8137#issuecomment-1848790674 However ctx->prefix_len + len is 34.

This is caused by 10 + sizeof(size_t) * 3 since flb_log_event_encoder_size_t is a pointer of size_t. sizeof(size_t) is 8 on Windows x64.

nokute78 avatar Feb 04 '24 08:02 nokute78

@leonardo-albertovich Following code is a test code for this issue. It will output Len: size=34

#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <stdint.h>

typedef size_t* flb_log_event_encoder_size_t;
#define FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE           1
#define FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE             2

#define FLB_LOG_EVENT_STRING_LENGTH_VALUE(length) \
            (int) FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE, \
            (flb_log_event_encoder_size_t) length
#define FLB_LOG_EVENT_STRING_BODY_VALUE(buffer, length) \
            (int) FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE, \
            (char *) buffer, \
            (flb_log_event_encoder_size_t) length
void Test(int loop_cnt, ...)
{
    va_list arg;
    int type1;
    flb_log_event_encoder_size_t size;
    char* str;
    int i;
    va_start(arg, loop_cnt);

    for (i = 0; i < loop_cnt; i++) {
        printf("i=%d\n", i);

        type1 = va_arg(arg, int);
        if (type1 != FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE && type1 != FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
            printf("%d:type is not length or body\n", i);
        }

        if (type1 == FLB_LOG_EVENT_STRING_LENGTH_VALUE_TYPE) {
            size = va_arg(arg, flb_log_event_encoder_size_t);
            printf("Len: size=%ld\n", size);
        }
        else if (type1 == FLB_LOG_EVENT_STRING_BODY_VALUE_TYPE) {
            str = va_arg(arg, char*);
            printf("Body: str=%s\n", str);

            size = va_arg(arg, flb_log_event_encoder_size_t);
            printf("Body: size=%ld\n", size);
            
        }
    }
    va_end(arg);
}

int main(void) {
    int cnt = 3;
    size_t prefix_len = 10;
    size_t len = 3;

    Test(cnt,
        FLB_LOG_EVENT_STRING_LENGTH_VALUE(prefix_len+len),
        FLB_LOG_EVENT_STRING_BODY_VALUE("_map_data.", prefix_len),
        FLB_LOG_EVENT_STRING_BODY_VALUE("key", len)
         );
    return 0;
}

nokute78 avatar Feb 04 '24 08:02 nokute78

I don't have build environment for Windows but I sent a patch https://github.com/fluent/fluent-bit/pull/8454 Could you try it ?

nokute78 avatar Feb 04 '24 11:02 nokute78

@nokute78 Is there a GitHub build for that so that I can pull a container image with the patch for testing?

janlonsetteig avatar Feb 16 '24 11:02 janlonsetteig

@janlonsetteig Sorry I don't know.

@patrick-stephens Any ideas to get a following patched container image for windows ? https://github.com/fluent/fluent-bit/pull/8454 ?

nokute78 avatar Feb 17 '24 02:02 nokute78

We do not currently build Windows images for PRs (they take an absolute age and Windows runners cost more as well). Potentially you may be able to tweak the PR to include a CI update to build them though.

I'll check if I can trigger a build manually as well, there are some additional workflows for that but not used them for a Windows container before...

patrick-stephens avatar Feb 19 '24 10:02 patrick-stephens

@nokute78 if we have a branch on this repo (the main OSS one) then we can trigger a build of all images via: https://github.com/fluent/fluent-bit/actions/workflows/build-branch-containers.yaml

These will then be build and uploaded with a custom tag for people to pull from.

patrick-stephens avatar Feb 19 '24 10:02 patrick-stephens

Or you can trigger the same workflow on your side @nokute78? It'll be a public image under ghcr.io//... but that should not matter.

patrick-stephens avatar Feb 19 '24 10:02 patrick-stephens

@patrick-stephens Thank you for information. I tried to run action https://github.com/nokute78/fluent-bit/actions/runs/8012956051

nokute78 avatar Feb 23 '24 00:02 nokute78

@patrick-stephens Action was failed because of OpenSSL. Could you check it ? https://github.com/nokute78/fluent-bit/actions/runs/8012956051/job/21889201105

nokute78 avatar Feb 23 '24 01:02 nokute78

Hi!! Recent changes could fix this OpenSSL breakage: https://github.com/fluent/fluent-bit/commit/9652b0dc6fd9e59ef87c3bdc59c89da9abed995b

Could you rebase your branch, @nokute78 ?

cosmo0920 avatar Feb 23 '24 04:02 cosmo0920

@cosmo0920 Thank you for information.

I rebased and run the action. https://github.com/nokute78/fluent-bit/pkgs/container/fluent-bit%2Ftest%2Ffix_8137

nokute78 avatar Feb 23 '24 12:02 nokute78

@nokute78 I'm running ghcr.io/nokute78/fluent-bit/test/fix_8137:windows-2019-fix_8137 in one of our clusters now. It seems to fix the issue. I see logs working again now with nest/lift

janlonsetteig avatar Feb 23 '24 15:02 janlonsetteig

@nokute78 Any progress in verifying and releasing the fix?

janlonsetteig avatar Feb 29 '24 09:02 janlonsetteig