fluent-bit
out_azure_kusto : fix multiple files tail issue and timeout issue
- added kusto specific headers
- randomized kusto ingestion resources refresh interval
- made kusto ingestion resources refresh interval configurable
- introduced gzip compression for payload
- added dynamic parsing of azure kusto ingestion resources
- fixed deadlock and added granular locks
- added default kusto endpoints connection timeout interval configs
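The PR text does not show how the refresh interval is randomized. As a rough sketch only, one common approach is to add a bounded random jitter to the configured base interval so that many Fluent Bit instances do not refresh their ingestion resources at the same moment. The helper name `jittered_interval` and both of its parameters are hypothetical and are not taken from the plugin's source.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not the plugin's actual code): returns the base
 * refresh interval plus a random jitter in [0, max_jitter_seconds].
 * Callers are expected to seed the generator once with srand().
 */
static int jittered_interval(int base_seconds, int max_jitter_seconds)
{
    assert(base_seconds > 0);
    assert(max_jitter_seconds >= 0);

    /* rand() % (n + 1) yields a value in [0, n]; n == 0 means no jitter. */
    return base_seconds + rand() % (max_jitter_seconds + 1);
}
```

With a base of 7200 seconds (the `ingestion_resources_refresh_interval` value used in the example configuration) and an assumed jitter cap of 600 seconds, each call would return a value between 7200 and 7800 seconds.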
Enter [N/A] in the box if an item is not applicable to your change.
Testing
Before we can approve your change, please submit the following in a comment:
- [x] Example configuration file for the change
[OUTPUT]
    name                                   azure_kusto
    match                                  *
    tenant_id                              xxxxxxx
    client_id                              xxxxxxx
    client_secret                          xxxxxx
    ingestion_endpoint                     https://ingest-xxxxxx.kusto.windows.net
    database_name                          xxxxx
    table_name                             FluentBit
    ingestion_endpoint_connect_timeout     600
    compression_enabled                    On
    ingestion_resources_refresh_interval   7200
- [ ] Debug log output from testing the change
- [x] Attached Valgrind output that shows no leaks or memory corruption was found
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
- [ ] Run local packaging test showing all targets (including any new ones) build.
- [ ] Set ok-package-test label to test for all targets (requires maintainer to do).
Documentation
- [x] Documentation required for this feature
https://github.com/fluent/fluent-bit-docs/pull/1405/files
Backporting
- [ ] Backport to latest stable release.
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Hi tanmaya, I'm facing an issue with multiple tail inputs. I hope you can help me: https://github.com/fluent/fluent-bit/issues/8419
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Added documentation for the above changes
https://github.com/fluent/fluent-bit-docs/pull/1405/files
@edsiper @patrick-stephens I really need these fixes 😅 , is there anything I can do to help expedite this?
Can you prove they work in an actual environment? We don't have any way to test in anger, so if you can test these manually to add some confidence, that will help.
We will move this to the next milestone, since we need more details about what it does and we need a way to test it.
@tanmaya-panda1 Could you provide a sample configuration that uses these changes, plus Valgrind output showing no memory leaks? That would give reviewers more confidence in this PR.
See one of these PRs for example: https://github.com/fluent/fluent-bit/pull/7663 https://github.com/fluent/fluent-bit/pull/7155
If you provide sample config, I can do quick testing on my end.
Thanks @kforeverisback for your comments.
The valgrind output is as follows
==43298== Memcheck, a memory error detector
==43298== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==43298== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==43298== Command: /home/fluentbitvm/fluent-bit/build/bin/fluent-bit -c fluent-bit.conf
==43298== Parent PID: 43297
==43298==
==43298==
==43298== HEAP SUMMARY:
==43298==     in use at exit: 78 bytes in 2 blocks
==43298==   total heap usage: 3,441,530 allocs, 3,441,528 frees, 5,822,221,780 bytes allocated
==43298==
==43298== 17 bytes in 1 blocks are definitely lost in loss record 1 of 2
==43298==    at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==43298==    by 0x484F949: ???
==43298==    by 0x484E7FF: ???
==43298==    by 0x4BDD49A: ENGINE_ctrl_cmd_string (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4BDCB1F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6B2A9: CONF_modules_load (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6B88E: CONF_modules_load_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6BB3F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4C024F3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x486D4DE: __pthread_once_slow (pthread_once.c:116)
==43298==    by 0x4C6E5AC: CRYPTO_THREAD_run_once (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4C02B77: OPENSSL_init_crypto (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==
==43298== 61 bytes in 1 blocks are definitely lost in loss record 2 of 2
==43298==    at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==43298==    by 0x484FA76: ???
==43298==    by 0x486D4DE: __pthread_once_slow (pthread_once.c:116)
==43298==    by 0x4C6E5AC: CRYPTO_THREAD_run_once (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x484E9E4: ???
==43298==    by 0x4BDDD4A: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4BDD49A: ENGINE_ctrl_cmd_string (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4BDCC86: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6B2A9: CONF_modules_load (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6B88E: CONF_modules_load_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4B6BB3F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==    by 0x4C024F3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==43298==
==43298== LEAK SUMMARY:
==43298==    definitely lost: 78 bytes in 2 blocks
==43298==    indirectly lost: 0 bytes in 0 blocks
==43298==      possibly lost: 0 bytes in 0 blocks
==43298==    still reachable: 0 bytes in 0 blocks
==43298==         suppressed: 0 bytes in 0 blocks
==43298==
==43298== For lists of detected and suppressed errors, rerun with: -s
==43298== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
and sample configs are
[SERVICE]
    Daemon     Off
    Log_Level  trace
    Grace      30
    Flush      5

[INPUT]
    Name   dummy
    Tag    dummy.log
    Rate   1000
    Dummy  {"message": "Hello Fluent Bit!"}

[OUTPUT]
    Name                                azure_kusto
    Match                               *
    tenant_id                           xxxxxx
    client_id                           xxxxx
    client_secret                       xxxxx
    ingestion_endpoint                  https://ingest-xxxxx.kusto.windows.net/
    database_name                       xxx
    table_name                          FluentBit
    ingestion_endpoint_connect_timeout  600
    Retry_Limit                         5
    compression_enabled                 On
sample fluentbit config #2
logLevel: trace
namespace: default
kind: DaemonSet
config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 5
        Log_Level trace
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Port 2020
        Health_Check On
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        DB /var/log/flb_kube.db
        Read_from_Head False
        Mem_Buf_Limit 1000MB
        Skip_Long_Lines On
        Refresh_Interval 1
        Buffer_Max_Size 2MB
        Buffer_Chunk_Size 256k
  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Keep_Log On
        Merge_Log_key parsed_message
        K8S-Logging.Parser Off
        K8S-Logging.Exclude Off
  outputs: |
    [OUTPUT]
        name azure_kusto
        match *
        tenant_id xxxxx
        client_id xxxx
        client_secret xxxxx
        ingestion_endpoint https://ingest-xxxx.kusto.windows.net
        database_name xxx
        table_name FluentBit
        ingestion_endpoint_connect_timeout 600
        Retry_Limit 5
        compression_enabled On
  customParsers: |
    [PARSER]
        Name docker_no_time
        Format json
        Time_Keep On
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
Looks like there are some memory leaks, although they don't appear to come from your portion of the code. Have you run base Fluent Bit (without your code changes) under Valgrind to see whether the same leaks show up?
@kforeverisback I tried using the following config settings
[SERVICE]
    Daemon     Off
    Log_Level  trace
    Grace      30
    Flush      5

[INPUT]
    Name   dummy
    Tag    dummy.log
    Rate   1000
    Dummy  {"message": "Hello Fluent Bit!"}

[OUTPUT]
    Name   stdout
    Match  *
and the valgrind output of the above is clean
==110180== Memcheck, a memory error detector
==110180== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==110180== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==110180== Command: /home/fluentbitvm/fluent-bit/build/bin/fluent-bit -c fluent-bit-blob.conf
==110180== Parent PID: 110179
==110180==
==110180==
==110180== HEAP SUMMARY:
==110180==     in use at exit: 0 bytes in 0 blocks
==110180==   total heap usage: 1,759,601 allocs, 1,759,601 frees, 3,290,316,523 bytes allocated
==110180==
==110180== All heap blocks were freed -- no leaks are possible
==110180==
==110180== For lists of detected and suppressed errors, rerun with: -s
==110180== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Since the leaks reported by Valgrind with the azure_kusto output are not related to the changes, I am wondering what the cause could be. It would be really great, @kforeverisback, if you could help me with this.
@kforeverisback I ran Valgrind on the existing master branch with the following config
[SERVICE]
    Daemon     Off
    Log_Level  trace
    Grace      30
    Flush      5

[INPUT]
    Name   dummy
    Tag    dummy.log
    Rate   1000
    Dummy  {"message": "Hello Fluent Bit!"}

[OUTPUT]
    Name                azure_kusto
    Match               *
    tenant_id           xxxxxx
    client_id           xxxxx
    client_secret       xxxxx
    ingestion_endpoint  https://ingest-xxxxx.kusto.windows.net/
    database_name       xxx
    table_name          FluentBitTemp
and found the leak to be pre-existing
==123992== Memcheck, a memory error detector
==123992== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==123992== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==123992== Command: /home/fluentbitvm/fluent-bit/build/bin/fluent-bit -c fluent-bit-test.conf
==123992== Parent PID: 123991
==123992==
==123992==
==123992== HEAP SUMMARY:
==123992==     in use at exit: 78 bytes in 2 blocks
==123992==   total heap usage: 2,827,613 allocs, 2,827,611 frees, 4,719,153,733 bytes allocated
==123992==
==123992== 17 bytes in 1 blocks are definitely lost in loss record 1 of 2
==123992==    at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==123992==    by 0x484F949: ???
==123992==    by 0x484E7FF: ???
==123992==    by 0x4BDD49A: ENGINE_ctrl_cmd_string (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4BDCB1F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6B2A9: CONF_modules_load (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6B88E: CONF_modules_load_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6BB3F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4C024F3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x486D4DE: __pthread_once_slow (pthread_once.c:116)
==123992==    by 0x4C6E5AC: CRYPTO_THREAD_run_once (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4C02B77: OPENSSL_init_crypto (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==
==123992== 61 bytes in 1 blocks are definitely lost in loss record 2 of 2
==123992==    at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==123992==    by 0x484FA76: ???
==123992==    by 0x486D4DE: __pthread_once_slow (pthread_once.c:116)
==123992==    by 0x4C6E5AC: CRYPTO_THREAD_run_once (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x484E9E4: ???
==123992==    by 0x4BDDD4A: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4BDD49A: ENGINE_ctrl_cmd_string (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4BDCC86: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6B2A9: CONF_modules_load (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6B88E: CONF_modules_load_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4B6BB3F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==    by 0x4C024F3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==123992==
==123992== LEAK SUMMARY:
==123992==    definitely lost: 78 bytes in 2 blocks
==123992==    indirectly lost: 0 bytes in 0 blocks
==123992==      possibly lost: 0 bytes in 0 blocks
==123992==    still reachable: 0 bytes in 0 blocks
==123992==         suppressed: 0 bytes in 0 blocks
==123992==
==123992== For lists of detected and suppressed errors, rerun with: -s
==123992== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
@tanmaya-panda1 I'll test your original config with a Kusto cluster I have up and running sometime tomorrow or next week.
@kforeverisback Hope you are doing good. Any update?
@edsiper @patrick-stephens @kforeverisback @fujimotos @koleini @leonardo-albertovich Could you please let us know if there are any updates on this PR? As you can see in the comments, we have a few customers waiting for the fix before they adopt Fluent Bit in their deployment landscape.
We would highly appreciate it if we could push this forward.
> @edsiper @patrick-stephens I really need these fixes 😅 , is there anything I can do to help expedite this?
>
> Can you prove they work in an actual environment? We don't have any way to test in anger, so if you can test these manually to add some confidence, that will help.
Yes, saw it just now. We are running the binary from that branch and it's working great
Any chance of a bit more detail?
Sure, thanks for the quick reply!
We have been working with @tanmaya-panda1 over the last couple of months to get this plugin to work (and work well at scale). The current version of the Kusto output plugin doesn't work; the fixes in this PR address the connection issues and the deadlock we experienced.
We are running Fluent Bit as a DaemonSet on many clusters, each with anywhere from 30 to 160 nodes, and a fairly high volume of logs: ~60 TB/d (sent, not processed).
Please let me know if you want to know anything else / more specifics