fluent-bit
[windows] upstream connection failed to logs.eu-west-2.amazonaws.com
Bug Report
Describe the bug
Brand new setup.
OS Name: Microsoft Windows Server 2019 Datacenter
OS Version: 10.0.17763 N/A Build 17763
Fluent Bit Version: 1.8.11-win64
After fluent-bit is started with the following command - .\bin\fluent-bit.exe -c .\conf\fluent-bit.conf - the following errors are seen:
[2022/01/24 15:34:22] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream application.C.var.log.containers.deployment-56bcb45977-4x72l_default_container-fcaeb2263bf6a440a2224bc7139cc1c6db99990d1dfe21bf579d9f201b00e2ca.log in loggroup /application
[2022/01/24 15:34:22] [debug] [upstream] connection #996 failed to logs.eu-west-2.amazonaws.com:443
[2022/01/24 15:34:22] [error] [aws_client] connection initialization error
[2022/01/24 15:34:22] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to create log stream
fluent-bit.conf
[SERVICE]
    Flush                     5
    Log_Level                 trace
    Daemon                    off
    Parsers_File              parsers.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/fluent-bit/state/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
application-log.conf
[INPUT]
    Name              tail
    Tag               application.*
    Path              C:\var\log\containers\*.log
    multiline.parser  docker
[OUTPUT]
    Name              cloudwatch_logs
    Match             application.*
    region            eu-west-2
    log_group_name    /application
    log_stream_prefix testing-
    auto_create_group false
    extra_user_agent  container-insights
IAM Role permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Resource": "arn:aws:logs:eu-west-2::*"
    }
  ]
}
IAM Role logs:
[2022/01/25 07:45:28] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2022/01/25 07:45:28] [ warn] [aws_credentials] Failed to initialize profile provider: HOME, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE not set.
[2022/01/25 07:45:28] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2022/01/25 07:45:28] [debug] [aws_credentials] Not initializing ECS Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set
[2022/01/25 07:45:28] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2022/01/25 07:45:28] [debug] [aws_credentials] Sync called on the EC2 provider
[2022/01/25 07:45:28] [debug] [aws_credentials] Init called on the env provider
[2022/01/25 07:45:28] [debug] [aws_credentials] Init called on the EC2 IMDS provider
[2022/01/25 07:45:28] [debug] [aws_credentials] requesting credentials from EC2 IMDS
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_client] (null): http_do=0, HTTP Status: 401
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [imds] using IMDSv2
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_credentials] Requesting credentials for instance role eks-NodeInstanceRole-XXXXXXXXX
[2022/01/25 07:45:28] [debug] [imds] using IMDSv2
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2022/01/25 07:45:28] [debug] [router] match rule tail.0:cloudwatch_logs.0
[2022/01/25 07:45:28] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/01/25 07:45:28] [ info] [sp] stream processor started
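For anyone who wants to reproduce the credential lookup outside Fluent Bit: the 401 followed by "using IMDSv2" in the log above appears consistent with the token-based IMDSv2 flow (request a session token with PUT, then send it as a header on each metadata GET). Below is a minimal Python sketch of that flow. The function name fetch_instance_role_credentials and the base_url parameter are hypothetical and exist only for this illustration; this is not Fluent Bit's implementation.

```python
import json
import urllib.request

IMDS_BASE = "http://169.254.169.254"  # EC2 instance metadata endpoint

# Never route metadata requests through a proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_instance_role_credentials(base_url=IMDS_BASE, timeout=2.0):
    """Hypothetical helper: return (role_name, credentials_dict) via IMDSv2."""
    # Step 1: obtain a session token with a PUT request.
    token_req = urllib.request.Request(
        f"{base_url}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with _opener.open(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    def get(path):
        # Step 2: every metadata GET carries the session token.
        req = urllib.request.Request(
            f"{base_url}{path}",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with _opener.open(req, timeout=timeout) as resp:
            return resp.read().decode()

    # The first line of this listing is the instance profile role name.
    listing = get("/latest/meta-data/iam/security-credentials/")
    role = listing.strip().splitlines()[0]
    creds = json.loads(get(f"/latest/meta-data/iam/security-credentials/{role}"))
    return role, creds
```

Calling fetch_instance_role_credentials() on the EC2 instance should return the role name and a dictionary of temporary credentials; avoid printing the secret fields. If this succeeds, credential retrieval is probably not the failing step.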
- Connectivity to logs.eu-west-2.amazonaws.com:443 is successful. Checked with WSL (netcat) and PowerShell.
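A netcat-style check only confirms that the TCP port is reachable; it does not exercise the TLS handshake. A rough Python sketch that reports the two stages separately is below. The function name probe_endpoint is hypothetical. Note that Python uses its own TLS stack and trust-store loading, so a successful handshake here does not prove that Fluent Bit's handshake would succeed; a failure, however, would point at the network path or certificates rather than at Fluent Bit.

```python
import socket
import ssl


def probe_endpoint(host, port=443, timeout=5.0):
    """Hypothetical helper: check TCP connect and TLS handshake separately."""
    result = {"tcp": False, "tls": False, "error": None}

    # Stage 1: plain TCP connect (what netcat verifies).
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        result["error"] = f"tcp: {exc}"
        return result
    result["tcp"] = True

    # Stage 2: TLS handshake with certificate and hostname verification.
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            result["tls"] = True
            result["version"] = tls_sock.version()
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        result["error"] = f"tls: {exc}"
        sock.close()
    return result
```

Running probe_endpoint("logs.eu-west-2.amazonaws.com") from the affected Windows host would show whether the failure occurs before or during the handshake.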
To Reproduce
- Use above configurations for fluent-bit and run the following command ".\bin\fluent-bit.exe -c .\conf\fluent-bit.conf"
- AWS EC2 machine with instance profile with above permissions
Expected behavior
- Fluent-bit to be able to start, create CloudWatch log groups and streams and send logs.
Actual behavior
- Fluent-bit starts but is unable to create log streams and therefore keeps retrying.
An update: when I use WSL on the Windows server to push a single log, it works fine. However, the log directory structure looks different from inside WSL, and the Linux version of fluent-bit does not support a recursive path pattern such as /some/directory/**/*.log, so I cannot use WSL as a workaround. The test does show that the machine can reach CloudWatch Logs from both a networking and an IAM perspective.
A little more info that might help in troubleshooting this issue. As per the logs I posted in #4727, the error is coming from this function called here. Because -1 is used as the return code for every error, it's not clear exactly which line of code causes the issue. Perhaps adding more debug statements or using different error codes would help?
Please note that this error happens only in the fluent-bit pod's C code. I installed AWS CLI v2 on the pod after it started running and executed the command aws logs create-log-group. That worked without any error, so there is no issue with the pod's network connectivity to AWS STS / CW.
I noticed that you are running at TRACE level, but I don't see much there. Would you be able to share the complete log file? Also, maybe someone else remembers this better, but since there is a socket number in the log line and no indication of a timeout, I'm wondering if this could be related to SSL. I remember there was a PR that improved certificate loading on Windows, but I can't remember which version included it. Would you be able to test the latest version of the 1.8 branch (master would be great too)?
I am seeing this issue as well, running Fluent Bit v1.9.1 in a Windows container based on mcr.microsoft.com/windows/servercore:ltsc2019 on Kubernetes 1.21.
The underlying Windows node is running Windows Server 2019.
My logs include the additional message [tls] error: unexpected EOF, output by this line (introduced in v1.9.0):
https://github.com/fluent/fluent-bit/blob/da871ff712dde152c5325257808e997f336dadc4/src/tls/openssl.c#L488
This issue appears to be similar to #5381.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Please remove stale
I am seeing what appears to be the same issue (I came here from #4727 which has a clearer description of how to repro the problem).
The problem looks similar to #4735, which appears to have been fixed in 1.9.0. I see that a big change has modified src/tls/openssl.c since that time. Is it possible that something regressed this functionality?
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.