lambda-promtail crashing with a panic

Open elliotdobson opened this issue 3 years ago • 7 comments

Describe the bug

lambda-promtail crashes with a panic when trying to pull AWS load balancer logs from S3.

I am using a slightly modified version of the lambda-promtail CloudFormation template to deploy the AWS resources (the original is missing the S3 permissions; compare the Terraform template). The function pushes logs to a Promtail instance protected by basic authentication, which then forwards the logs to our Loki instance.

To Reproduce

Steps to reproduce the behavior:

  1. Create private ECR repository, pull public.ecr.aws/grafana/lambda-promtail:2.5.0-amd64 and push to private ECR repository.
  2. Use lambda-promtail CloudFormation template to create AWS resources.
  3. Create S3 Event Notification to run lambda-promtail function on all object create events in the AWS load balancer log S3 bucket.
  4. The lambda-promtail function fails with the log below.

Expected behavior

lambda-promtail gets the log from S3 and sends it to Promtail/Loki.

Environment:

  • Infrastructure: AWS Lambda using public.ecr.aws/grafana/lambda-promtail:2.5.0-amd64 container image
  • Deployment tool: AWS CloudFormation

Screenshots, Promtail config, or terminal output

Logs from the Lambda function:

2022-06-22T12:58:07.336+12:00 | START RequestId: 7ed4d51b-2459-48b2-8336-16742c131614 Version: $LATEST
2022-06-22T12:58:07.336+12:00 | write address: https://{REDACTED}/aws-lb
2022-06-22T12:58:07.336+12:00 | keep stream: false
2022-06-22T12:58:07.630+12:00 | 2022-06-22 00:58:07.630385 I | calling the handler function resulted in a panic, the process should exit
2022-06-22T12:58:07.632+12:00 | END RequestId: 7ed4d51b-2459-48b2-8336-16742c131614
2022-06-22T12:58:07.632+12:00 | REPORT RequestId: 7ed4d51b-2459-48b2-8336-16742c131614 Duration: 292.89 ms Billed Duration: 2235 ms Memory Size: 128 MB Max Memory Used: 33 MB Init Duration: 1941.68 ms
2022-06-22T12:58:07.632+12:00 | Unknown application error occurred

elliotdobson avatar Jun 22 '22 01:06 elliotdobson

Assigning @cstyan to have a look when they get some time; I believe they wrote this code

dannykopping avatar Jun 22 '22 07:06 dannykopping

@elliotdobson can you try a newer version of the lambda-promtail image? We don't really version lambda-promtail the way we do Loki releases, so the existence of a 2.5.0 tag doesn't mean it's a stable release like Loki 2.5.0. There could have been a bug in the 2.5 version. The ECR repo with images is here: https://gallery.ecr.aws/grafana/lambda-promtail

On top of that, comparing the merge of the S3 feature for lambda-promtail with the branch that Loki 2.5.0 is based on, the lambda-promtail S3 support hadn't yet been merged at that point. So again, I think upgrading your image version will help.

The last suspicious thing is that your Lambda logging states that there's a panic but doesn't provide the Go stack trace from the panic.

cstyan avatar Jun 22 '22 23:06 cstyan

Hey @cstyan. I tried using public.ecr.aws/grafana/lambda-promtail@sha256:db33e17246d4e713d743717a20ea1757534e58c26220cb2d22e1ce489bf3f697, which was the latest main image at the time, but it gave a different error:

2022-06-29T11:05:32.561+12:00 | START RequestId: 5de33725-3312-4930-bcdc-eff2d08e7f82 Version: $LATEST 
2022-06-29T11:05:32.563+12:00 | IMAGE Launch error: fork/exec /app/main: exec format error Entrypoint: [/app/main] Cmd: [] WorkingDir: [/app]
2022-06-29T11:05:32.569+12:00 | END RequestId: 5de33725-3312-4930-bcdc-eff2d08e7f82

Here's a redacted version of the AWS CloudFormation template I'm using:

AWSTemplateFormatVersion: "2010-09-09"
Description: Creates AWS resources for lambda-promtail
Parameters:
  LambdaPromtailPassword:
    Description: The basic auth password, for the external-promtail endpoint.
    Type: String
    Default: ""
    NoEcho: true
Outputs:
  LambdaPromtailFunction:
    Description: Lambda Promtail Function ARN
    Value: !GetAtt LambdaFunctionLambdaPromtail.Arn
    Export:
      Name: lambda-promtail
Resources:
  IamRoleLambdaPromtail:
    Type: AWS::IAM::Role
    Properties:
      RoleName: lambda-promtail
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - lambda.amazonaws.com
          Action:
          - sts:AssumeRole
      Policies:
      - PolicyName: lambda-logs
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          -
            Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: arn:aws:logs:*:*:*
          -
            Effect: Allow
            Action: 's3:GetObject'
            Resource: 'arn:aws:s3:::aws-elb-logs/*'
  LambdaFunctionLambdaPromtail:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: "lambda-promtail"
      Code:
        ImageUri: "{PRIVATE_ECR}/lambda-promtail:main"
      MemorySize: 128
      PackageType: Image
      Timeout: 60
      Role: !GetAtt IamRoleLambdaPromtail.Arn
      ReservedConcurrentExecutions: 2
      Environment:
        Variables:
          WRITE_ADDRESS: "https://{REDACTED}/aws-lb"
          USERNAME: "lambda-promtail"
          PASSWORD: !Ref LambdaPromtailPassword
          KEEP_STREAM: "false"
          EXTRA_LABELS: ""
          TENANT_ID: ""
  LambdaPermissionLambdaPromtail:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt LambdaFunctionLambdaPromtail.Arn
      Action: lambda:InvokeFunction
      Principal: s3.amazonaws.com
      SourceAccount: !Ref 'AWS::AccountId'
      SourceArn: 'arn:aws:s3:::aws-elb-logs'

The S3 Bucket CloudFormation template that the load balancer logs are in looks like:

AWSTemplateFormatVersion: "2010-09-09"
Description: Creates AWS resources for AWS load balancer logs
Resources:
  S3BucketAwsElbLogs:
    Type: AWS::S3::Bucket
    DeletionPolicy: "Retain"
    Properties:
      BucketName: "aws-elb-logs"
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: "AES256"
      LifecycleConfiguration:
        Rules:
          - Id: Delete-Logs-After-30Days
            Status: Enabled
            ExpirationInDays: 30
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: "s3:ObjectCreated:*"
            Filter:
              S3Key:
                Rules:
                  - Name: "prefix"
                    Value: "ingress/"
                  - Name: "suffix"
                    Value: ".gz"
            Function: !ImportValue lambda-promtail

elliotdobson avatar Jun 28 '22 23:06 elliotdobson

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Aug 13 '22 19:08 stale[bot]

Hey @cstyan, I think the error in my previous comment may have been caused by trying to run an arm64 image on amd64 lambda.

I've just tested again using the latest main image (public.ecr.aws/grafana/lambda-promtail:main-2766e0a) on both arm64 and amd64 lambdas, but I am still getting the same error as in my original post.

2022-09-16T10:33:37.931+12:00 | write address: https://{REDACTED}/aws-lb
2022-09-16T10:33:37.931+12:00 | keep stream: false
2022-09-16T10:38:37.134+12:00 | START RequestId: 6553c683-f97e-4785-bd4a-bf656c6e70a7 Version: $LATEST
2022-09-16T10:38:37.397+12:00 | 2022-09-15 22:38:37.397454 I | calling the handler function resulted in a panic, the process should exit
2022-09-16T10:38:37.414+12:00 | END RequestId: 6553c683-f97e-4785-bd4a-bf656c6e70a7
2022-09-16T10:38:37.414+12:00 | REPORT RequestId: 6553c683-f97e-4785-bd4a-bf656c6e70a7 Duration: 278.55 ms Billed Duration: 639 ms Memory Size: 128 MB Max Memory Used: 32 MB Init Duration: 360.10 ms
2022-09-16T10:38:37.414+12:00 | Unknown application error occurred

Any suggestions would be much appreciated!

elliotdobson avatar Sep 15 '22 22:09 elliotdobson

Upon further investigation of the lambda-promtail code, it looks like the S3 ingester was developed with AWS Application Load Balancers in mind. However, AWS Network Load Balancers and AWS Classic Load Balancers also output access logs to S3. Unfortunately, both the file names and the log formats differ for each type of load balancer.

For reference I am using AWS Network Load Balancers.

I've managed to fix the panic in this commit by updating the regexes used to match the file name of the log and the timestamp in the log line (the regexes should now support all three types of load balancers).
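
For illustration, here is a rough Go sketch of how a single pattern can cover the file names of all three load balancer types. It is not the regex from the commit: the pattern, the names lbKeyRe, parseKey and lbType, and the sample keys are assumptions based on the key layout AWS documents for access logs (Application and Network Load Balancer names carry an "app." or "net." prefix, Classic ones carry neither). A key that does not match yields an error rather than a panic.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lbKeyRe is an illustrative pattern, NOT the regex from the commit.
// Assumed key layout:
//   [prefix/]AWSLogs/<account>/elasticloadbalancing/<region>/<yyyy>/<mm>/<dd>/
//   <account>_elasticloadbalancing_<region>_<lb>_<end-time>_...
var lbKeyRe = regexp.MustCompile(
	`AWSLogs/(?P<account>\d+)/elasticloadbalancing/(?P<region>[a-z0-9-]+)/` +
		`\d{4}/\d{2}/\d{2}/\d+_elasticloadbalancing_[a-z0-9-]+_` +
		`(?P<lb>[A-Za-z0-9.-]+)_\d{8}T\d{4}Z_`)

// lbType guesses the load balancer type from the name prefix (hypothetical helper).
func lbType(lb string) string {
	switch {
	case strings.HasPrefix(lb, "app."):
		return "application"
	case strings.HasPrefix(lb, "net."):
		return "network"
	default:
		return "classic"
	}
}

// parseKey extracts account, region, lb and type from an S3 object key.
// It returns an error for keys that do not match instead of panicking.
func parseKey(key string) (map[string]string, error) {
	m := lbKeyRe.FindStringSubmatch(key)
	if m == nil {
		return nil, fmt.Errorf("key %q does not look like a load balancer access log", key)
	}
	labels := make(map[string]string)
	for i, name := range lbKeyRe.SubexpNames() {
		if name != "" {
			labels[name] = m[i]
		}
	}
	labels["type"] = lbType(labels["lb"])
	return labels, nil
}

func main() {
	// Made-up sample keys, one per load balancer type.
	dir := "ingress/AWSLogs/123456789012/elasticloadbalancing/us-east-2/2022/09/16/"
	pre := "123456789012_elasticloadbalancing_us-east-2_"
	keys := []string{
		dir + pre + "app.my-lb.1234567890abcdef_20220916T0400Z_172.16.0.1_abcd1234.log.gz",
		dir + pre + "net.my-lb.1234567890abcdef_20220916T0400Z_abcd1234.log.gz",
		dir + pre + "my-lb_20220916T0400Z_172.16.0.1_abcd1234.log",
	}
	for _, k := range keys {
		labels, err := parseKey(k)
		fmt.Println(labels, err)
	}
}
```

Using named groups keeps the label extraction independent of group order, so the pattern can be adjusted without touching parseKey.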

However I am still not receiving any logs into Loki. The lambda-promtail logs now look like:

2022-09-16T16:01:26.278+12:00 | write address: https://{REDACTED}/aws-lb
2022-09-16T16:01:26.278+12:00 | keep stream: false
2022-09-16T16:01:26.280+12:00 | START RequestId: a5cba339-b29b-44db-baa3-447331c25d84 Version: $LATEST
2022-09-16T16:01:26.587+12:00 | END RequestId: a5cba339-b29b-44db-baa3-447331c25d84
2022-09-16T16:01:26.587+12:00 | REPORT RequestId: a5cba339-b29b-44db-baa3-447331c25d84 Duration: 306.43 ms Billed Duration: 393 ms Memory Size: 128 MB Max Memory Used: 32 MB Init Duration: 86.48 ms
2022-09-16T16:01:31.873+12:00 | START RequestId: b7db0647-758d-441d-91f9-1f5f2b00f0eb Version: $LATEST
2022-09-16T16:01:31.894+12:00 | END RequestId: b7db0647-758d-441d-91f9-1f5f2b00f0eb
2022-09-16T16:01:31.894+12:00 | REPORT RequestId: b7db0647-758d-441d-91f9-1f5f2b00f0eb Duration: 19.39 ms Billed Duration: 20 ms Memory Size: 128 MB Max Memory Used: 32 MB
2022-09-16T16:03:36.592+12:00 | START RequestId: 17b7a6d3-4319-4500-b70a-dd16cf8095a5 Version: $LATEST
2022-09-16T16:03:36.691+12:00 | END RequestId: 17b7a6d3-4319-4500-b70a-dd16cf8095a5
2022-09-16T16:03:36.691+12:00 | REPORT RequestId: 17b7a6d3-4319-4500-b70a-dd16cf8095a5 Duration: 97.80 ms Billed Duration: 98 ms Memory Size: 128 MB Max Memory Used: 33 MB

Is there any debugging logging in lambda-promtail that I can enable?

elliotdobson avatar Sep 16 '22 04:09 elliotdobson

Since my last comment I figured out that AWS Network Load Balancer log timestamps are not RFC3339 compatible, as they don't contain the time zone they were recorded in. However, they are recorded in UTC.
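
A minimal Go sketch of one way to handle both timestamp styles: try RFC3339 first and, if that fails, parse the zone-less form as UTC. The names parseLBTimestamp and noZoneLayout and the sample values are hypothetical and not taken from the lambda-promtail code; the zone-less layout is an assumption about what the Network Load Balancer timestamps look like.

```go
package main

import (
	"fmt"
	"time"
)

// noZoneLayout is an assumed layout for timestamps that carry no time zone.
const noZoneLayout = "2006-01-02T15:04:05"

// parseLBTimestamp accepts RFC3339 timestamps and, as a fallback,
// zone-less timestamps, which it interprets as UTC.
func parseLBTimestamp(s string) (time.Time, error) {
	if ts, err := time.Parse(time.RFC3339, s); err == nil {
		return ts, nil
	}
	return time.ParseInLocation(noZoneLayout, s, time.UTC)
}

func main() {
	// Made-up sample values: one with a zone designator, one without.
	for _, s := range []string{"2022-09-16T04:01:26.278000Z", "2022-09-16T04:01:26"} {
		ts, err := parseLBTimestamp(s)
		fmt.Println(ts, err)
	}
}
```

Returning the parse error to the caller, rather than discarding it, makes an unexpected format visible instead of silently dropping the line.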

I've created PR #7194 which fixes the regex and timestamp issues I found regarding ingesting AWS Network Load Balancers logs.

I was able to figure this out by replacing the return err lines with fmt.Println(err) so that the error was printed out in the logs. I'm not sure why the error was not being output by the lambda runtime in my case.

elliotdobson avatar Sep 19 '22 01:09 elliotdobson