pulumi-awsx

`TargetGroup` sometimes does not attach to `ApplicationLoadBalancer`

Open rpmccarter opened this issue 1 year ago • 1 comments

What happened?

I was trying to create a single FargateService with two different TargetGroups attached to an ApplicationLoadBalancer (one tg for HTTP requests, one tg for socket connections). When deployed, one target group simply doesn't attach to the load balancer. What's even more concerning is that, when the exact same code is deployed to a second stack, it attaches just fine. I'm relatively new to Pulumi so there might be something I'm missing, but I assumed identical code should result in identical resources.

I understand this might not be reproducible; I mostly just want to flag that I'm seeing inconsistency between environments, and hopefully get some answers on how this is possible.

Example

Unfortunately, this is part of our private infra so I won't be able to send the entire deploy script, but I'll try to send as much relevant info as possible. Here is the code for the target groups and load balancer:

const serverTg = new aws.lb.TargetGroup(`leaves-server-tg-${stack}`, {
  vpcId: defaultVpc.vpcId,
  stickiness: {
    type: 'lb_cookie',
  },
  port,
  protocol: 'HTTP',
  targetType: 'ip',
  protocolVersion: 'HTTP1',
  healthCheck: {
    path: '/api',
    port: 'traffic-port',
    protocol: 'HTTP',
    matcher: '200',
    enabled: true,
    interval: 60,
    timeout: 30,
  },
});

const socketTg = new aws.lb.TargetGroup(`leaves-socket-tg-${stack}`, {
  vpcId: defaultVpc.vpcId,
  port: 5001,
  protocol: 'HTTP',
  stickiness: {
    type: 'lb_cookie',
  },
  targetType: 'ip',
  protocolVersion: 'HTTP1',
  healthCheck: {
    path: '/api',
    port: `${port}`,
    protocol: 'HTTP',
    matcher: '200',
    enabled: true,
    interval: 60,
    timeout: 30,
  },
});

const lb = new awsx.lb.ApplicationLoadBalancer(`leaves-lb-${stack}`, {
  listeners: [
    {
      port: 443,
      protocol: 'HTTPS',
      certificateArn: lb_cert.arn,
      defaultActions: [
        {
          type: 'forward',
          targetGroupArn: serverTg.arn,
        },
      ],
    },
    {
      port: 8443,
      protocol: 'HTTPS',
      certificateArn: lb_cert.arn,
      defaultActions: [
        {
          type: 'forward',
          targetGroupArn: socketTg.arn,
        },
      ],
    },
  ],
});

And here's the code for the target service:

new awsx.ecs.FargateService(`leaves-server-service-${stack}`, {
  networkConfiguration: {
    assignPublicIp: true,
    securityGroups: [serviceSg.id],
    subnets: defaultVpc.publicSubnetIds,
  },
  cluster: cluster.arn,
  desiredCount: 4,
  taskDefinitionArgs: {
    taskRole: {
      roleArn: role.arn,
    },
    container: {
      name: 'server',
      image: image.imageUri,
      command: ['infisical', 'run', `--env=${stack}`, '--', 'yarn', 'server'],
      cpu: 2 * 1024,
      memory: 4 * 1024,
      environment: serverEnvironment,
      essential: true,
      portMappings: [
        {
          targetGroup: serverTg,
          containerPort: port,
        },
        {
          targetGroup: socketTg,
          containerPort: 5001,
        },
      ],
      healthCheck: {
        command: ['CMD-SHELL', `curl -f http://localhost:${port}/api/ || exit 1`],
        interval: 30,
        timeout: 5,
        retries: 3,
      },
    },
  },
});

Here are the target groups - the relevant ones are selected. Note that leaves-socket-tg-dev has no associated load balancer:

[Screenshot 2024-04-03 at 4.07.27 PM]
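One way to check the same thing without the console is to scan the output of `aws elbv2 describe-target-groups`, whose `TargetGroups` entries each carry a `LoadBalancerArns` list. The following is a minimal sketch, not something from the original report: the helper name `findUnattachedTargetGroups` and the sample data (including the placeholder ARNs) are assumptions made for illustration.

```typescript
// Shape of one entry in the `TargetGroups` array returned by
// `aws elbv2 describe-target-groups` (only the fields used here).
interface TargetGroupDescription {
  TargetGroupName: string;
  TargetGroupArn: string;
  LoadBalancerArns: string[];
}

// Hypothetical helper: return the names of target groups that
// no load balancer references.
function findUnattachedTargetGroups(groups: TargetGroupDescription[]): string[] {
  return groups
    .filter((tg) => tg.LoadBalancerArns.length === 0)
    .map((tg) => tg.TargetGroupName);
}

// Hypothetical sample data mirroring the state described above.
// The ARNs are placeholders, not real resources.
const sampleGroups: TargetGroupDescription[] = [
  {
    TargetGroupName: 'leaves-server-tg-dev',
    TargetGroupArn: 'arn:aws:elasticloadbalancing:region:account:targetgroup/server/placeholder',
    LoadBalancerArns: ['arn:aws:elasticloadbalancing:region:account:loadbalancer/app/lb/placeholder'],
  },
  {
    TargetGroupName: 'leaves-socket-tg-dev',
    TargetGroupArn: 'arn:aws:elasticloadbalancing:region:account:targetgroup/socket/placeholder',
    LoadBalancerArns: [],
  },
];

const unattached = findUnattachedTargetGroups(sampleGroups);
console.log(unattached);
```

Running the same check against each stack after a deploy would show whether the two environments really diverge, and whether the association appears later on its own.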

Output of pulumi about

CLI          
Version      3.112.0
Go Version   go1.22.1
Go Compiler  gc

Plugins
NAME        VERSION
aws         6.28.2
awsx        2.5.0
cloudflare  5.22.0
docker      4.5.3
docker      3.6.1
nodejs      unknown
tls         5.0.1

Host     
OS       darwin
Version  14.4
Arch     arm64

This project is written in nodejs: executable='/Users/rpmccarter/.nvm/versions/node/v20.10.0/bin/node' version='v20.10.0'

Current Stack: Mintlify/leaves/dev

TYPE                                                      URN
[removed]

Found no pending operations associated with dev

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/Mintlify
User           Mintlify
Organizations  Mintlify
Token type     personal

Dependencies:
NAME                VERSION
@pulumi/aws         6.28.2
@pulumi/awsx        2.5.0
@pulumi/cloudflare  5.22.0
@pulumi/pulumi      3.109.0
@pulumi/tls         5.0.1
@types/node         16.18.22
rimraf              5.0.5
typescript          5.3.3

Pulumi locates its logs in /var/folders/dn/z0by0dcj1gnbkjr6_t71hp_m0000gn/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

rpmccarter avatar Apr 03 '24 23:04 rpmccarter

Thanks for reporting this, @rpmccarter. This sounds pretty concerning. To clarify, does the failed state happen sporadically or every single time? Are there no errors reported? Does the condition not resolve after a certain time (5 min later)?

This will be difficult for our team to diagnose, so anything that helps narrow down the repro would be super helpful. If anyone else is running into this, please also let us know what you are observing.

t0yv0 avatar Apr 05 '24 13:04 t0yv0

Any further context you can offer to help us reproduce this, @rpmccarter?

mjeffryes avatar May 20 '24 23:05 mjeffryes

Hey team, I'm fairly confident this is just a symptom of #1253. I'm just now running into a very similar issue, with a Cloudflare Record failing to be created due to a missing field, `lb.loadBalancer.dnsName`. Closing this as a duplicate.

rpmccarter avatar May 22 '24 04:05 rpmccarter
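A missing value such as the `lb.loadBalancer.dnsName` mentioned in the last comment is easier to spot if the program fails fast instead of passing `undefined` on to a dependent resource. The sketch below is plain TypeScript, not a Pulumi API: `requireField` and the `fakeLb` object are hypothetical names used only for illustration. In a Pulumi program, a check like this would presumably run inside an `apply` callback on the output in question.

```typescript
// Hypothetical guard: throw a descriptive error when a required value
// is missing, otherwise return the value unchanged.
function requireField<T>(value: T | undefined | null, name: string): T {
  if (value === undefined || value === null) {
    throw new Error(`Required field ${name} is missing`);
  }
  return value;
}

// Hypothetical stand-in for a load balancer whose DNS name never resolved.
const fakeLb: { loadBalancer: { dnsName?: string } } = {
  loadBalancer: {},
};

let guardMessage = '';
try {
  requireField(fakeLb.loadBalancer.dnsName, 'lb.loadBalancer.dnsName');
} catch (err) {
  // Record the message so the failure names the missing field explicitly.
  guardMessage = err instanceof Error ? err.message : String(err);
}
console.log(guardMessage);
```

This does not address the underlying cause tracked in #1253. It only turns a silent gap into an explicit error that names the field.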