pulumi-aws icon indicating copy to clipboard operation
pulumi-aws copied to clipboard

SQS QueuePolicy update strategy is incorrect

Open SeanSanker opened this issue 2 years ago • 12 comments

What happened?

I recently changed the binding of a parent/child relationship for an aws.sqs.QueuePolicy. When it was updated, it created a replacement and deleted the old one.

Turns out that they’re not distinct entities, they’re just a property of the aws.sqs.Queue.

That means creating a replacement then deleting the prior essential just deleted the property all together, thus breaking our workflow and losing a few hours of messages for the Queue.

This seems to be a bug in the update strategy.

Steps to reproduce

  1. Create a Queue and an attached QueuePolicy.
  2. Deploy them
  3. Rename the QueuePolicy and/or change the { parent } property
  4. Deploy again

Expected Behavior

The QueuePolicy is updated with the new definition, be it in-place, or by a create/delete replacement strategy.

Actual Behavior

  1. During reproduction steps 1-2, the entities are created.
  2. During reproduction step # 4
    1. The strategy chosen is to replace: create the new QueuePolicy, then delete the old one.
    2. The new QueuePolicy is created
    3. The old QueuePolicy is deleted (which is the exact same entity as the new one)
    4. The Queue ends up with no QueuePolicy at all.

Versions used

CLI Version 3.30.0 Go Version go1.18.1 Go Compiler gc

Plugins NAME VERSION nodejs unknown

Host OS darwin Version 12.3.1 Arch arm64

Additional context

The resolution ended up being:

  1. Run pulumi refresh so it detected that the policy had been deleted
  2. Run pulumi up to re-create the entity/property.

Slack discussion here: https://pulumi-community.slack.com/archives/C84L4E3N1/p1650636402241399

The suggestion from @jaxxstorm was to use an alias. See below: ...when you say you changed the parent child relationship, you mean as a resource option? If so, this is expected behaviour and you can use an alias to prevent a replacement

I don't think that using an alias is the correct solution here.

An alias should be used when we want to retain an existing entity for some intentional reason. I think that's a workaround, but not a solution to the core issue that I see: QueuePolicy has a faulty update strategy as it identifies a QueuePolicy as a separate entity rather than an attribute of a Queue

The issue is that the default contract of pulumi's functionality: changing code then performing a deploy should update the state of the environment without breaking it is broken due to this issue.

Docs on policies for SQS Queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-authentication-and-access-control.html#access-control Amazon SQS has its own resource-based permissions system that uses policies written in the same language used for AWS Identity and Access Management (IAM) policies. This means that you can achieve similar things with Amazon SQS policies and IAM policies.

Docs on setting the Policy attribute for a Queue: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_SetQueueAttributes.html

To confirm for yourself that it's a property run the following on a Queue with a Policy set: aws sqs get-queue-attributes --queue-url [some url with a policy] --attribute-names Policy

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

SeanSanker avatar Apr 25 '22 15:04 SeanSanker

I'd also be more than happy to help fix this, but I don't know where to start or how these strategies are defined/implemented. Let me know if someone wants to hop on a call and work through this with me and I'll try and resolve these kinds of issues myself (if possible) in the future.

SeanSanker avatar Apr 25 '22 15:04 SeanSanker

@SeanSanker digging in to this, I think I understand the surface area here and have an explanation for what happened. This first comment will just just attempt to describe the status quo, as I'm not saying the behavior you saw is desirable, it's never great to lose data.

Looking at queuePolicy, there's only one way to rename it, which would be to change that first argument to the resource, what we call a "logical name".

new aws.sqs.QueuePolicy("first-name", 
                       # ^ logical name
  { /* args */ },
  { parent: /* some parent resource */ },
);

I hope I can describe the engine's present behavior and scratch at the surface of why it occurs like this though, from the Resource Names docs:

Each resources in Pulumi has a logical name and a physical name. The logical name is how the resource is known inside Pulumi, and establishes a notion of identity within Pulumi even when the physical resource might need to change (for example, during a replacement). The physical name is the name used for the resource in the cloud provider that a Pulumi program is deploying to.

Pulumi auto-names most resource by default, using the logical name and a random suffix to construct a unique physical name for a resource. Users can provide explicit names to override this default.

Each resource also has a Uniform Resource Name (URN) which is a unique name derived from both the logical name of the resource and the type of the resource and, in the case of components, its parents.

Consider this situation:

pulumi up on this:

const queue1 = new aws.sqs.Queue("first-queue");
new aws.sqs.QueuePolicy("first-queue-policy", { 
  queueUrl: queue1.id, /* args */ 
});

Then we change the program like so:

 const queue1 = new aws.sqs.Queue("first-queue");
-new aws.sqs.QueuePolicy("first-queue-policy", {
+new aws.sqs.QueuePolicy("new-name-for-first-queue-policy", {
   queueUrl: queue1.id, /* args */ 
 });
+
+// A second queue added in the same program:
+const queue2 = new aws.sqs.Queue("second-queue");
+new aws.sqs.QueuePolicy("second-queue-policy", { queueUrl: queue2.id, /* args */ });

The question the engine has to ask the provider:

  1. Do we delete first-queue-policy
  2. Do we update first-queue-policy using the args of new-name-for-first-queue-policy
  3. Do we update first-queue-policy using the args of second-queue-policy

To support all 3 options, you could:

  1. Run the second program as-is. This is the default behavior.
  2. To rename the first queue policy's logical name without deletion, use the alias resource option:
 const queue1 = new aws.sqs.Queue("first-queue");
new aws.sqs.QueuePolicy("new-name-for-first-queue-policy", {
    queueUrl: queue1.id, /* args */ 
  }, {
   aliases: [{ name: 'first-queue-policy' }],
  });
// ... second queue as-is
  1. And likewise for the 3rd case:
new aws.sqs.QueuePolicy("second-queue-policy", { queueUrl: queue2.id, /* args */ }, {
  aliases: [{ name: 'first-queue-policy });

AaronFriel avatar Jul 14 '22 20:07 AaronFriel

@SeanSanker For this second comment, you asked during the community forum whether you could contribute to fix this. I'm sad to say this would be a heck of a change for a first time contributor. I'm not sure if it's impossible though, as you note this is the surprising behavior:

  1. The new QueuePolicy is created
  2. The old QueuePolicy is deleted (which is the exact same entity as the new one)
  3. The Queue ends up with no QueuePolicy at all.

That's very very tricky. "QueuePolicy" is kind of a strange resource, (EDITED) our oldest issues describe this as a "structural resources". It doesn't have a name and it shouldn't really be tracked by its name. It's something we've inherited from another, unnamed infrastructure as code tool. :) It's not really a distinct resource, it's just an alias for a property or set of properties on another resource.

I think that this comment and the many 👍 reactions to it exemplifies why these sorts of resources behave in surprising ways. One way we're fixing this in our native providers is by not modeling these pseudo-resources. As you note, a "Queue" and its "QueuePolicy" are really the same thing, and they should be managed together as a unit. We're doing that in our native provider, aws-native, documented here:

https://www.pulumi.com/registry/packages/aws-native/api-docs/sqs/queue/

We've seen this behavior before with other (EDITED) structural resources, and I wonder if we can improve on this by more eagerly deleting resources safely. I'm going to take this to the team to ask some deeper questions and get back to you.

AaronFriel avatar Jul 14 '22 20:07 AaronFriel

@Frassle got back to me quite quickly and confirmed what I thought. There's a deep vein of related issues we want to tackle and are mulling over the right approach. The oldest issues we've identified are these:

  • https://github.com/pulumi/pulumi/issues/450
  • https://github.com/pulumi/pulumi/issues/918

And more recently and in this repo:

  • https://github.com/pulumi/pulumi-aws/issues/2009
  • https://github.com/pulumi/pulumi/pull/9903
  • https://github.com/pulumi/pulumi/issues/9925

I'm linking to these to ensure that we appropriately cross-reference this with these other issues.

AaronFriel avatar Jul 14 '22 20:07 AaronFriel