
P1 total data loss bug in google_storage_bucket lifecycle rule for days_since_noncurrent_time

Open philip-harvey opened this issue 9 months ago • 15 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version

Terraform v1.8.2 google v5.26.0

Affected Resource(s)

google_storage_bucket

Terraform Configuration

resource "google_storage_bucket" "somename-test" {
  name                        = "somename-test"
  project                     = "projectid"
  location                    = "US-CENTRAL1"
  uniform_bucket_level_access = true
  versioning {
    enabled = true
  }
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      days_since_noncurrent_time = 0
    }
  }
}

Debug Output

No response

Expected Behavior

This should create a lifecycle rule that deletes noncurrent objects 0 days after they become noncurrent

Actual Behavior

Creates a lifecycle rule that deletes ALL objects in the bucket after 0 days, causing total data loss in the bucket

Steps to reproduce

  1. terraform apply

Important Factoids

This is a P1 issue causing total data loss

References

No response

b/339089840

philip-harvey avatar Apr 30 '24 15:04 philip-harvey

Hi @philip-harvey!

According to your configuration and the description in the official documentation of the property lifecycle.rule[].condition.daysSinceNoncurrentTime, if you set this value to 0 then all resources in your bucket will be deleted after 0 days, because every resource is recognized as noncurrent immediately when the condition value is 0. So there is no bug; this is normal behavior.

ggtisc avatar May 06 '24 21:05 ggtisc

Hi @philip-harvey!

According to your configuration and the description in the official documentation of the property lifecycle.rule[].condition.daysSinceNoncurrentTime, if you set this value to 0 then all resources in your bucket will be deleted after 0 days, because every resource is recognized as noncurrent immediately when the condition value is 0. So there is no bug; this is normal behavior.

I'm not following the logic there, if you set this via the console it correctly sets the rule to delete objects 0 days after they become noncurrent, which is correct and expected, but if you set this via Terraform it makes a totally different rule, that just deletes ALL objects in the bucket, both current and noncurrent objects. The documentation doesn't call out this behavior anywhere that I can find, and it's inconsistent with how this works in the GUI, and causes total data loss.

philip-harvey avatar May 06 '24 21:05 philip-harvey

This could be a documentation update of the description of this field (lifecycle.rule[].condition.daysSinceNoncurrentTime) in the API.

ggtisc avatar May 06 '24 21:05 ggtisc

This could be a documentation update of the description of this field (lifecycle.rule[].condition.daysSinceNoncurrentTime) in the API.

I'm still not following this. The documentation says "This condition is satisfied when an object has been noncurrent for more than the specified number of days." The bug is that Terraform creates a lifecycle rule that does not do this: it creates a lifecycle rule that deletes ALL objects, not just noncurrent ones. As I said previously, this works correctly outside of Terraform, and it's a very severe bug that causes total data loss.

philip-harvey avatar May 06 '24 21:05 philip-harvey

This is not a documentation issue, this is a bug, a very major one.

philip-harvey avatar May 06 '24 21:05 philip-harvey

I think this issue is caused by a provider bug that is already solved: https://github.com/hashicorp/terraform-provider-google/issues/14044. You can try setting no_age = true in the lifecycle rule condition block to avoid creating the unexpected age rule. This is the discussion of the solution approach: https://github.com/GoogleCloudPlatform/magic-modules/pull/9512. The problem was with Terraform's handling of zero values for these data types, which is why the solution was implemented this way.
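For reference, a sketch of that workaround applied to the rule from this report (assuming the no_age condition field available in recent versions of the google provider; it suppresses the implicit age = 0 condition):

```hcl
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      no_age                     = true
      days_since_noncurrent_time = 0
    }
  }
```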

kautikdk avatar May 07 '24 08:05 kautikdk

https://github.com/hashicorp/terraform-provider-google/issues/14044 looks like a similar bug, but this bug is not yet fixed in the provider.

philip-harvey avatar May 07 '24 15:05 philip-harvey

Actually it looks like https://github.com/hashicorp/terraform-provider-google/issues/14044 is also not fixed

philip-harvey avatar May 07 '24 15:05 philip-harvey

It is! You can try putting no_age = true in the lifecycle rule condition block; it won't add the unexpected age condition.

kautikdk avatar May 07 '24 15:05 kautikdk

Because all the resources are going to be recognized as noncurrent immediately while the value of the condition is equal to 0.

This shouldn't be the case: objects are determined to be noncurrent when they are replaced by a newer version, per https://cloud.google.com/storage/docs/object-versioning#intro. Live objects should never be deleted by this rule (at least, reading this as a caller of the API).

I suspect this is a side effect of https://github.com/hashicorp/terraform-provider-google/issues/14044 without no_age set. Since Terraform thinks age has a value of zero, we're sending that, and there are two conditions in the rule. That still shouldn't delete current objects, since not all of the conditions would apply: current objects are not noncurrent, and it can't be zero days since they became noncurrent.

Our effective rule as a result is:

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 0
      days_since_noncurrent_time = 0
    }
  }

@philip-harvey do you have the REST response from a Get call on the bucket, specifically the lifecycle block? That's available in the provider's debug logs or you can use the API explorer sidebar at https://cloud.google.com/storage/docs/json_api/v1/buckets/get (or the dozen other ways to call the APIs). Terraform state will also probably be correct, but its inability to handle nil values can make it more confusing than just REST in these zero-versus-nil cases.

rileykarson avatar May 07 '24 16:05 rileykarson

{
  "kind": "storage#bucket",
  "selfLink": "https://www.googleapis.com/storage/v1/b/redacted",
  "id": "redacted",
  "name": "redacted",
  "projectNumber": "redacted",
  "metageneration": "6",
  "location": "US-CENTRAL1",
  "storageClass": "STANDARD",
  "etag": "CAY=",
  "timeCreated": "2024-04-30T15:41:26.753Z",
  "updated": "2024-04-30T15:55:46.810Z",
  "versioning": {
    "enabled": true
  },
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 0
        }
      }
    ]
  },
  "softDeletePolicy": {
    "retentionDurationSeconds": "604800",
    "effectiveTime": "2024-04-30T15:41:26.753Z"
  },
  "iamConfiguration": {
    "bucketPolicyOnly": {
      "enabled": true,
      "lockedTime": "2024-07-29T15:41:26.753Z"
    },
    "uniformBucketLevelAccess": {
      "enabled": true,
      "lockedTime": "2024-07-29T15:41:26.753Z"
    },
    "publicAccessPrevention": "inherited"
  },
  "locationType": "region"
}

philip-harvey avatar May 07 '24 16:05 philip-harvey

Thanks! That's much worse: we're not sending days_since_noncurrent_time at all when it has a value of zero. Combined with age, which does get sent when it has a value of zero, this leads to all objects in the bucket being deleted.

(fyi: I edited your comment to add a code fence, to make it easier to read)

rileykarson avatar May 07 '24 16:05 rileykarson

Thanks! That's much worse: we're not sending days_since_noncurrent_time at all when it has a value of zero. Combined with age, which does get sent when it has a value of zero, this leads to all objects in the bucket being deleted.

(fyi: I edited your comment to add a code fence, to make it easier to read)

I wondered whether, if you send both age = 0 and days_since_noncurrent_time = 0, the API just ignores the days_since_noncurrent_time = 0 part since it's redundant, but I haven't verified this.

philip-harvey avatar May 07 '24 17:05 philip-harvey

That works fine:

  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 0,
          "daysSinceNoncurrentTime": 0
        }
      }
    ]
  },

rileykarson avatar May 07 '24 17:05 rileykarson

@philip-harvey we're shipping a docs change immediately while we figure out the longer-term fix here (@NickElliot and @kautikdk are discussing offline). Mind giving https://github.com/GoogleCloudPlatform/magic-modules/pull/10626 a once-over to see if you think it would have helped in your case / make any suggestions if not?

rileykarson avatar May 08 '24 17:05 rileykarson

This issue really isn't fixed; this is a workaround for the issue. The age field should default to null instead of 0. It's still very likely people will accidentally wipe their buckets because of this bug.
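The zero-versus-null problem described here is a common serialization pitfall in Go, the language the provider is written in. A minimal illustrative sketch (not the provider's actual code): with a plain numeric field and omitempty, an intentional 0 is indistinguishable from "unset" and is silently dropped, while a pointer field distinguishes the two.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// With a plain int64 and omitempty, an intentional 0 is
// indistinguishable from "unset" and is dropped from the payload.
type condInt struct {
	DaysSinceNoncurrentTime int64 `json:"daysSinceNoncurrentTime,omitempty"`
}

// With a pointer, nil means "unset", so a pointer to 0 survives.
type condPtr struct {
	DaysSinceNoncurrentTime *int64 `json:"daysSinceNoncurrentTime,omitempty"`
}

func marshalInt() string {
	b, _ := json.Marshal(condInt{DaysSinceNoncurrentTime: 0})
	return string(b)
}

func marshalPtr() string {
	zero := int64(0)
	b, _ := json.Marshal(condPtr{DaysSinceNoncurrentTime: &zero})
	return string(b)
}

func main() {
	fmt.Println(marshalInt()) // {}
	fmt.Println(marshalPtr()) // {"daysSinceNoncurrentTime":0}
}
```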

philip-harvey avatar Jun 03 '24 19:06 philip-harvey

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Jul 07 '24 02:07 github-actions[bot]