
A plan with multiple `elasticsearch.topology` blocks is not idempotent

Open kuisathaverat opened this issue 3 years ago • 11 comments

After executing a terraform apply successfully, applying the same plan again fails with an error related to node type IDs not being present.

Note: the use case is to update the plan before applying again, but it fails even in the basic case of no changes to the plan. Note: a plan with only hot_content nodes does not fail.

Readiness Checklist

  • [X] I am running the latest version
  • [X] I checked the documentation and found no answer
  • [X] I checked to make sure that this issue has not already been filed
  • [X] I am reporting the issue to the correct repository (for multi-repository projects)

Expected Behavior

The second execution of terraform apply should not fail and should apply no changes.

Current Behavior

The second execution of terraform apply fails with the following error:

2021-07-06T20:04:59.274+0200 [TRACE] dag/walk: upstream of "root" errored, so skipping
2021-07-06T20:04:59.274+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2021-07-06T20:04:59.275+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 4
2021-07-06T20:04:59.275+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Error: failed updating deployment: 9 errors occurred:
	* api error: clusters.cluster_invalid_plan: Instance configuration [aws.coordinating.m5d] does not allow usage of node type [ml]. You must either change instance configuration or use only allowed node types [ingest]. (resources.elasticsearch[0].cluster_topology[0].instance_configuration_id)
	* api error: clusters.cluster_invalid_plan: Instance configuration [aws.master.r5d] does not allow usage of node type [data]. You must either change instance configuration or use only allowed node types [master]. (resources.elasticsearch[0].cluster_topology[5].instance_configuration_id)
	* api error: clusters.cluster_invalid_plan: Instance configuration [aws.ml.m5d] does not allow usage of node type [master]. You must either change instance configuration or use only allowed node types [ml]. (resources.elasticsearch[0].cluster_topology[6].instance_configuration_id)
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = cold] (resources.elasticsearch[0])
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = coordinating] (resources.elasticsearch[0])
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = hot_content] (resources.elasticsearch[0])
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = master] (resources.elasticsearch[0])
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = ml] (resources.elasticsearch[0])
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = warm] (resources.elasticsearch[0])



  with ec_deployment.main,
  on main.tf line 33, in resource "ec_deployment" "main":
  33: resource "ec_deployment" "main" {

2021-07-06T20:04:59.300+0200 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
2021-07-06T20:04:59.300+0200 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock
2021-07-06T20:04:59.302+0200 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2021-07-06T20:04:59.306+0200 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/elastic/ec/0.2.1/darwin_amd64/terraform-provider-ec_v0.2.1 pid=41926
2021-07-06T20:04:59.306+0200 [DEBUG] provider: plugin exited

Terraform definition

Steps to Reproduce

  1. Create a main.tf file like the following:
terraform {
  required_version = ">= 0.12.29"

  required_providers {
    ec = {
      source  = "elastic/ec"
      version = "0.2.1"
    }
  }
}

data "vault_generic_secret" "oblt" {
  path = "secret/my-secret"
}

provider "ec" {
  # ECE installation endpoint
  endpoint = "https://ess.example.com"

  # If the ECE installation has a self-signed certificate
  # setting "insecure" to true is required.
  insecure = true

  # APIKey is the recommended authentication mechanism. When
  # Targeting the Elasticsearch Service, APIKeys are the only
  # valid authentication mechanism.
  apikey = data.vault_generic_secret.oblt.data["apiKey"]

  verbose = false
}

# Create an Elastic Cloud deployment
resource "ec_deployment" "main" {
  name = "oblty-test-ddll"

  # Mandatory fields
  region                 = "us-east-1"
  version                = "7.13.1"
  deployment_template_id = "aws-io-optimized-v2"

  elasticsearch {
    topology {
      id = "hot_content"
      zone_count = 1
      size = "4g"
    }
    topology {
      id   = "warm"
      size = "2.0g"
    }
    topology {
      id   = "cold"
      size = "2.0g"
    }
    topology {
      id   = "ml"
      zone_count = 1
      size = "2.0g"
    }
    topology {
      id   = "coordinating"
      zone_count = 1
      size = "2.0g"
    }
    topology {
      id   = "master"
      zone_count = 1
      size = "2.0g"
    }
  }

  kibana {
    topology {
      size = "2g"
    }
  }

  apm {
    topology {
      zone_count = 1
      size = "1g"
    }
  }
}

  2. Execute terraform init
  3. Execute terraform apply -auto-approve=true
  4. After the terraform apply succeeds, execute terraform apply -auto-approve=true again
  5. The apply will fail with the previously described error

Context

We plan to apply updates to the deployment daily. The current version fails both when there are changes in the plan and when there are none. This makes it impossible to migrate to terraform-provider-ec from the tool we currently use (ecl).

Possible Solution

Your Environment

  • Version used: 0.2.1
  • Running against Elastic Cloud SaaS or Elastic Cloud Enterprise and version: 2.9.0-SNAPSHOT
  • Environment name and version (e.g. Go 1.9): 1.16
  • Server type and version: staging
  • Operating System and version: macOS 11.4
  • Link to your project:

kuisathaverat avatar Jul 06 '21 18:07 kuisathaverat

The issue starts happening when you have three or more topology elements:

terraform {
  required_version = ">= 0.12.29"

  required_providers {
    ec = {
      source  = "elastic/ec"
      version = "0.2.1"
    }
  }
}

data "vault_generic_secret" "oblt" {
  path = "secret/my-secret"
}

provider "ec" {
  endpoint = "https://ess.example.com"
  insecure = true
  apikey = data.vault_generic_secret.oblt.data["apiKey"]

  verbose = false
}

# Create an Elastic Cloud deployment
resource "ec_deployment" "main" {
  name = "oblty-test-ddll"

  # Mandatory fields
  region                 = "us-east-1"
  version                = "7.13.1"
  deployment_template_id = "aws-io-optimized-v2"

  elasticsearch {
    topology {
      id = "hot_content"
      zone_count = 1
      size = "4g"
    }
    topology {
      id   = "warm"
      size = "2.0g"
    }
# Adding one more node causes the issue
    topology {
      id   = "cold"
      size = "2.0g"
    }
  }

  kibana {
    topology {
      size = "2g"
    }
  }

  apm {
    topology {
      zone_count = 1
      size = "1g"
    }
  }
}

kuisathaverat avatar Jul 07 '21 08:07 kuisathaverat

@kuisathaverat sorry for the late reply here; I was on PTO and had a couple of high-priority things to work on this week. I'll take a look at this bug as soon as I can.

marclop avatar Jul 16 '21 15:07 marclop

No rush, we have worked around it with the autoscale setting, so we never touch the infra, only the Elastic Stack version or other settings. BTW it works like a charm.

kuisathaverat avatar Jul 16 '21 15:07 kuisathaverat

@kuisathaverat I found out what has been happening for you; let me try to explain it to the best of my ability.

As I explained over Slack, I believe that what you're seeing is due to the topology elements not being specified in alphabetical order. This is a current limitation of the provider, since we're using schema.TypeList for the elasticsearch.topology block; more details below.

There are a few caveats to the way we define our topology elements, needed to work around some limitations of the current Terraform SDK. Since we use the deployment templates as the initial configuration and then allow users to override certain fields on top of it, we have a lot of "computed" and "optional" fields in the schema: a value can either be calculated from what the API returns or specified directly by the user.

Computed and optional fields are great for accommodating the flexibility that our system exposes, but they have some problems. For example, they do not play well when nested within a schema.TypeSet, which is the type we would need to use for the elasticsearch.topology block: schema.TypeSet uses a hashing function to determine the equality of two sets and is unable to compute that hash when computed fields are used (https://github.com/hashicorp/terraform-plugin-sdk/issues/195).

Computed and optional fields have more problems than just the ones described above, and are inherently flawed because the Terraform framework cannot reconcile where a desired value came from (state persistence or user input); this is tied to the way it reconciles and generates the current and saved state. There are a few open issues that dig deep into this behavior (https://github.com/hashicorp/terraform/issues/29028, https://github.com/hashicorp/terraform-plugin-sdk/issues/413, https://github.com/hashicorp/terraform-plugin-sdk/issues/261), and they have prompted HashiCorp and the Terraform maintainer team to start a new Terraform framework project (https://github.com/hashicorp/terraform-plugin-framework) that aims to solve some, if not all, of these problems. However, that project is still in its infancy and cannot be used for a provider that is already public and that our customers are already using.

Wrapping up, this has a few implications for us:

  • We have to maintain the current schema.TypeList for the elasticsearch.topology block and request that our consumers specify their topology blocks in alphabetical order (a sketch of such an ordering follows after this list).
  • To mitigate the ordering issue, we might look into adding some validation to the plan reconciliation and emit an error message if the elasticsearch.topology blocks are not properly sorted. Unfortunately, there are few options to do that, and the native ValidateFunc is ignored on composite types such as schema.TypeList.
  • We’ll keep an eye on the new Terraform framework and might try to switch over to it once it's mature enough and we can ensure that it won’t disrupt our users in a significant way.
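
For illustration, this is how the elasticsearch block from the original report would look with its topology elements reordered alphabetically (cold, coordinating, hot_content, master, ml, warm); the sizes and zone counts are simply the values from the report, not recommendations:

elasticsearch {
  topology {
    id   = "cold"
    size = "2.0g"
  }
  topology {
    id         = "coordinating"
    zone_count = 1
    size       = "2.0g"
  }
  topology {
    id         = "hot_content"
    zone_count = 1
    size       = "4g"
  }
  topology {
    id         = "master"
    zone_count = 1
    size       = "2.0g"
  }
  topology {
    id         = "ml"
    zone_count = 1
    size       = "2.0g"
  }
  topology {
    id   = "warm"
    size = "2.0g"
  }
}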

marclop avatar Jul 22 '21 08:07 marclop

I found a case where, even though you pass the elasticsearch.topology blocks in alphabetical order, the update fails with the same error.

terraform {
  required_version = ">= 0.12.29"

  required_providers {
    ec = {
      source  = "elastic/ec"
      version = "0.2.1"
    }
  }
}

provider "ec" {
  # ECE installation endpoint
  endpoint = "my-endpoint.example.com"

  # If the ECE installation has a self-signed certificate
  # setting "insecure" to true is required.
  insecure = true

  # APIKey is the recommended authentication mechanism. When
  # Targeting the Elasticsearch Service, APIKeys are the only
  # valid authentication mechanism.
  apikey = "FOOOOO"

  verbose = false
}

# Create an Elastic Cloud deployment
resource "ec_deployment" "main" {
  name = "my-cluster"

  # Mandatory fields
  region                 = "gcp-us-central1"
  version                = "7.15.0"
  deployment_template_id = "gcp-io-optimized-v2"

  elasticsearch {
    autoscale = true
    topology {
      id   = "cold"
      size = "2g"
    }
    topology {
      id = "hot_content"
      zone_count = 2
      size = "4g"
    }
    topology {
      id   = "ml"
      size = "1g"
    }
    topology {
      id   = "warm"
      size = "2g"
    }
  }

  kibana {
    topology {
      size = "2g"
    }
  }

  apm {
    topology {
      zone_count = {{ ec_zones }}
      size = "1g"
    }
  }
}
fatal: [localhost]: FAILED! => changed=false 
  command: terraform apply -no-color -input=false -auto-approve=true -lock=true /var/folders/rd/4lpsq9tn2h35ty0nkl9g4hlh0000gn/T/tmpllcrs8s3.tfplan
  msg: |-
    Failure when executing Terraform command. Exited 1.
    stdout: ec_deployment.main: Modifying... [id=e94f6e6832bf4487b02685c15e13c1fd]
  
    stderr:
    Error: failed updating deployment: 1 error occurred:
            * api error: clusters.cluster_invalid_plan: Instance configuration [gcp.ml.1] does not allow usage of node type [data]. You must either change instance configuration or use only allowed node types [ml]. (resources.elasticsearch[0].cluster_topology[6].instance_configuration_id)
  
  
  
      with ec_deployment.main,
      on main.tf line 29, in resource "ec_deployment" "main":
      29: resource "ec_deployment" "main" {

kuisathaverat avatar Sep 20 '21 16:09 kuisathaverat

Hi @marclop

I have a similar issue, still related to the use of schema.TypeList for the elasticsearch.topology block as far as I can tell. Modifying a topology by adding, for instance, a master generates an error even if the topologies are in alphabetical order. As an example, updating this topology:

topology {
  id = "hot_content"
  size = "2g"
  zone_count = 2
}
topology {
  id = "warm"
  size = "2g"
  zone_count = 2
}

to:

topology {
  id = "hot_content"
  size = "2g"
  zone_count = 3
}
topology {
  id = "master"
  size = "1g"
  zone_count = 3
}
topology {
  id = "warm"
  size = "2g"
  zone_count = 3
}

fails with the following error:

Error: failed updating deployment: 3 errors occurred:
	* api error: cluster.missing_dedicated_master: Deployment template [I/O Optimized] requires a dedicated master after [6] nodes. Found [9] nodes in the deployment (resources.elasticsearch[0])
	* api error: clusters.cluster_invalid_plan: Instance configuration [azure.master.e32sv3] does not allow usage of node type [data]. You must either change instance configuration or use only allowed node types [master]. (resources.elasticsearch[0].cluster_topology[5].instance_configuration_id)
	* api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The node_roles in the plan contains values not present in the template. [id = master] (resources.elasticsearch[0])

The only way I've found to make this work is to apply this topology:

topology {
  id = "hot_content"
  size = "2g"
  zone_count = 3
}
topology {
  id = "warm"
  size = "2g"
  zone_count = 3
}
topology {
  id = "master"
  size = "1g"
  zone_count = 3
}

where master is now at the end of the list, after all the existing entries. However, after applying this successfully, I end up having the issue mentioned in the previous comments, and if I run terraform plan I get a new plan to shift my topologies around:

  ~ elasticsearch {
        # (7 unchanged attributes hidden)

      ~ topology {
          ~ id                        = "master" -> "warm"
          ~ size                      = "1g" -> "2g"
            # (5 unchanged attributes hidden)
        }
      ~ topology {
          ~ id                        = "warm" -> "master"
          ~ size                      = "2g" -> "1g"
            # (5 unchanged attributes hidden)

            # (1 unchanged block hidden)
        }

So to fix this, I have to shift my topology entries back into the right alphabetical order, which then gives me the clean, empty plan that I expect:

topology {
  id = "hot_content"
  size = "2g"
  zone_count = 3
}
topology {
  id = "master"
  size = "1g"
  zone_count = 3
}
topology {
  id = "warm"
  size = "2g"
  zone_count = 3
}

Which is not the best user experience, unfortunately. Also, would this really work when scaling down topologies, as removing entries might cause other issues?

I understand from all the comments above that schema.TypeSet is not an option because of all the limitations that come with it, but what about instead using schema.TypeMap with the topology id as key? Do you think that could work?

Apologies if this is not quite the same issue as the one described in the previous comments; it's somewhere between this and issue 415. I have posted it here because the cause of all of them is the same, as you have explained very clearly in your comments. If you want, I can create a separate issue with all this.

luigibk avatar Jan 20 '22 13:01 luigibk

I think the safest way is to always specify all topology elements in alphabetical order and set the ones that are not required to size=0g.

This requires a code change though: It works reliably for me after removing these lines that remove empty topology elements from state: https://github.com/elastic/terraform-provider-ec/blob/master/ec/ecresource/deploymentresource/elasticsearch_flatteners.go#L122

Edit: This solves some, but not all issues. This plugin needs to be ported to Terraform Plugin Framework so this can be completely fixed.
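
For illustration, a minimal sketch of the configuration side of that workaround, assuming a deployment where only hot_content and warm are actually needed: every tier is listed in alphabetical order, and the unused ones (cold and ml here) are pinned to size = "0g". The ids, sizes, and zone counts are placeholders, not recommendations.

elasticsearch {
  topology {
    id   = "cold"
    size = "0g"
  }
  topology {
    id         = "hot_content"
    zone_count = 2
    size       = "4g"
  }
  topology {
    id   = "ml"
    size = "0g"
  }
  topology {
    id   = "warm"
    size = "2g"
  }
}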

pascal-hofmann avatar Aug 09 '22 09:08 pascal-hofmann

Is it expected behavior that terraform plan/apply will still list 0g topologies? I find it confusing as it implies a change to the infrastructure that is not going to happen.

Example:

❯ terraform plan 

// ..

+ elasticsearch {
          + autoscale      = "false"
          // ...

          + topology {
              + config                    = (known after apply)
              + id                        = "cold"
              // ...
              + size                      = "0g"
              + size_resource             = "memory"
              + zone_count                = (known after apply)

              + autoscaling {
                  // ...
                }
            }

timdhoffmann avatar Sep 01 '22 11:09 timdhoffmann

The provider omits topology elements with zero size when it reads state from the backend. However, the configuration can specify a tier with zero size, and that leads to the tier's deletion. The "more intuitive" approach of omitting the deleted tier cannot be used: the provider uses TypeList for the topology attribute (as described here), so if a tier is omitted, all the remaining elements are shifted up.

dimuon avatar Sep 05 '22 13:09 dimuon

Thanks for your reply but I'm afraid I still don't clearly understand if what I am observing is expected or not.

I have a configuration that specifies multiple topologies, e.g. hot_content and warm. The former is set to 1g size and is already provisioned accordingly. The latter is set to 0g size and is not provisioned.

resource "ec_deployment" "elasticsearch_main" {
  # ...
  elasticsearch {
    autoscale = "true"
    topology {
      id         = "hot_content"    # This topology is already provisioned. 'terraform plan' shows no changes to be made.
      size       = "1g"
    }
    topology {
      id         = "warm"
      size       = "0g"                  # This topology is not yet provisioned and I am expecting 'terraform plan' to show no changes to be made, either, because it is set to '0g'.
    }
  }
}

When I run terraform plan it doesn't show any changes for the hot_content topology. That is expected. The configuration matches what is already provisioned. However, it shows a plan to provision the warm, 0g topology:

❯ terraform plan 

// ..

+ elasticsearch {
          + autoscale      = "false"
          // ...

          + topology {
              + config                    = (known after apply)
              + id                        = "warm"
              // ...
              + size                      = "0g"
              + size_resource             = "memory"
              + zone_count                = (known after apply)

              + autoscaling {
                  // ...
                }
            }

That is not what I expected. I don't have any warm topology provisioned yet, and I am expecting the plan to ignore that topology because my configuration sets it to 0g.

Could you help me understand whether what I am observing is expected or not? Will terraform plan always list 0g topologies that are not yet provisioned as new resources that it "thinks it will" provision, even though it will not actually provision those resources?

timdhoffmann avatar Sep 05 '22 14:09 timdhoffmann

TL;DR - it's expected.

When terraform plan is executed, TF calls the provider Read function that gets the actual deployment state. Then TF calculates the diff between config and state.

The provider uses TypeList for the topology attribute type, so TF compares the topology list as a whole, including element order. Basically, there is no way for the provider to check that a specific entry of the list was modified (there is actually a CustomizeDiff attribute that could be helpful, but it cannot be used effectively in this case due to some TF implementation details).

As a rule of thumb, it makes sense to omit empty (zero-sized) topology elements if autoscaling is disabled. Then, once you need to delete a tier, you can set its size to zero and, after a successful terraform apply, remove the tier from the config.
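
For illustration, a minimal sketch of that two-step tier removal, assuming an existing warm tier of size 2g (the tier id and sizes are placeholders):

# Step 1: scale the tier to zero and run terraform apply.
topology {
  id   = "warm"
  size = "0g"
}

# Step 2: once the apply succeeds, delete this topology block from the
# configuration entirely, so subsequent plans stay clean.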

We hope to fix the behavior with the move to Terraform Plugin Framework.

dimuon avatar Sep 05 '22 15:09 dimuon

I just stumbled on this as well with 0.4.1, and it's a big blocker to using the provider, both for testing purposes and production use.

The provider uses TypeList for the topology attribute type

I guess topology can only have 1 entry per id ("hot_content ", "warm", etc...), so could we make this attribute a TypeMap?

EDIT: actually TypeSet instead of TypeMap

The schema reference could be map[string]topology with the id being the key, and thus the comparison on resource reads would match. WDYT?

inge4pres avatar Dec 01 '22 13:12 inge4pres

It looks like TypeSet is not an option - https://github.com/elastic/terraform-provider-ec/issues/336#issuecomment-884749336. Anyway, we hope to fix it with the migration to the TF Plugin Framework.

dimuon avatar Dec 02 '22 15:12 dimuon

Closed by https://github.com/elastic/terraform-provider-ec/pull/567

dimuon avatar Mar 01 '23 15:03 dimuon