
[ISSUE] Issue with `databricks_mws_workspaces` resource

Open devlucasc opened this issue 3 weeks ago • 2 comments

Configuration

resource "databricks_mws_workspaces" "this" {
  provider                                 = databricks.accounts
  account_id                               = ...
  aws_region                               = ....
  workspace_name                           = ...
  deployment_name                          = ....
  pricing_tier                             = "ENTERPRISE"
  expected_workspace_status                = "PROVISIONING"
  credentials_id                           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id                 = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id                               = databricks_mws_networks.this.network_id
  managed_services_customer_managed_key_id = databricks_mws_customer_managed_keys.managed_services.customer_managed_key_id
  private_access_settings_id               = databricks_mws_private_access_settings.pas.private_access_settings_id
}

Expected Behavior

  • Update the documentation to reflect that this field affects AWS too.
  • It should be possible to run a plan without expecting "PROVISIONING" as the status once the workspace has already been created.
  • A variable should be exposed to let the user choose whether to verify reachability, allowing the call to the function at https://github.com/databricks/terraform-provider-databricks/blob/main/mws/resource_mws_workspaces.go#L207 to be skipped.

Actual Behavior

The field expected_workspace_status is documented as exclusive to GCP, but it affects AWS as well. Subsequent plans fail because the workspace status is now RUNNING, which produces the error `Error: cannot read mws workspaces: Workspace is running`. When the field is instead set to expect RUNNING, the provider tries to verify workspace reachability, which may fail due to DNS caching or when private links are used without connectivity from the runner. Expecting the RUNNING state makes sense when applying workspace-level resources, but it is unnecessary when only account-level resources are applied, and there is no way to control this behavior. The read then fails after the default timeout of 20 minutes. To retry, the workspace must either be deleted manually so it can be recreated, or imported into the state to avoid a resource conflict.

Steps to Reproduce

  • terraform plan

Terraform and provider versions

Terraform 1.7.5, databricks/databricks provider 1.97.0

Is it a regression?

Debug Output

Important Factoids

Would you like to implement a fix?

devlucasc · Dec 09 '25

I found a workaround to avoid waiting for reachability by setting the expected status to PROVISIONING and adding a time_sleep resource like this:


resource "time_sleep" "wait_workspace" {
  depends_on = [
    databricks_mws_workspaces.this
  ]
  triggers = {
    workspace_id = databricks_mws_workspaces.this.workspace_id
  }
  create_duration = "300s"
}

resource "databricks_metastore_assignment" "this" {
  provider     = databricks.accounts
  metastore_id = ....
  workspace_id = databricks_mws_workspaces.this.workspace_id

  depends_on = [
    time_sleep.wait_workspace
  ]
}

But it still fails for subsequent plans, and it is probably not a good way to control resource creation because the required interval may vary.

devlucasc · Dec 09 '25

Can someone review and contribute to this PR fix: Accept running as valid wait state #5268? Thanks

devlucasc · Dec 09 '25

Adding more detailed information:

Summary

The Read function for the workspace resource always calls WaitForExpectedStatus, regardless of cloud provider or resource usage. This creates a series of problems when the workspace is already in RUNNING status but Terraform does not actually need to reach the workspace (e.g., blueprints that only provision account-level resources).

Because the provider unconditionally performs a reachability check whenever the expected status is RUNNING, Terraform can fail even when the workspace is fully deployed and healthy in the Accounts UI.


What the provider currently does

1. Read always calls WaitForExpectedStatus

expectedStatus := d.Get("expected_workspace_status").(string)
err = workspacesAPI.WaitForExpectedStatus(workspace, expectedStatus, d.Timeout(schema.TimeoutRead))

There is no cloud-specific logic such as:

if gcp {
  WaitForExpectedStatus(...)
}

So the wait-and-reachability logic runs on AWS, GCP, and Azure equally.


2. When expected_workspace_status = RUNNING, reachability is enforced

Inside WaitForExpectedStatus:

case expectedStatus:
    if expectedStatus == WorkspaceStatusRunning {
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil

Meaning:

  • If the workspace reaches RUNNING,
  • The provider still calls verifyWorkspaceReachable, regardless of whether reachability is needed for the Terraform plan.

This is a problem when the Terraform runner cannot yet resolve the workspace DNS (e.g., stale DNS cache).


The problem

When the workspace is already RUNNING in the Accounts UI:

  • The provider reaches the success branch…
  • …but verifyWorkspaceReachable fails due to DNS cache or network conditions.
  • This causes the whole Read to fail with a timeout, even though the workspace is actually ready and healthy.

As a result:

Terraform state becomes inconsistent

  • Terraform believes the workspace failed or does not exist.

  • The workspace does exist in Databricks Accounts.

  • You must either:

    • manually import the workspace into state or
    • delete the workspace to recreate it and avoid “workspace already exists” conflicts.

Why setting expected_workspace_status = PROVISIONING is not a workaround

If I set:

expected_workspace_status = "PROVISIONING"

then the API returns the actual status of the already-created workspace:

workspace.WorkspaceStatus = "RUNNING"

But since RUNNING != PROVISIONING, the provider falls into the default branch:

default:
    return resource.RetryableError(...)

It retries until timeout and still fails.

So:

  • RUNNING expected → fails due to reachability
  • PROVISIONING expected → fails because RUNNING is “unexpected”

No configuration avoids the failure.
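
Putting the two quoted fragments together, the status handling behaves roughly as follows. This is a simplified reconstruction for illustration only; other cases and the exact error messages are omitted.

// Simplified sketch of the status switch inside WaitForExpectedStatus,
// reconstructed from the fragments quoted above (not the exact source).
switch workspace.WorkspaceStatus {
case expectedStatus:
    if expectedStatus == WorkspaceStatusRunning {
        // A matching RUNNING status still triggers a network probe of the
        // workspace URL, which can fail on runners with stale DNS or
        // without private connectivity.
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil
default:
    // RUNNING while PROVISIONING was expected lands here: it is treated as
    // "not the expected state yet" and retried until the read timeout.
    return resource.RetryableError(fmt.Errorf("workspace is in status %s, expected %s",
        workspace.WorkspaceStatus, expectedStatus))
}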


Why this is incorrect behavior

  • Many Terraform configurations only manage account-level resources.
  • In these cases, workspace reachability should never be required.
  • A DNS cache lasting longer than Terraform’s timeout should not break workspace creation.
  • WaitForExpectedStatus is enforced on all cloud providers and in all use cases, so it is not exclusive to GCP as the documentation says.

Proposed Fixes (any of the following would resolve the issue)

✔ Option A — Skip reachability check for account-level-only use cases

verifyWorkspaceReachable should not run when no workspace-level resources are referenced.

✔ Option B — Make reachability optional via a provider flag

Example:

provider "databricks" {
  skip_workspace_reachability = true
}
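
On the provider side, such a flag would only need to gate the reachability probe. Below is a minimal sketch of the branch quoted earlier, assuming a hypothetical skipReachability value plumbed in from the provider or resource configuration; neither the flag name nor this wiring exists in the provider today.

case expectedStatus:
    // skipReachability is a hypothetical opt-out plumbed from the provider
    // or resource configuration; it does not exist in the provider today.
    if expectedStatus == WorkspaceStatusRunning && !skipReachability {
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil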

✔ Option C — Treat RUNNING as a valid “higher-than” state when expected is PROVISIONING

This would prevent unnecessary retries.
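
One possible shape for this, shown against the branch quoted earlier (a sketch only, not necessarily how PR #5268 implements it):

case WorkspaceStatusRunning:
    if expectedStatus == WorkspaceStatusRunning {
        // Reachability is only relevant when RUNNING was explicitly expected.
        return a.verifyWorkspaceReachable(workspace)
    }
    // The workspace has already progressed past PROVISIONING (or another
    // earlier state), so stop retrying and treat the read as successful.
    return nil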


Impact

This behavior:

  • Breaks automations that create workspaces from runners with DNS caching.
  • Causes false negatives in Terraform plans.
  • Forces manual recovery via import or workspace deletion.
  • Leads to significant friction in IaC workflows.

Environment

  • Terraform runner with a DNS cache that outlives the Databricks workspace DNS propagation time.
  • Provider version: 1.97.0
  • Cloud: AWS (but issue affects all providers)

devlucasc · Dec 10 '25

Hi @devlucasc

Thank you for reaching out and for the detailed investigation and explanation. Thank you also for contributing a PR toward a solution.

However, we do think that the PR could have unwanted side effects, so we are discussing internally whether a different approach is needed.

We will get back to you once we have a decision.

hectorcast-db · Dec 11 '25