[ISSUE] Issue with `databricks_mws_workspaces` resource
Configuration
resource "databricks_mws_workspaces" "this" {
provider = databricks.accounts
account_id = ...
aws_region = ....
workspace_name = ...
deployment_name = ....
pricing_tier = "ENTERPRISE"
expected_workspace_status = "PROVISIONING"
credentials_id = databricks_mws_credentials.this.credentials_id
storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
network_id = databricks_mws_networks.this.network_id
managed_services_customer_managed_key_id = databricks_mws_customer_managed_keys.managed_services.customer_managed_key_id
private_access_settings_id = databricks_mws_private_access_settings.pas.private_access_settings_id
}
Expected Behavior
- Update the documentation to reflect that this field affects AWS too.
- It should be possible to plan without expecting "PROVISIONING" as the status when the workspace is already created.
- It should be possible to expose a variable that controls whether reachability is verified, allowing users to skip the call to the function here: https://github.com/databricks/terraform-provider-databricks/blob/main/mws/resource_mws_workspaces.go#L207
Actual Behavior
The field expected_workspace_status is documented as exclusive to GCP, but it affects AWS as well.
Subsequent plans will fail because the workspace status is now RUNNING, producing the error `Error: cannot read mws workspaces: Workspace is running.`.
But when it is set to expect the workspace status RUNNING, the provider tries to verify workspace reachability, which can fail due to DNS caching or when using private links without connectivity from the Terraform runner. Expecting the RUNNING state makes sense when applying workspace-level resources, but when only account-level resources are being applied it is unnecessary, and there is no way to control this behavior. The read then fails after the default timeout of 20 minutes, and to retry it is necessary either to delete the workspace manually so it can be recreated, or to import it into the state to avoid a resource conflict.
Steps to Reproduce
- terraform plan
Terraform and provider versions
Terraform 1.7.5, databricks/databricks 1.97.0
Would you like to implement a fix?
I could do a workaround to avoid waiting for reachability by setting the expected status to PROVISIONING and adding time_sleep resources like this:
resource "time_sleep" "wait_workspace" {
depends_on = [
databricks_mws_workspaces.this
]
triggers = {
workspace_id = databricks_mws_workspaces.this.workspace_id
}
create_duration = "300s"
}
resource "databricks_metastore_assignment" "this" {
provider = databricks.accounts
metastore_id = ....
workspace_id = databricks_mws_workspaces.this.workspace_id
depends_on = [
time_sleep.wait_workspace
]
}
But it still fails on subsequent plans, and this is probably not a good way to control resource creation because the interval can vary.
Can someone review and contribute to this PR fix: Accept running as valid wait state #5268? Thanks
Adding more detailed information:
Summary
The Read function for the workspace resource always calls WaitForExpectedStatus, regardless of cloud provider or resource usage.
This creates a series of problems when the workspace is already in RUNNING status but Terraform does not actually need to reach the workspace (e.g., blueprints that only provision account-level resources).
Because the provider unconditionally performs a reachability check whenever the expected status is RUNNING, Terraform can fail even when the workspace is fully deployed and healthy in the Accounts UI.
What the provider currently does
1. Read always calls WaitForExpectedStatus
```go
expectedStatus := d.Get("expected_workspace_status").(string)
err = workspacesAPI.WaitForExpectedStatus(workspace, expectedStatus, d.Timeout(schema.TimeoutRead))
```
There is no cloud-specific logic such as:
```go
if gcp {
    WaitForExpectedStatus(...)
}
```
So the wait-and-reachability logic runs on AWS, GCP, and Azure equally.
2. When expected_workspace_status = RUNNING, reachability is enforced
Inside WaitForExpectedStatus:
```go
case expectedStatus:
    if expectedStatus == WorkspaceStatusRunning {
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil
```
Meaning:
- If the workspace reaches RUNNING,
- the provider still calls verifyWorkspaceReachable, regardless of whether reachability is needed for the Terraform plan.
This is a problem when the Terraform runner cannot yet resolve the workspace DNS (e.g., stale DNS cache).
The problem
When the workspace is already RUNNING in the Accounts UI:
- The provider reaches the success branch…
- …but verifyWorkspaceReachable fails due to DNS cache or network conditions.
- This causes the whole Read to fail with a timeout, even though the workspace is actually ready and healthy.
As a result:
Terraform state becomes inconsistent
- Terraform believes the workspace failed or does not exist.
- The workspace does exist in Databricks Accounts.
- You must either:
  - manually import the workspace into state, or
  - delete the workspace so it can be recreated, to avoid “workspace already exists” conflicts.
Why setting expected_workspace_status = PROVISIONING is not a workaround
If I set:
```hcl
expected_workspace_status = "PROVISIONING"
```
then the workspace's actual status returned by the API is:
```go
workspace.WorkspaceStatus = "RUNNING"
```
But since RUNNING != PROVISIONING, the provider falls into the default branch:
```go
default:
    return resource.RetryableError(...)
```
It retries until timeout and still fails.
So:
- RUNNING expected → fails due to reachability
- PROVISIONING expected → fails because RUNNING is “unexpected”
No configuration avoids the failure.
Why this is incorrect behavior
- Many Terraform configurations only manage account-level resources.
- In these cases, workspace reachability should never be required.
- A DNS cache lasting longer than Terraform’s timeout should not break workspace creation.
- WaitForExpectedStatus is enforced on all cloud providers and all use cases, so it is not exclusive to GCP as the documentation says.
Proposed Fixes (any of the following would resolve the issue)
✔ Option A — Skip reachability check for account-level-only use cases
verifyWorkspaceReachable should not run when no workspace-level resources are referenced.
✔ Option B — Make reachability optional via a provider flag
Example:
provider "databricks" {
skip_workspace_reachability = true
}
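A minimal sketch of how such a flag could gate the check inside the switch case quoted earlier; the skipReachability field and the skip_workspace_reachability flag name are assumptions for illustration, not the provider's actual API:

```go
// Hypothetical sketch of Option B, applied to the case quoted earlier.
// a.skipReachability would be wired from a provider or resource flag such
// as skip_workspace_reachability; both names are illustrative assumptions.
case expectedStatus:
    if expectedStatus == WorkspaceStatusRunning && !a.skipReachability {
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil
```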
✔ Option C — Treat RUNNING as a valid “higher-than” state when expected is PROVISIONING
This would prevent unnecessary retries.
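A minimal sketch of what Option C could look like, extending the switch quoted earlier; the WorkspaceStatusProvisioning constant name is assumed by analogy with WorkspaceStatusRunning, and the error wording is illustrative:

```go
// Hypothetical sketch of Option C, extending the switch quoted earlier.
switch workspace.WorkspaceStatus {
case expectedStatus:
    if expectedStatus == WorkspaceStatusRunning {
        return a.verifyWorkspaceReachable(workspace)
    }
    return nil
case WorkspaceStatusRunning:
    if expectedStatus == WorkspaceStatusProvisioning {
        // The workspace progressed past PROVISIONING while Terraform was
        // waiting; treat that as success instead of retrying until timeout.
        return nil
    }
    return resource.RetryableError(fmt.Errorf(
        "workspace status is %s, expected %s", workspace.WorkspaceStatus, expectedStatus))
default:
    return resource.RetryableError(fmt.Errorf(
        "workspace status is %s, expected %s", workspace.WorkspaceStatus, expectedStatus))
}
```

This keeps the reachability check for users who explicitly expect RUNNING, while letting an expectation of PROVISIONING succeed once the workspace has already moved on to RUNNING.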
Impact
This behavior:
- Breaks automations that create workspaces from runners with DNS caching.
- Causes false negatives in Terraform plans.
- Forces manual recovery via import or workspace deletion.
- Leads to significant friction in IaC workflows.
Environment
- Terraform runner with a DNS cache that outlives the Databricks workspace DNS propagation time.
- Provider version: 1.97.0
- Cloud: AWS (but issue affects all providers)
Hi @devlucasc
Thank you for reaching out and for the detailed investigation and explanation. Thank you also for contributing a PR toward the solution.
However, we do think that the PR could have unwanted side effects, so we are discussing internally whether a different approach is needed.
We will reach back out once we have a decision.