terraform-provider-databricks icon indicating copy to clipboard operation
terraform-provider-databricks copied to clipboard

[ISSUE] non-admin account: destroying a cluster with `access_control` leaves it in a broken state

Open m-theisen opened this issue 3 years ago • 13 comments
trafficstars

Hi there, We are in a situation where a different team creates and grands our developers access to a Azure Databricks instance. Due to compliance reasons our team is not allowed to have the admin permissions. However, we want to use Terraform to manage the "content" (clusters, repos, DBFS files, libraries, ...) in our Databricks instance., thinking that as long as the user used can do our desired tasks via the UI it should be able to do the same via Terraform.

Configuration

terraform {
    required_providers {
        databricks = {
            source = "databrickslabs/databricks"
            version = "0.5.0"
        }
    }
}

provider "databricks" {
}

data "databricks_node_type" "smallest" {
  local_disk = true
}
 
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "some_cluster" {
    cluster_name = "terraform-test"
    spark_version = [data.databricks_spark_version.latest_lts.id](http://data.databricks_spark_version.latest_lts.id/)
    node_type_id = [data.databricks_node_type.smallest.id](http://data.databricks_node_type.smallest.id/)
    autotermination_minutes = 10

    spark_conf = {
      # Single Node
      "spark.databricks.cluster.profile" : "singleNode"
      "spark.master" : "local[*]"
    }

    custom_tags = {
      "ResourceClass" = "SingleNode"
    }
}

resource "databricks_permissions" "cluster_usage" {
  cluster_id = databricks_cluster.some_cluster.cluster_id

  access_control {
    group_name = "group-name"
    permission_level = "CAN_MANAGE"
  }

}

Expected Behavior

Delete the defined cluster.

Actual Behavior

The operation fails with this error: Error: <account-name> does not have Owner permissions on <cluster-id>. Please contact the owner or an administrator for access.

Additionally, the cluster is left in a state where only administrators can interact with it (destroy, start, stop, manage permissions, ...).

Steps to Reproduce

Please list the steps required to reproduce the issue, for example: 0. use a non-admin Databricks account

  1. terraform apply
  2. terraform destroy

Terraform and provider versions

Terraform v1.1.6 on windows_amd64

Important Factoids

The account used is not an admin account.

m-theisen avatar Feb 23 '22 15:02 m-theisen

@m-theisen is the cluster created by the same user?

nfx avatar Feb 25 '22 19:02 nfx

@nfx yes it is. It is created in the terraform apply step (with the same user).

My suspicion of what happens is that on the destroy command terraform removes the cluster_usage resource, which removes its own cluster manage permission. What I understand from the documentation this should not happen because the cluster owner should retain/get back its permissions.

m-theisen avatar Feb 25 '22 20:02 m-theisen

@m-theisen Please provide debug logs with API traffic for further investigation.

Primary use of terraform is for the admin level privileges

nfx avatar Feb 26 '22 10:02 nfx

@nfx sorry for not providing the log from the start but here it is now: https://gist.github.com/m-theisen/1b9cfed16e687ef6810aac9c5d7c133f

m-theisen avatar Mar 02 '22 14:03 m-theisen

@m-theisen okay, I see now from logs what the problem you're facing. Currently core provider team is busy on other priorities, and would be glad if you submit a Pull Request with a fix concerning your situation. Take a look ar permissions/resource_permissions.go, specifically for jobs handling, and add a corner case for "removing permissions" on cluster, by setting the CAN_MANAGE level to creator_user_name from ClusterInfo. And a unit test, of course.

nfx avatar Mar 02 '22 14:03 nfx

@nfx thanks for the quick feedback! I have to see if I get the time to create this PR myself. I'll come back to you once I do.

m-theisen avatar Mar 02 '22 14:03 m-theisen

Contributing.md has local dev env instructions, so probably start there

nfx avatar Mar 02 '22 14:03 nfx

@m-theisen any updates on the PR for this?

nfx avatar Mar 07 '22 11:03 nfx

@nfx actually yes. I was looking into what needs changing and as you said it isn't so much.

But while investigating this I noticed something else (you are probably already aware of). We will run into this particular "destroy issue" not only with clusters but any other resource to which we attach a "non-default" permission. Since destroying the permission object means overwriting the existing list with the "admin-only" permission list, which prevents our non-admin user from modifying the resource further.

In the case of clusters there's the solution of assigning the cluster owner CAN_MANAGE permissions upon destruction (which I think also makes sense in the context of the Databricks access policies, considering that these are the default permissions). As far as I could see most other resources don't have something like an owner (which makes sense as well).

So currently we are re-evaluating if we can use terraform for what we have in mind or not. Currently I don't see a solution for our permission "problem" apart from giving the current user CAN_MANAGE permissions for each resource where the attached permission gets destroyed. But I don't really think this is a viable solution which should be implemented here.

The conclusion is: I think adding the corner case for the cluster permissions makes sense, but it became a lower priority for us/me.

m-theisen avatar Mar 07 '22 12:03 m-theisen

In the case of clusters there's the solution of assigning the cluster owner CAN_MANAGE permissions upon destruction

@m-theisen that's what i've meant

nfx avatar Mar 07 '22 12:03 nfx

In the case of clusters there's the solution of assigning the cluster owner CAN_MANAGE permissions upon destruction

@m-theisen that's what i've meant

Yes, that's how I understood you. But while looking into how I would implement this I noticed the other "problems".

m-theisen avatar Mar 07 '22 12:03 m-theisen

@m-theisen

This is not terraform best practice, but I believe it might be a workaround for the issue you are encountering.

Try: terraform apply terraform state list terraform state rm databricks_permissions.cluster_usage terraform destroy

jasontay-db avatar Apr 08 '22 21:04 jasontay-db

I'm seeing an error that I believe is related to this issue. terraform version=0.15.5 terragrunt version=0.27.1 Databricks provider version=0.3.7

We're creating an all purpose cluster using a service principal. We want to also grant access to a Databricks group, using the following code:

resource "databricks_permissions" "cluster_usage_group" {
  cluster_id = databricks_cluster.aqua_demo.cluster_id

  access_control {
    group_name       = var.team_name
    permission_level = "CAN_RESTART"
  }

}

Doing this creates the cluster and grants this access to the team. However, it replaces the owner permissions from the service principal. When we don't include this block, the service principal retains its permissions. Our deployment succeeds but the service principal doesn't have access to delete the cluster later since it's no longer an owner. We also tried to get the cluster ID as an output from the deployment, and this failed because the service principal no longer had access to the cluster to get it's ID. I think the root of the issue that @m-theisen faced. When adding cluster usage permissions, it removed manage access from the creator of the cluster

zroley23 avatar Apr 29 '22 16:04 zroley23

Following up - is this issue still relevant?

nfx avatar Aug 22 '22 09:08 nfx

@nfx we would still very much like to manage our resources in Databricks via terraform with the limited non-admin user. We are, however, currently not investing time to solve this cluster permission issue, without having a clear way forward how we could also solve the problem for the other resource types.

m-theisen avatar Aug 25 '22 07:08 m-theisen

@m-theisen have you tested this with the latest version of the provider? there was a fix for MLflow model and iirc it applies to all databricks_permissions resource

nkvuong avatar Aug 25 '22 08:08 nkvuong

Thanks @nkvuong for pointing this out. That sounds promising. We have not tested it yet, though. I won't have time this week but I'll hopefully be able to get back to you next week.

m-theisen avatar Aug 25 '22 08:08 m-theisen

@nkvuong I was finally able to do some testing using the databricks/databricks v1.2.1 and unfortunately the issue seems to persist. I had a look at the PR you linked and if I am not mistaken (I'm not at all fluent in go) this line

	for _, prefix := range [...]string{"/registered-models/"} {

explicitly restricts the new feature to only apply to the ML Models and not to other resources, like clusters.

m-theisen avatar Sep 02 '22 10:09 m-theisen