cli icon indicating copy to clipboard operation
cli copied to clipboard

Add support for UC volumes in DABs

Open shreyas-goenka opened this issue 1 year ago • 1 comments

Changes

This PR adds the UC volumes resource to DABs. This PR also adds a warning if a user tries to use a volume in the artifact_path that is managed by DABs and has not been deployed yet. Users would need to do two incremental deploys in this case because we always upload artifacts before creating all the resources.

A proper fix where we can allow using the volume within the same deploy would require us to head in a direction of removing our dependency on terraform to manage the CRUD lifecycle of DABs resources.

Tests

Unit, integration test and manually.

Manual outputs:

  1. UC volume does not exist:
➜  bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/my_volume that is configured in the artifact_path: Not Found
  1. UC Volume does not exist, but is defined in DAB
➜  bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/managed_by_dab that is configured in the artifact_path: Not Found

Warning: You might be using a UC volume in your artifact_path that is managed by this bundle but which has not been deployed yet. Please deploy the UC volume in a separate bundle deploy before using it in the artifact_path.
  at resources.volumes.bar
  in databricks.yml:24:7

shreyas-goenka avatar Sep 09 '24 11:09 shreyas-goenka

Nightlies passed once. Triggered another round after some minor updates.

shreyas-goenka avatar Sep 16 '24 02:09 shreyas-goenka

Is there a potential release date for this feature? We're really looking forward to it

witi83 avatar Nov 05 '24 15:11 witi83

@witi83 Thanks for reaching out, this PR is still atleast a week or two from being merged and released.

The primary blocker is us trying to consolidate how we do prefixing for UC resources (with dev_<your-name> for example) when a bundle has mode: development set and we do not want to add support for UC volumes before we have clarity there.

We are actively working on unblocking and merging this PR.

shreyas-goenka avatar Nov 05 '24 15:11 shreyas-goenka

Hello! Thanks for adding this new feature -- I look forward to adopting it for many future projects.

One use-case I have: create a DAB-managed UC Volume for storing checkpoints directories for various Spark structured streams. I expect to to have a bunch of code that looks like this:

.option('checkpointLocation', os.path.join('/', 'Volumes', catalog, schema, 'checkpoints', 'my_structured_stream_name'))

It would be nice if I could have something like this:

.option('checkpointLocation', os.path.join(dbutils.wigest.get('checkpoints_volume'), 'my_structured_stream_name'))

The checkpoints_volume parameter could be passed in via the DAB job configuration:

        parameters:
            - name: checkpoints_volume
              default: ${resources.volumes.checkpoints.mount_location} # or similar

Is it possible to include a property like resources.volumes.checkpoints.mount_location to the DAB configuration so that we don't have to reconstruct the volume mount path with other partition parameters (e.g, catalog and schema)?

Thanks!

stefan-novak-brt avatar Nov 24 '24 06:11 stefan-novak-brt

Hi @stefan-novak-brt, Thanks for reaching out! The available API fields follows the API specification closely. Since the GET / CREATE API for UC volumes does provide a fields equivalent to mount location, you can't get the path in a single parameter.

There's a couple of alternatives though like:

        parameters:
            - name: catalog
              default: ${resources.volumes.checkpoints.catalog_name}
            - name: schema
              default: ${resources.volumes.checkpoints.schema_name}

OR

you could use the ID of the volume which would be catalog_name.schema_name.volume_name like (you'd have to replace the . with /):

        parameters:
            - name: volume_id
              default: ${resources.volumes.checkpoints.id}

I'd strongly recommend the first approach over the second one though.

shreyas-goenka avatar Nov 25 '24 21:11 shreyas-goenka

@shreyas-goenka Could you also merge in main? There are a few small conflicts.

pietern avatar Nov 28 '24 16:11 pietern

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger: go/deco-tests-run/cli

Inputs:

  • PR number: 1762
  • Commit SHA: d460bd68591f016748d1b06f635e648b82e077e0

Checks will be approved automatically on success.

github-actions[bot] avatar Dec 02 '24 15:12 github-actions[bot]

Test Details: go/deco-tests/12122596769

eng-dev-ecosystem-bot avatar Dec 02 '24 15:12 eng-dev-ecosystem-bot