Add support for UC volumes in DABs
Changes
This PR adds the UC volumes resource to DABs. It also adds a warning when a user references a volume in the artifact_path that is managed by the bundle but has not been deployed yet. In that case users need two incremental deploys, because artifacts are always uploaded before any resources are created.
A proper fix that allows using the volume within the same deploy would require moving away from our dependency on Terraform for managing the CRUD lifecycle of DABs resources.
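To illustrate the two-deploy case, here is a minimal sketch of a bundle that both defines a UC volume and points artifact_path at it. The names (main, caps, my_volume) are placeholders; on a fresh workspace the volume must be deployed once before artifacts can be uploaded into it:

```yaml
bundle:
  name: bundle-playground

workspace:
  # Artifacts are uploaded here *before* resources are created,
  # so this path must already exist at deploy time.
  artifact_path: /Volumes/main/caps/my_volume/artifacts

resources:
  volumes:
    my_volume:
      catalog_name: main
      schema_name: caps
      name: my_volume
```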
Tests
Unit tests, integration tests, and manual testing.
Manual outputs:
- UC volume does not exist:
➜ bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/my_volume that is configured in the artifact_path: Not Found
- UC Volume does not exist, but is defined in DAB
➜ bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/managed_by_dab that is configured in the artifact_path: Not Found
Warning: You might be using a UC volume in your artifact_path that is managed by this bundle but which has not been deployed yet. Please deploy the UC volume in a separate bundle deploy before using it in the artifact_path.
at resources.volumes.bar
in databricks.yml:24:7
Nightlies passed once. Triggered another round after some minor updates.
Is there a potential release date for this feature? We're really looking forward to it
@witi83 Thanks for reaching out, this PR is still at least a week or two from being merged and released.
The primary blocker is consolidating how we do prefixing for UC resources (with dev_<your-name>, for example) when a bundle has mode: development set; we do not want to add support for UC volumes before we have clarity there.
We are actively working on unblocking and merging this PR.
Hello! Thanks for adding this new feature -- I look forward to adopting it for many future projects.
One use-case I have: create a DAB-managed UC Volume for storing checkpoint directories for various Spark structured streams. I expect to have a bunch of code that looks like this:
.option('checkpointLocation', os.path.join('/', 'Volumes', catalog, schema, 'checkpoints', 'my_structured_stream_name'))
It would be nice if I could have something like this:
.option('checkpointLocation', os.path.join(dbutils.widgets.get('checkpoints_volume'), 'my_structured_stream_name'))
The checkpoints_volume parameter could be passed in via the DAB job configuration:
parameters:
- name: checkpoints_volume
default: ${resources.volumes.checkpoints.mount_location} # or similar
Is it possible to include a property like resources.volumes.checkpoints.mount_location in the DAB configuration so that we don't have to reconstruct the volume mount path from other parameters (e.g., catalog and schema)?
Thanks!
Hi @stefan-novak-brt, Thanks for reaching out! The available fields follow the API specification closely. Since the GET / CREATE APIs for UC volumes do not provide a field equivalent to the mount location, you can't get the path in a single parameter.
There are a couple of alternatives, though, like:
parameters:
- name: catalog
default: ${resources.volumes.checkpoints.catalog_name}
- name: schema
default: ${resources.volumes.checkpoints.schema_name}
OR
you could use the ID of the volume, which is catalog_name.schema_name.volume_name (you'd have to replace the dots with slashes):
parameters:
- name: volume_id
default: ${resources.volumes.checkpoints.id}
I'd strongly recommend the first approach over the second one though.
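For the second approach, converting the volume ID into a path could be done in the job code with a small helper like the following. This is just an illustrative sketch (the function name and sample ID are made up, not part of any API):

```python
def volume_path_from_id(volume_id: str) -> str:
    """Turn a UC volume ID ("catalog.schema.volume") into its /Volumes path."""
    catalog, schema, volume = volume_id.split(".")
    return "/".join(["/Volumes", catalog, schema, volume])

print(volume_path_from_id("main.caps.checkpoints"))
# /Volumes/main/caps/checkpoints
```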
@shreyas-goenka Could you also merge in main? There are a few small conflicts.
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:
Trigger: go/deco-tests-run/cli
Inputs:
- PR number: 1762
- Commit SHA:
d460bd68591f016748d1b06f635e648b82e077e0
Checks will be approved automatically on success.
Test Details: go/deco-tests/12122596769