
Support for post deploy hooks

Open · hejcman-enverus opened this issue 1 month ago · 3 comments

Problem statement

I found two ways to overload existing mechanisms to run our pre-deploy hooks. However, I found no way to implement post-deploy hooks in an asset bundle.

Post-deploy hooks could be useful in circumstances where further setup of the Databricks environment is required beyond what asset bundles offer. Our specific use case involves setting up tables in the schema deployed as part of the asset bundle resources and running migrations on top of those tables. However, other uses (running a deployed job, cleaning up the local environment, ...) could be explored as well.

Adding this directly to the bundles, instead of relying on our CI/CD environment, gives better support for local development and the Databricks VS Code extension.

Existing solutions

It is possible to overload the databricks-bundles Python library or the artifacts mapping to run arbitrary commands or Python functions before deploying an asset bundle.

Artifacts mapping

We can execute custom commands in the build section of the artifacts mapping:

artifacts:
  pre_deploy_hook:
    build: "echo Pre deploy hook"

As far as I can tell, we don't need to produce any actual artifacts in this mapping.
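In practice, the build command can be any command the deploying machine can run, so a pre-deploy hook might look more like the following sketch (the script name is just a placeholder):

artifacts:
  pre_deploy_hook:
    build: "python scripts/pre_deploy.py"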

Databricks Bundles Library

A cleaner approach is to use the databricks-bundles Python library and run our custom function before returning the loaded bundle resources:

experimental:
  python:
    resources:
      - "resources:load_resources"

# resources.py (the module referenced above)
from databricks.bundles.core import Bundle, Resources, load_resources_from_current_package_module

def load_resources(bundle: Bundle) -> Resources:
    my_custom_function()
    return load_resources_from_current_package_module()

Proposed solution

The simplest solution would probably be using databricks-bundles and adding a hooks section to the python mapping in the databricks.yml file, where we could define our post_deploy function to run after a successful deployment of the asset bundle:

experimental:
  python:
    resources:
      - "resources:load_resources"
    mutators:
      - "resources:some_mutator"
    hooks:
      - "resources:post_deploy"

# resources.py
from databricks.bundles.core import Bundle

def post_deploy(bundle_summary: Bundle) -> None:
    run_my_hooks()
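To make the intent concrete, here is a rough sketch of what run_my_hooks could do in our case, assuming the Databricks SDK is available on the machine running the deployment; the warehouse ID and the catalog/schema/table names are placeholders:

from databricks.sdk import WorkspaceClient

def run_my_hooks() -> None:
    w = WorkspaceClient()
    # Example post-deploy action: make sure a table exists in the schema
    # that the bundle just deployed.
    w.statement_execution.execute_statement(
        warehouse_id="<warehouse-id>",
        statement=(
            "CREATE TABLE IF NOT EXISTS main.my_project.events "
            "(id BIGINT, ts TIMESTAMP)"
        ),
    )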

A more involved solution could be to add a hooks mapping directly to databricks.yml, which would allow defining pre/post hooks for the deploy, run, and destroy actions independently:

hooks:
  pre_deploy:
    - name: "Run tests"
      command: "pytest tests"
  post_destroy:
    ...

Expected outcome

A discussion about whether this feature makes sense for more users and for the development team. Is this particular feature something that you would be interested in pursuing or accepting a PR for?

Since the databricks-bundles library repository is currently not public, I am unfortunately unable to open the discussion there.

hejcman-enverus, Oct 22 '25 08:10

Thanks for the detailed proposal!

I'd love to learn more about how you're planning to use this. What kind of pre-deploy hooks do you have in mind? I get that running pytest before deploying can be done, though I would argue that belongs in a separate process (CI/CD or the local development loop). The post-deploy hooks are interesting. You mention running something that sets up tables or performs schema migrations. Would you run that as a job itself, or as a CLI command? We have had ideas floating around to make job runs part of the deployment graph, so you could do something like 1) deploy a job, 2) run the job to ensure a table exists, 3) deploy something else that depends on the table from step 2 existing.

Having more examples of pre and post deploy actions will help figure out the right solution.

Btw, there is no separate databricks-bundles repository; the library is included in this repository under experimental/python.

pietern, Oct 23 '25 07:10

Thanks for your response.

The main reason we went the pre-deploy-hook route was the lack of post-deploy hooks. However, we have since realised that the combination of pre-deploy and post-deploy hooks would be the more flexible solution. We are currently deciding between the following two solutions.


Our initial solution - job-based

This is essentially steps 1 and 2 from the workflow you proposed, with job runs being part of the bundle deployment. This is the approach we currently have implemented in our CI:

  1. Define a "Unity setup" job with all our migrations
  2. Define the schema as a bundle resource
  3. Deploy the bundle
  4. Run the unity setup job with databricks bundle run (see the sketch below)
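For reference, steps 3 and 4 boil down to two CLI invocations, assuming the job is defined as a bundle resource under the key unity_setup and deployed to a target called dev (both names are placeholders):

databricks bundle deploy -t dev
databricks bundle run unity_setup -t dev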

This way we have a "Unity setup" job in Databricks for each of our projects. Ideally, that setup would be part of the deployment itself and wouldn't be listed as a Databricks job alongside the main project jobs, but that's a small issue. The larger issue is that this doesn't work outside of our CI, because we have no way to trigger the job run automatically when users deploy locally, which is a big use case for us.

Furthermore, we decided to limit ourselves to serverless compute in the unity setup job, because we found that the 5-10 minute delay caused by cluster startup for our migration task was detrimental to our development workflow. However, this does limit what we can do in the unity setup job.

This is why we are currently moving to the following approach...


Other solution - local-based

This workflow would leverage local tools to prepare the Databricks environment before the bundle deployment:

  1. Create the bundle schema with the Databricks SDK
  2. Run our migrations on top of that schema using Databricks SQLAlchemy and Alembic
  3. Deploy the bundle

This would all happen in the load_resources function in PyDABs, as I outlined before (a rough sketch follows below). This approach provides more freedom but obviously has some drawbacks:

  • Our bundle resource definition is fragmented between the SDK and the bundle YAMLs.
  • No support for databricks bundle destroy - since the schema is no longer created by Terraform, and we have no way to trigger its cleanup, it lingers in Databricks until we remove it manually (which is infeasible to do at scale).
  • Overloading the load_resources function doesn't feel quite right, especially for a bundle setup that we want to enforce across multiple teams.
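For illustration, here is a rough sketch of what steps 1 and 2 could look like inside load_resources, assuming the Databricks SDK and Alembic; the catalog and schema names and the alembic.ini configuration are placeholders:

# resources.py (sketch only)
from alembic import command
from alembic.config import Config
from databricks.bundles.core import Bundle, Resources, load_resources_from_current_package_module
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

CATALOG = "main"
SCHEMA = "my_project"

def load_resources(bundle: Bundle) -> Resources:
    w = WorkspaceClient()
    # Step 1: create the schema with the SDK if it does not exist yet.
    try:
        w.schemas.get(f"{CATALOG}.{SCHEMA}")
    except NotFound:
        w.schemas.create(name=SCHEMA, catalog_name=CATALOG)
    # Step 2: run the Alembic migrations; alembic.ini is assumed to point at
    # the schema via the Databricks SQLAlchemy dialect.
    command.upgrade(Config("alembic.ini"), "head")
    # Step 3 happens afterwards: the CLI deploys the resources returned here.
    return load_resources_from_current_package_module()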

I think the best solution is a combination of the two approaches. Having job runs be part of the bundle deployment is definitely a step in the right direction and would solve some of our use cases, but it feels more limiting than what we would be able to do with PyDABs.

Since we are now leaning more into PyDABs in our organisation (validation of some bundle parameters, resource names, etc.), extending the functionality of PyDABs with pre and post hooks feels like a more natural development for us.

hejcman-enverus, Oct 23 '25 15:10

In light of our previous discussion, I outlined a possible implementation of the post-deploy hooks using PyDABs in PR https://github.com/databricks/cli/pull/3902.

Hoping we can get this discussion going again. Are my proposed changes in line with what you would support?

hejcman-enverus, Nov 11 '25 07:11