databricks-asset-bundles-dais2023

Upload jar library without syncing other files/folders

Open MrJSmeets opened this issue 2 years ago • 14 comments

Hi,

I would like to upload an existing JAR as a dependent library to a job/workflow without having to sync any other files/folders. Currently, all files and folders are always synchronized, but I don't want that; I only need the jar in the target/scala-2.12 folder.

sync:
  include:
    - target/scala-2.12/*.jar

Folder structure:

.
├── README.md
├── build.sbt
├── databricks.yml
├── src
│   └── main
│       ├── resources
│       │   └── ...
│       └── scala
│           └── ...
└── target
    ├── global-logging
    └── scala-2.12
        └── xxxxxxxxx-assembly-x.x.x.jar

With dbx, this was possible by using file references. What is the recommended way to do this via DAB, without syncing other files/folders?

I expected this to be possible via artifacts, but that seems to be (for now?) only intended for Python wheels.

MrJSmeets avatar Sep 25 '23 15:09 MrJSmeets

By default, DABs excludes files and folders from syncing based on the .gitignore file if you're using Git. If you're not using Git, or don't want to include certain files in .gitignore, you can use the sync.exclude property.

sync:
  exclude:
    - src/**/*
    - databricks.yml
    - build.sbt
    - target/global-logging/*

andrewnester avatar Sep 27 '23 14:09 andrewnester

Thanks @andrewnester. It seems that uploading a jar via this synchronisation method is not really the right way for Scala projects. I will instead upload my jar to adls/s3 and put the databricks.yml file in a subfolder so I don't have to clutter my job definitions with this list of excludes.
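
For reference, a rough sketch of what that could look like once the JAR is in cloud storage (bucket path, job, cluster and class names below are placeholders, not from this thread):

resources:
  jobs:
    my_scala_job:
      name: my-scala-job
      tasks:
        - task_key: main
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder
          job_cluster_key: job_cluster   # cluster defined elsewhere in the bundle
          libraries:
            - jar: s3://my-bucket/libs/xxxxxxxxx-assembly-x.x.x.jar   # JAR uploaded outside the bundle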

Hopefully something similar to the file references at dbx will be available in the future. That made it very useful to upload a JAR with the job definition during local development.

MrJSmeets avatar Sep 28 '23 13:09 MrJSmeets

Have there been any updates on this feature? We are also struggling to manage deployments of jar files as part of the DAB deployment. It doesn't seem to work via include because there isn't support for the jar file in artifacts and it complains about not having a relevant artifact specification.

mike-smith-bb avatar Apr 22 '24 15:04 mike-smith-bb

@mike-smith-bb DABs already supports building and automatic upload of JARs, so the configuration can look something like this:

artifacts:
  my_java_project:
    path: ./path/to/project
    build: "sbt package"
    type: jar
    files:
      - source: ./path/to/project/targets/*.jar

Note that you have to explicitly specify the files source section to point to where the built JARs are located.
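
As a rough sketch (project, class and JAR names are placeholders), a task can then reference the built JAR with a local path:

resources:
  jobs:
    my_jar_job:
      name: my-jar-job
      tasks:
        - task_key: main
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder
          libraries:
            - jar: ./path/to/project/targets/my-project-1.0.jar   # placeholder; matches the files source above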

Also, please make sure you're using the latest CLI version (0.217.1 as of now).

If you still experience any issues, feel free to open an issue in the CLI repo: https://github.com/databricks/cli/issues

andrewnester avatar Apr 22 '24 16:04 andrewnester

Thanks, @andrewnester. Your suggestion, I think, assumes that we are building the artifact as part of the DAB deployment. What if the jar file is built by a different process and we simply want to include it as a library on the job cluster that gets created, storing it in the DAB structure? Is this supported?

mike-smith-bb avatar Apr 22 '24 17:04 mike-smith-bb

@mike-smith-bb yes, indeed.

Then just using the sync include section should work. Does it work for you?

sync:
  include:
    - target/scala-2.12/**/*.jar

Paths can be defined using gitignore-like syntax, so it should be pretty flexible to match only what you need.
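
For instance, include and exclude patterns can be combined (patterns below are only illustrative):

sync:
  include:
    - target/scala-2.12/**/*.jar
  exclude:
    - target/global-logging/**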

andrewnester avatar Apr 23 '24 09:04 andrewnester

sync:
  include:
    - target/scala-2.12/**/*.jar

Paths can be defined using gitignore-like syntax, so it should be pretty flexible to match only what you need.

@andrewnester — This doesn't seem to work with jar files. Even if I sync the file like you showed, I can't add those jar files as dependencies.

If I do

sync:
  include:
    - resources/lib/*
...
resources:
  jobs:
    my_job:
      name: My Job
      tasks:
        - task_key: mytask
          notebook_task:
            notebook_path: ../src/mymodule/myfile.py
          job_cluster_key: job_cluster
          libraries:
            - jar: /Workspace/${workspace.file_path}/resources/lib/my_custom.jar

I get this error:

[error screenshot]

I'm assuming it's because of this:

[screenshot]

Do we have any options available to add a jar dependency from source like we used to do with dbx?

jmatias-gilead avatar May 07 '24 22:05 jmatias-gilead

We found the same and came to the same conclusion. Seems like we need a pre-deploy step that can inject the jar/dependencies into a volume or cloud storage and also manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.

mike-smith-bb avatar May 08 '24 13:05 mike-smith-bb

We found the same and came to the same conclusion. Seems like we need a pre-deploy step that can inject the jar/dependencies into a volume or cloud storage and also manage the location of the dependency through the sync and dependency configs. Interested in other approaches here.

Ok, I will do that in the meantime and see how that goes.

jmatias avatar May 09 '24 09:05 jmatias

I'm having the same issue. I have a Databricks Volume for jar libraries. My current workaround is just using the aws CLI to upload the files before deploying the bundle. However, what if it's an internal/managed Volume? I think Databricks could include the option to upload a file to a Volume.

fernanluyano avatar Sep 27 '24 17:09 fernanluyano

Since version 0.224.0, DABs supports uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml

You can omit the artifacts section entirely if you don't want the JAR to be rebuilt automatically as part of the deploy, and just deploy the one referenced from the libraries field.
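
A minimal sketch along those lines (catalog, schema, volume, paths and class names are placeholders):

workspace:
  artifact_path: /Volumes/main/default/libs   # UC Volume that local libraries are uploaded to

resources:
  jobs:
    spark_jar_job:
      name: spark-jar-job
      tasks:
        - task_key: run_jar
          spark_jar_task:
            main_class_name: com.example.Main   # placeholder
          job_cluster_key: job_cluster   # cluster defined elsewhere in the bundle
          libraries:
            - jar: ./target/scala-2.12/my-assembly.jar   # local JAR, uploaded to artifact_path on deploy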

andrewnester avatar Sep 30 '24 08:09 andrewnester

Since version 0.224.0, DABs supports uploading JARs to UC Volumes; you can find an example here: https://github.com/databricks/bundle-examples/blob/main/knowledge_base/spark_jar_task/databricks.yml

You can omit the artifacts section entirely if you don't want the JAR to be rebuilt automatically as part of the deploy, and just deploy the one referenced from the libraries field.

Thanks for sharing @andrewnester.

What happens if I want to upload the jar and deploy my workflows using the same bundle? If I set the artifact_path to the UC Volume, then the whole bundle will be deployed there, no? Though perhaps that wouldn't be a bad thing...

jmatias avatar Oct 02 '24 23:10 jmatias

@jmatias no, not really. artifact_path is the path that local libraries are uploaded to; DABs doesn't yet support deploying the whole bundle to Volumes (that would be the file_path config).
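
In config terms, a rough sketch of the difference (paths below are placeholders):

workspace:
  file_path: /Workspace/Users/someone@example.com/.bundle/my_bundle/dev/files   # where the bundle's files are synced
  artifact_path: /Volumes/main/default/libs                                     # where local libraries such as JARs are uploaded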

andrewnester avatar Oct 03 '24 12:10 andrewnester

@andrewnester it worked!

jmatias avatar Oct 07 '24 17:10 jmatias