
Use Databricks Container Services for CI/CD

Open · mpindado opened this issue 3 years ago • 2 comments

Hi, we are thinking about using Databricks Container Services and generating Docker images as our final artifacts, the same way it is usually done in Kubernetes or any modern cloud platform that supports Docker. To do that, we create a Dockerfile that copies all of our generated jars, including dependencies, into /databricks/jars, and publish the Docker image to a container registry. For example, our CI pipeline generates software version 1.2.0-development.aabbbccdd, which gets deployed as a Docker image with that tag. All of our libraries and dependencies are in /databricks/jars.
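A minimal Dockerfile along those lines might look like the sketch below; the base image tag and file paths are illustrative assumptions, not taken from the setup described above.

```dockerfile
# Illustrative sketch only: the base image tag and jar locations are assumptions.
# Jars copied under /databricks/jars end up on the cluster classpath.
FROM databricksruntime/standard:latest

# Copy the application jar and its dependencies produced by the CI build.
COPY target/dependency/*.jar /databricks/jars/
COPY target/my-app-1.2.0-development.aabbbccdd.jar /databricks/jars/
```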

If we create a cluster and set its Docker container image to that specific version, we can run a Scala notebook that has access to our classes, so this is great in terms of CI/CD.

However, we have a "problem". We launch jobs from Azure Data Factory, which in turn creates Databricks jobs that run on our cluster. The problem is that it seems mandatory to always add an "additional library". We understand this requirement comes from the fact that the job is emulating a spark-submit, so we have to pass the jar containing the main class. However, in our containers the jars are already inside the image and on the classpath. It seems we can add any file (with any extension) as an additional library, as long as it exists in DBFS, so we pass a dummy jar and everything works.
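For reference, a run submitted this way could look roughly like the sketch below against the Jobs Runs Submit API (2.0). The workspace URL, token, node type, image name, main class, and placeholder jar path are all assumptions for illustration, not values from this issue.

```python
# Hedged sketch: workspace URL, token, cluster size, image and class names are placeholders.
import requests

payload = {
    "run_name": "adf-triggered-run",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        # CI-built Databricks Container Services image; the tag carries the software version.
        "docker_image": {
            "url": "myregistry.azurecr.io/my-app:1.2.0-development.aabbbccdd"
        },
    },
    # As described above, an additional library seems to be required, so a
    # placeholder jar that exists in DBFS is passed even though the real jars
    # already sit inside the image.
    "libraries": [{"jar": "dbfs:/tmp/placeholder.jar"}],
    "spark_jar_task": {"main_class_name": "com.example.Main"},
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/jobs/runs/submit",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the submitted run
```

Whether that placeholder library keeps being accepted is exactly the compatibility concern raised here.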

We are worried about building our CI/CD around something that may break in the future, but we really dislike the way Databricks handles libraries and would prefer the Docker approach.

So we have a few questions:

  • Have you seen this problem before, and is it something that customers are requesting?
  • Is there any alternative?
  • Do you suggest not using containers this way?

Best regards

mpindado · Dec 01 '21 13:12

If I understand this correctly, the ask here is to not require an additional library when running a jar against a DCS cluster since you're baking the library into the cluster itself.

Are you using a job on an existing DCS cluster or does the job create a new DCS cluster?

evanye · Apr 07 '22 22:04

It's been a while since we posted this. The approach we finally took was not to use the Docker container but to build the library and deploy it to DBFS during CI/CD. The Data Factory job then creates a new job cluster, passing the path of the library in DBFS. Besides the problem of having to fake the API with a dummy library, we could not use private registries, so in the end using Docker was impractical for us.
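A rough sketch of that flow, assuming the jar is pushed to DBFS from the CI pipeline via the DBFS API; the workspace URL, token, and paths are placeholders, not taken from this thread.

```python
# Hedged sketch of the CI step: upload the built jar to DBFS, then reference it
# as a job library. Workspace URL, token, and paths are placeholder assumptions.
import base64
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <token>"}

# 1) Upload the jar produced by the build. A single dbfs/put call is limited to
#    about 1 MB of contents; larger jars need the streaming endpoints
#    (dbfs/create, dbfs/add-block, dbfs/close) or the databricks CLI.
with open("target/my-app-1.2.0.jar", "rb") as f:
    requests.post(
        f"{HOST}/api/2.0/dbfs/put",
        headers=HEADERS,
        json={
            "path": "/artifacts/my-app-1.2.0.jar",
            "contents": base64.b64encode(f.read()).decode(),
            "overwrite": True,
        },
    ).raise_for_status()

# 2) The Data Factory activity (or a direct Runs Submit call) then points the
#    new job cluster at that jar, with no Docker image or dummy library involved.
library_setting = [{"jar": "dbfs:/artifacts/my-app-1.2.0.jar"}]
```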

mpindado · Apr 08 '22 05:04