Bundle deploy / destroy errors on a few jobs only

Open · cptshrk108 opened this issue 10 months ago · 9 comments

Describe the issue

When using the deploy or destroy command, certain jobs are not created or deleted. The affected jobs differ from run to run, and it is only ever 5-10 out of a total of 100.

Running the command a second time will create or destroy the remaining jobs.

Configuration

I run the commands from my local machine against a testing environment where no other jobs or bundles can conflict. None of the jobs are running or being modified.

Steps to reproduce the behavior

  1. Run databricks bundle deploy ... or databricks bundle destroy ...
  2. See the error

Expected Behavior

All jobs are created or deleted.

Actual Behavior

Some jobs are not created or deleted.

OS and CLI version

Databricks CLI v0.239.1

Is this a regression?

Same error on Databricks CLI v0.236.0

Debug Logs

Error: terraform apply: exit status 1

Error: cannot delete job: 


Error: cannot delete job: 


Error: cannot delete job: 


Error: cannot delete job: 


Bundle destroy successfully.

cptshrk108 · Jan 29 '25 00:01
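Since a second run reliably creates or destroys the remaining jobs, one stopgap is to wrap the CLI call in a retry loop. The following is a minimal Python sketch, assuming the databricks executable is on PATH; run_with_retries, max_attempts and delay_s are hypothetical names, not part of the Databricks CLI. It only re-runs the whole command when it exits non-zero and does nothing about the underlying failure.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_with_retries(cmd: Sequence[str], max_attempts: int = 3, delay_s: float = 10.0) -> int:
    """Run `cmd`, re-running it while it exits non-zero (hypothetical helper).

    Returns the exit code of the last attempt, so 0 means some attempt succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code = 1
    for attempt in range(1, max_attempts + 1):
        # stdout/stderr are inherited from this process, so the CLI logs still reach the terminal.
        code = subprocess.run(list(cmd)).returncode
        if code == 0:
            return 0
        if attempt < max_attempts:
            print(f"attempt {attempt}/{max_attempts} exited with {code}; retrying in {delay_s}s",
                  file=sys.stderr)
            time.sleep(delay_s)
    return code
```

For example, run_with_retries(["databricks", "bundle", "deploy"]) re-runs the deploy up to three times. If the command asks for confirmation, each retry will ask again.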

Thanks for reporting the issue.

Does the output not include any error messages?

I suspect this could be related to some kind of rate limiting, but I haven't seen this before.

pietern · Jan 29 '25 08:01

No, there are no other errors. I get the regular logging, where all affected jobs are listed, then the operation starts, and the only error is what I posted. The failing jobs are not named in the error, but they are still present in the workspace. Running the command a second time deletes/creates them.

cptshrk108 · Jan 29 '25 14:01

The creation error is a bit more verbose, but it doesn't give any more details:


Error: cannot create job: 

  with databricks_job.star-slv-lnd-slv-dom,
  on bundle.tf.json line 33893, in resource.databricks_job.star-slv-lnd-slv-dom:
33893:       },


Error: cannot create job: 

  with databricks_job.tpa-raw-ini-brz-init,
  on bundle.tf.json line 35802, in resource.databricks_job.tpa-raw-ini-brz-init:
35802:       },


Error: cannot create job: 

  with databricks_job.virage-brz-inc-brz-mrg,
  on bundle.tf.json line 38097, in resource.databricks_job.virage-brz-inc-brz-mrg:
38097:       },


Error: cannot create job: 

  with databricks_job.virage-raw-inc-brz-lnd,
  on bundle.tf.json line 43642, in resource.databricks_job.virage-raw-inc-brz-lnd:
43642:       },


Error: cannot create job: 

  with databricks_job.virage-raw-ini-brz-lnd,
  on bundle.tf.json line 45509, in resource.databricks_job.virage-raw-ini-brz-lnd:
45509:       },

cptshrk108 · Jan 29 '25 14:01
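Unlike the destroy output, the creation errors include a "with databricks_job.<name>," context line, so the failing resources can be pulled out of the output mechanically, for instance to check them in the workspace or to report them. This Python sketch assumes the output has been captured as text, and that the Terraform resource name matches the job key in the bundle configuration (an assumption, not confirmed in this thread); failed_job_keys is a hypothetical helper.

```python
import re
from typing import List

# Matches the Terraform diagnostic context line shown in the errors, e.g.
#   "  with databricks_job.virage-raw-inc-brz-lnd,"
_WITH_JOB = re.compile(r"^\s*with databricks_job\.([A-Za-z0-9_-]+),\s*$", re.MULTILINE)


def failed_job_keys(output: str) -> List[str]:
    """Return the databricks_job resource names named in the diagnostics,
    in order of appearance and without duplicates (hypothetical helper)."""
    keys: List[str] = []
    for key in _WITH_JOB.findall(output):
        if key not in keys:
            keys.append(key)
    return keys
```

The "on bundle.tf.json line ..." lines are deliberately not matched, so each failure is counted once. The destroy errors earlier in this thread carry no such context line, so this returns an empty list for them.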

I'm still running into this issue, and it is only resolved by deploying twice.

Is there a rate limit on the API that is not being respected?

cptshrk108 · Mar 17 '25 19:03
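Rate limiting was only a suspicion in this thread; the maintainers later traced the failures to a backend issue in the Jobs API. Still, if you script your own retries against an API that may throttle requests, a common way to space them out is exponential backoff with "full jitter". This Python sketch only computes the delay schedule; backoff_delays, base_s and cap_s are hypothetical names and the default values are arbitrary assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter exponential backoff (hypothetical helper).

    Delay i is drawn uniformly from [0, min(cap_s, base_s * 2**i)], so the upper
    bound doubles each attempt until it reaches the cap.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap_s, base_s * 2 ** i)) for i in range(attempts)]
```

The random spread keeps many clients (or many parallel job requests) from retrying at the same instant, while the cap bounds the worst-case wait.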

Thanks for bumping this. Could you send an email to [email protected] with the workspace IDs where this is happening? Then I can escalate internally. This seems to be an issue with the Jobs API rather than DABs itself.

pietern · Mar 20 '25 14:03

Bump. I'm having the same issue: out of roughly 100 jobs/workflows, different ones fail on each DAB deployment.

Dedvall · Apr 11 '25 12:04

I ran into the issue again this morning. Same behavior as before: running the deploy again fixes it. There is no other info than: Error: cannot create job: An unexpected error occurred

@pietern

cptshrk108 · Apr 24 '25 15:04

Hi, I'm also consistently facing this issue.

On deploy, this happens for a seemingly random set of jobs (2-4 out of ~50):

Error: cannot create job: An unexpected error occurred

  with databricks_job.XXX,
  on bundle.tf.json line 736, in resource.databricks_job.XXX:
 736:       },

On destroy, "Error: cannot delete job: An unexpected error occurred" is printed. It takes a second destroy to remove these jobs.

svatopluk-sperka · Apr 25 '25 08:04

Appreciate the additional reports of this issue. The team is working on a backend fix to address the underlying issue.

In the meantime, not using tags on your jobs should reduce the probability of this happening.

pietern · Apr 25 '25 08:04
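To apply the tags workaround, it helps to know which jobs actually set tags. The generated bundle.tf.json named in the errors above nests jobs under resource.databricks_job.<name>, so a small scan of that file can list them. This Python sketch assumes each job's tags appear as a "tags" map (as in the Terraform provider's databricks_job resource) and that you pass in the file's path yourself, since its location on disk is not given in this thread; jobs_with_tags is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def jobs_with_tags(tf_json: Dict[str, Any]) -> List[str]:
    """Return the sorted databricks_job resource names whose config sets a non-empty 'tags' map."""
    jobs = tf_json.get("resource", {}).get("databricks_job", {})
    return sorted(name for name, cfg in jobs.items() if cfg.get("tags"))


def jobs_with_tags_in_file(path: str) -> List[str]:
    """Load a bundle.tf.json file from `path` and scan it with jobs_with_tags."""
    with open(path, encoding="utf-8") as f:
        return jobs_with_tags(json.load(f))
```

According to the follow-up below, a backend fix has since been rolled out, so this workaround may no longer be needed.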

A fix for this issue has been rolled out.

Could you retry deploying/destroying your bundles and see if the issue no longer occurs? Thank you!

pietern · May 02 '25 08:05

Deployment now works without issues for me. Thanks @pietern!

svatopluk-sperka · May 02 '25 09:05

Thanks @pietern! The bug was intermittent, so I will trust the team and reopen a ticket if it comes up again :)

cptshrk108 · May 02 '25 13:05