levant Levant deployment doesn't fail when some allocations failed to place

Description

When I run levant deploy and it produces this kind of error:

2018-12-06T10:08:27+01:00 |ERRO| levant/deploy: task group backend failed to place allocs, failed on [alpha bravo charlie] and exhausted [memory cpu] job_id=iakov-test
2018-12-06T10:08:27+01:00 |ERRO| levant/deploy: task group dep failed to place allocs, failed on [charlie alpha bravo] and exhausted [network: reserved port collision cpu] job_id=iakov-test
2018-12-06T10:08:27+01:00 |INFO| levant/deploy: beginning deployment watcher for job job_id=iakov-test

Levant doesn't exit immediately, but carries on trying to deploy the job. I don't think this is the desired behaviour. In my case, my backend and dependencies were never started; only the front-end task is running. Levant fails after 10 minutes (healthy_deadline) instead of failing immediately.

Dec 06 '18 09:12 iakovmarkov

@iakovmarkov I suspect that levant is just reporting back from nomad on this, but will wait for jrasell to weigh in.

Dec 06 '18 12:12 stevenscg

Yep, that is exactly what happens. I'm just a bit puzzled why levant doesn't exit on failure.

I'm about to write a custom Python script to work around this issue, but I'd rather see this behaviour in levant. Maybe hidden behind a flag for backwards compatibility, something like --exit-on-failure. I'm happy to implement it if @jrasell is okay with it.
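
A rough sketch of what such a workaround script could look like, assuming it asks Nomad directly rather than Levant. It reads the job's evaluations from Nomad's HTTP API (`/v1/job/<job_id>/evaluations`) and treats any non-empty `FailedTGAllocs` entry as a placement failure. The function names are made up for illustration, and the script does not single out the evaluation created by the current deploy, which a real script would probably want to do.

```python
import json
import urllib.request


def fetch_evaluations(nomad_addr, job_id):
    """Return the decoded evaluation list for a job from Nomad's HTTP API."""
    url = f"{nomad_addr}/v1/job/{job_id}/evaluations"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def placement_failures(evaluations):
    """Map each task group that failed to place to its exhausted dimensions."""
    failures = {}
    for ev in evaluations:
        # FailedTGAllocs is null or absent when every allocation was placed.
        for group, metrics in (ev.get("FailedTGAllocs") or {}).items():
            exhausted = (metrics or {}).get("DimensionExhausted") or {}
            failures.setdefault(group, set()).update(exhausted)
    return {group: sorted(dims) for group, dims in failures.items()}


def exit_code(evaluations):
    """Return 1 if any task group failed to place, else 0."""
    return 1 if placement_failures(evaluations) else 0
```

In CI this could run right after the job is submitted, for example `sys.exit(exit_code(fetch_evaluations("http://localhost:4646", "iakov-test")))`, where the address is an assumption about where the Nomad agent listens.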

Dec 06 '18 13:12 iakovmarkov

Hey @iakovmarkov, when running this deployment does Nomad trigger a deployment within its own system? It is possible for a single task group to fail, but others within a job to succeed, so in this situation you would want to follow the deployment until the end.

Dec 17 '18 11:12 jrasell

Yes, a Nomad deployment is triggered.

In our Nomad jobs all the tasks are basically mandatory and are interconnected with each other - temporary DB, API and front-end app. If one of those fails to deploy, the whole job is broken. I imagine that's the usual use-case for Nomad jobs.

I understand that some task groups may succeed, but I don't really understand why Levant waits for the rest of the deployment. It seems counter-intuitive: if a job failed to deploy something, I expect the CI job to fail immediately, and not run for 10 minutes trying to deploy the rest of it. It will still fail in the end with a message like |ERRO| levant/deploy: deployment xxx has status failed job_id=iakov-test.
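
To illustrate the fail-fast behaviour I have in mind, here is a minimal sketch of a CI wrapper that watches Levant's output and stops as soon as a "failed to place allocs" error line appears. It assumes the log format shown in the description; `is_placement_failure` and `run_fail_fast` are hypothetical names. Note that stopping Levant only fails the CI job sooner; it does not cancel the deployment inside Nomad.

```python
import subprocess
import sys

PLACEMENT_FAILURE = "failed to place allocs"


def is_placement_failure(line):
    """True for Levant error lines that report a placement failure."""
    return "|ERRO|" in line and PLACEMENT_FAILURE in line


def run_fail_fast(cmd):
    """Run cmd, echo its output, and stop it on the first placement failure.

    Returns 1 if a placement failure was seen, else the command's exit code.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        sys.stdout.write(line)
        if is_placement_failure(line):
            # Stop watching; the Nomad deployment itself keeps running.
            proc.terminate()
            proc.wait()
            return 1
    return proc.wait()
```

A CI step could then call something like `sys.exit(run_fail_fast(["levant", "deploy", "job.nomad"]))`, where `job.nomad` stands in for the real template file.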

Dec 17 '18 11:12 iakovmarkov