bacalhau Job Error Feedback and Node Capabilities

Open philwinder opened this issue 2 years ago • 1 comments

Not entirely sure how to structure this request; it may be more than one thing when we sketch it out.

There are several problems that are related:

Job specifications can be invalid in several ways: nonexistent data, invalid commands, bad resource requests, or requesting an A100 when none is available. When this happens the job gets stuck in a waiting state, with no way of knowing what's happening.
For GPU jobs specifically, there's a wide range of valid permutations in the request: GPU type, required CUDA memory, CUDA library version, etc. How can we develop the code in a scalable way that lets jobs select the right options?
If a node takes a GPU job but the container is old, it will fail because of a version mismatch. I think (untested) the job will error, but it would be nice to do this check BEFORE running the job.
Is it possible for jobs to check for validity on the network at submission time, rather than runtime?
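
To make the last point more concrete, here is a rough sketch of what a submission-time check could look like. Everything in it is an assumption for discussion: `JobSpec`, `NodeCapabilities`, `ValidateSpec`, and `CheckPlacement` are hypothetical names, not existing Bacalhau types or APIs, and the set of GPU fields is only a small sample of the permutations mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// JobSpec is a hypothetical, simplified job request (not Bacalhau's real type).
type JobSpec struct {
	Command     []string
	GPUs        int
	GPUModel    string // empty means "any model"
	GPUMemoryGB int
}

// NodeCapabilities is a hypothetical summary of what a node advertises.
type NodeCapabilities struct {
	ID          string
	GPUs        int
	GPUModel    string
	GPUMemoryGB int
}

// ValidateSpec catches errors that need no knowledge of the network.
func ValidateSpec(job JobSpec) error {
	if len(job.Command) == 0 {
		return errors.New("job has no command")
	}
	if job.GPUs < 0 || job.GPUMemoryGB < 0 {
		return errors.New("negative resource request")
	}
	return nil
}

// rejectReason returns "" if the node can run the job, otherwise why it cannot.
func rejectReason(job JobSpec, node NodeCapabilities) string {
	switch {
	case node.GPUs < job.GPUs:
		return fmt.Sprintf("has %d GPUs, job needs %d", node.GPUs, job.GPUs)
	case job.GPUModel != "" && !strings.EqualFold(job.GPUModel, node.GPUModel):
		return fmt.Sprintf("has GPU model %q, job needs %q", node.GPUModel, job.GPUModel)
	case node.GPUMemoryGB < job.GPUMemoryGB:
		return fmt.Sprintf("has %d GB GPU memory, job needs %d", node.GPUMemoryGB, job.GPUMemoryGB)
	}
	return ""
}

// CheckPlacement returns the IDs of nodes that could run the job. If there are
// none, the error lists the reason each node was rejected, so the submitter
// gets feedback instead of a job that waits forever.
func CheckPlacement(job JobSpec, nodes []NodeCapabilities) ([]string, error) {
	if err := ValidateSpec(job); err != nil {
		return nil, fmt.Errorf("invalid job spec: %w", err)
	}
	var eligible, reasons []string
	for _, n := range nodes {
		if r := rejectReason(job, n); r != "" {
			reasons = append(reasons, n.ID+": "+r)
			continue
		}
		eligible = append(eligible, n.ID)
	}
	if len(eligible) == 0 {
		return nil, fmt.Errorf("no node can run this job: %s", strings.Join(reasons, "; "))
	}
	return eligible, nil
}

func main() {
	nodes := []NodeCapabilities{
		{ID: "node-a", GPUs: 1, GPUModel: "T4", GPUMemoryGB: 16},
		{ID: "node-b"}, // CPU-only node
	}
	job := JobSpec{Command: []string{"nvidia-smi"}, GPUs: 1, GPUModel: "A100", GPUMemoryGB: 40}
	if _, err := CheckPlacement(job, nodes); err != nil {
		fmt.Println("rejected at submission:", err)
	}
}
```

The idea is that the per-node rejection reasons double as the error feedback: the same check that filters nodes also explains to the user why nothing matched. A CUDA or driver version comparison could be added as another case in `rejectReason`, assuming nodes can report that information.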

We need to have a planning session about this to figure out how we can split up and solve some of these issues.

Jul 22 '22 08:07 philwinder

relates to #404

Aug 23 '22 15:08 lukemarsden

related to https://github.com/filecoin-project/bacalhau/issues/342

Nov 10 '22 22:11 wdbaruni

Most of these issues are now solved by up-front node ranking.

Nov 14 '23 21:11 simonwo