Job Error Feedback and Node Capabilities
Not entirely sure how to structure this request; it may be more than one thing when we sketch it out.
There are several problems that are related:
- Job specifications can be invalid: nonexistent data, invalid commands, bad resource requests, or a request for an A100 when none exists. When this happens, the job gets stuck in waiting, with no way of knowing what went wrong.
- For GPU jobs specifically, there are many valid permutations of the request, such as GPU type, required CUDA memory, and CUDA library version. How can we develop the code in a scalable way that lets jobs select the right options?
- If a node takes a GPU job but the container is old, the job will fail because of a version mismatch. I think (untested) the job will error, but it would be nice to do this check before running the job.
- Is it possible for jobs to check for validity on the network at submission time, rather than runtime?
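To make the discussion concrete, here is a rough sketch of what a submission-time check could look like. Everything in it is hypothetical: `NodeCapabilities`, `JobRequirements`, `Check`, and `ValidateAtSubmission` are names invented for illustration and are not existing bacalhau types or APIs. It assumes each node advertises its GPU models, GPU memory, and CUDA version, and that the submitter can see those advertisements.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// NodeCapabilities is a hypothetical description of what a node advertises.
type NodeCapabilities struct {
	GPUModels   map[string]int // GPU model name -> number installed
	GPUMemoryMB int            // memory per GPU
	CUDAVersion [2]int         // {major, minor}
}

// JobRequirements is a hypothetical description of what a job asks for.
type JobRequirements struct {
	GPUModel       string // empty means "any model"
	GPUCount       int    // zero means "no GPU needed"
	MinGPUMemoryMB int
	MinCUDAVersion [2]int // {major, minor}
}

// versionLess reports whether version a is older than version b.
func versionLess(a, b [2]int) bool {
	if a[0] != b[0] {
		return a[0] < b[0]
	}
	return a[1] < b[1]
}

// Check returns the reasons a node cannot run a job; an empty result means it can.
func Check(node NodeCapabilities, job JobRequirements) []string {
	if job.GPUCount == 0 {
		return nil // no GPU constraints to verify
	}
	var reasons []string
	available := 0
	if job.GPUModel != "" {
		available = node.GPUModels[job.GPUModel]
	} else {
		for _, n := range node.GPUModels {
			available += n
		}
	}
	if available < job.GPUCount {
		reasons = append(reasons, fmt.Sprintf("wants %d x %q GPU, node has %d",
			job.GPUCount, job.GPUModel, available))
	}
	if node.GPUMemoryMB < job.MinGPUMemoryMB {
		reasons = append(reasons, fmt.Sprintf("wants %d MB GPU memory, node has %d",
			job.MinGPUMemoryMB, node.GPUMemoryMB))
	}
	if versionLess(node.CUDAVersion, job.MinCUDAVersion) {
		reasons = append(reasons, fmt.Sprintf("wants CUDA >= %d.%d, node has %d.%d",
			job.MinCUDAVersion[0], job.MinCUDAVersion[1],
			node.CUDAVersion[0], node.CUDAVersion[1]))
	}
	return reasons
}

// ValidateAtSubmission returns nil if at least one node can run the job.
// Otherwise it returns an error listing why each node rejected it, so the
// submitter gets feedback instead of a job that waits forever.
func ValidateAtSubmission(nodes map[string]NodeCapabilities, job JobRequirements) error {
	if len(nodes) == 0 {
		return errors.New("no nodes available")
	}
	var rejections []string
	for id, node := range nodes {
		reasons := Check(node, job)
		if len(reasons) == 0 {
			return nil
		}
		rejections = append(rejections, id+": "+strings.Join(reasons, "; "))
	}
	sort.Strings(rejections) // map iteration order is random; keep output stable
	return fmt.Errorf("no node can run this job: %s", strings.Join(rejections, " | "))
}
```

With something like this, asking for an A100 on a network that only has V100s would fail immediately with a per-node explanation rather than leaving the job in waiting. It does not cover the other failure modes above (missing data, invalid commands), which would need their own checks.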
We need to have a planning session about this to figure out how we can split up and solve some of these issues.
relates to #404
related to https://github.com/filecoin-project/bacalhau/issues/342
Most of these issues are now solved by up-front node ranking.