bacalhau
make bacalhau resilient to flaky networks
There are many ways job execution can fail if messages are dropped. Make it more resilient!
related: https://github.com/filecoin-project/bacalhau/issues/487
also includes: https://github.com/filecoin-project/bacalhau/issues/320
Timeouts were introduced by https://github.com/filecoin-project/bacalhau/pull/1061. They let the requester fail a job early when messages are dropped or nodes disappear, instead of leaving the job stuck with no progress.
Note that if the selected compute nodes stop responding, the requester node fails the job rather than retrying or asking for more bids. Retrying would require more significant changes, which we can revisit in the future if there is demand for it.