
make bacalhau resilient to flaky networks

Open lukemarsden opened this issue 2 years ago • 2 comments

there are lots of ways that job execution can fail if messages are dropped. make it more resilient!

lukemarsden avatar Aug 19 '22 09:08 lukemarsden

related: https://github.com/filecoin-project/bacalhau/issues/487

lukemarsden avatar Aug 19 '22 09:08 lukemarsden

also includes: https://github.com/filecoin-project/bacalhau/issues/320

lukemarsden avatar Aug 23 '22 15:08 lukemarsden

Timeouts were introduced by https://github.com/filecoin-project/bacalhau/pull/1061, which allows failing the job early when messages are dropped or when nodes disappear, instead of leaving jobs stuck with no progress.

Note that the requester node will fail the job instead of retrying or asking for more bids if the selected compute nodes are no longer responsive. Retrying will require more significant changes that we can revisit in the future if there is demand for it.

wdbaruni avatar Nov 10 '22 22:11 wdbaruni