download timeouts shouldn't time out if they're making progress
if `bacalhau get` is making progress with a download, it should never time out
right now it times out after 10 seconds (!) which I am about to change to 600
same for `bacalhau docker run --download`
@simonwo thinks he's set them all to 5 minutes
@philwinder points out that there's still some cruft in the CLI flags: `--download-timeout-secs`
we should delete this if it's not used
check how big the file is as we go, bonus points for showing the progress to the user! even `.`s or whatever, like wget
@wdbaruni does your timeout work touch this at all?
could we spin up a go-routine alongside the IPFS download that checks the size of the download target each loop and if it's grown since last time not cancel the context?
Adding a fixed-length timeout is gonna suffer from the "how long is a piece of string" problem
This might not work however - it depends on how IPFS is handling the download - i.e. whether it's writing bytes into the target location as it downloads, or downloading somewhere else and moving the file into place when complete
The current problem is that the context the content is being downloaded in already has a timeout applied to it - a good example of this is the Docker executor downloading the inputs within the `RunShard` call. Even if there was a context aware of how much was being downloaded, the context handed to the executor would still have timed out by the time the content was downloaded.
To avoid timing out, we'd need to:
- Have inputs handed to the executor rather than being something the executor fetches
- Exclude storage provider work from the timeout passed in by the user
- Have some way of having a context that will cancel itself if IPFS isn't downloading something. Note that this could potentially fall foul of a slow network: someone sends 1 packet every minute to keep the download progressing.
likely cause of #1863
Notes from investigating the code base:
- Difficulty in supporting WASM being able to download various modules - writing the API would be hard but not insurmountable. Most of the benefit would come from the common parts anyway.
- The job timeout is currently used as a hard deadline for when a shard gets cancelled. Set at https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/requester/shard_fsm.go#L426, read at https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/requester/shard_fsm.go#L151-L163
- The job timeout starts as soon as the executor starts doing work - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/compute/executor_buffer.go#L128
- The executor will count all jobs that it has dequeued against its available resources, even if the compute part hasn't started yet - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/compute/executor_buffer.go#L176
- The IPFS storage has a custom timeout applied - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/storage/ipfs/storage.go#L73 - but this is never set, so it falls back to the default of 5 minutes - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/config/config.go#L60
- There's a comment in IPFS storage - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/storage/ipfs/storage.go#L155-L157 - that states that if the path already exists, then IPFS has successfully downloaded it. This is not true, as files get created as soon as `files.WriteTo` is called - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/ipfs/client.go#L165
- The node retrieved by `API.Unixfs().Get` could be a `File` or a `Directory`. A `File` implements `io.Reader`, so we could use an `io.TeeReader` to peek at the progress - https://go.dev/play/p/N6xL8_fnV2 - but a `Directory` would be more complicated. We would need to write a `context.Context` that could have its cancellation time extended after it was created. We may have to reimplement `WriteTo` so that we can peek at the written files - https://github.com/ipfs/go-ipfs-files/blob/v0.2.0/filewriter.go
Please investigate if still relevant
A timeout is a timeout, even if progress is being made in downloading results. Most of these issues are related to bad IPFS configuration. As IPFS is no longer the default publisher, and as we are moving away from embedding IPFS within bacalhau to relying on `--ipfs-connect`, this shouldn't be a problem.
Closing