
download timeouts shouldn't time out if they're making progress

lukemarsden opened this issue

if bacalhau get is making progress with a download, it should never time out

right now it times out after 10 seconds (!) which I am about to change to 600

lukemarsden avatar Aug 16 '22 13:08 lukemarsden

same for bacalhau docker run --download

lukemarsden avatar Aug 16 '22 13:08 lukemarsden

@simonwo thinks he's set them all to 5 minutes

lukemarsden avatar Oct 17 '22 09:10 lukemarsden

@philwinder points out that there's still some cruft in the CLI flags - --download-timeout-secs - we should delete this if it's not used

lukemarsden avatar Oct 17 '22 09:10 lukemarsden

check how big the file is as we go - bonus points for showing the progress to the user, even just printing dots like wget!

lukemarsden avatar Oct 17 '22 09:10 lukemarsden

@wdbaruni does your timeout work touch this at all?

aronchick avatar Nov 21 '22 04:11 aronchick

could we spin up a goroutine alongside the IPFS download that checks the size of the download target each loop and, if it's grown since last time, doesn't cancel the context?

Adding a fixed-length timeout is going to suffer from the "how long is a piece of string" problem

This might not work, however - it depends on how IPFS handles the download, i.e. whether it writes bytes into the target location as it downloads, or downloads somewhere else and moves the file into place when complete

binocarlos avatar Feb 07 '23 10:02 binocarlos

The current problem is that the context in which the content is being downloaded already has a timeout applied to it - a good example of this is the Docker executor downloading the inputs within the RunShard call. Even with a context aware of how much was being downloaded, the context handed to the executor would still have timed out by the time the content was downloaded.

To avoid timing out, we'd need to:

  • Hand inputs to the executor rather than have the executor fetch them
  • Exclude storage provider work from the timeout passed in by the user
  • Have some way of creating a context that cancels itself if IPFS isn't downloading anything. Note that this could potentially fall foul of a slow network - someone sending one packet every minute keeps the download "progressing".

wjam avatar Feb 07 '23 11:02 wjam

likely cause of #1863

lukemarsden avatar Feb 07 '23 11:02 lukemarsden

Notes from investigating the code base:

  • Difficulty in supporting WASM being able to download various modules - writing the API would be hard but not insurmountable. Most of the benefit would come from the common parts anyway.
  • The job timeout is currently used as a hard deadline for when a shard gets cancelled. Set at https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/requester/shard_fsm.go#L426, read at https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/requester/shard_fsm.go#L151-L163
  • The job timeout starts as soon as the executor starts doing work - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/compute/executor_buffer.go#L128
  • Executor will count all jobs that the executor has dequeued against its available resources, even if the compute part hasn't started yet - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/compute/executor_buffer.go#L176
  • The IPFS storage has a custom timeout applied - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/storage/ipfs/storage.go#L73 - but this is never set, so falls back to the default of 5 minutes - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/config/config.go#L60
  • There's a comment in IPFS storage - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/storage/ipfs/storage.go#L155-L157 - that states that if the path already exists, then IPFS has successfully downloaded it. This is not true, as files get created as soon as files.WriteTo is called - https://github.com/filecoin-project/bacalhau/blob/19c0973e28cdd80513d8e9eeebbe97d4431151f9/pkg/ipfs/client.go#L165
  • The node retrieved by API.Unixfs().Get could be a File or Directory - a File implements io.Reader, so we could use an io.TeeReader to peek at the progress - https://go.dev/play/p/N6xL8_fnV2 - but a Directory would be more complicated. We would need to write a context.Context that could have its cancellation time extended after it was created. We may have to reimplement the WriteTo so that we can peek at the written files - https://github.com/ipfs/go-ipfs-files/blob/v0.2.0/filewriter.go

wjam avatar Feb 07 '23 14:02 wjam

Please investigate if still relevant

aronchick avatar Jan 05 '24 16:01 aronchick

A timeout is a timeout, even if progress is being made downloading results. Most of these issues are related to bad IPFS configuration. As IPFS is no longer the default publisher, and as we are moving away from embedding IPFS within Bacalhau to relying on --ipfs-connect, this shouldn't be a problem

Closing

wdbaruni avatar Apr 15 '24 09:04 wdbaruni