
Add cli command to download offline deal data

Open dirkmc opened this issue 2 years ago • 8 comments

Background

To download the data for an offline deal, a Storage Provider typically:

  • downloads the data manually (eg using wget)
  • calls boostd import-data <deal uuid> <file path>

In practice, Storage Providers may want to take advantage of the Boost UI to manage the download. The advantages are:

  • Boost will keep track of how much space the download takes and prevent the SP from downloading more than will fit into their download staging area. For example, if the SP has a 500GB download staging area and existing downloads already account for 480GB of it, boost will prevent the SP from starting a download for a 32GB file.
  • Manage downloads in the Boost UI: Boost shows the progress of downloads, automatically resumes downloads when boost is restarted, and allows the SP to manually cancel a download.

Proposal

Add a boostd download-data command:

$ boostd download-data --help
NAME:
   boostd download-data - Download data for an offline deal made with Boost

USAGE:
   boostd download-data [command options] <proposal CID> or <deal UUID>

The command should output an error if downloading this data would exceed the available space in the staging area (including ongoing downloads). For example if

  • the staging area is 500GB
  • there are 400GB of completed downloads (whose deals have not been added to a sector)
  • there are in-progress downloads for deals whose total data size is 80GB
  • ie total "tagged" space in the download area is 400GB + 80GB = 480GB
  • the command should return an error if the user attempts to download a 32GB file
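
The check described in the list above can be sketched as follows. This is only an illustration of the arithmetic, not Boost's implementation: the function name `check_staging_space` and the use of decimal gigabytes are assumptions made for the example.

```python
GB = 10**9  # decimal gigabytes, chosen purely for illustration


def check_staging_space(staging_capacity, completed, in_progress, requested):
    """Return the space left after the download, or raise if it does not fit.

    All arguments are byte counts. "Tagged" space is the sum of completed
    downloads not yet added to a sector and in-progress downloads.
    """
    tagged = completed + in_progress
    available = staging_capacity - tagged
    if requested > available:
        raise ValueError(
            f"download of {requested} bytes exceeds available staging space "
            f"({available} of {staging_capacity} bytes free)"
        )
    return available - requested


# The example from the list: 400GB + 80GB tagged, 32GB requested.
try:
    check_staging_space(500 * GB, 400 * GB, 80 * GB, 32 * GB)
except ValueError as err:
    print(f"error: {err}")
```

With these numbers only 20GB remains untagged, so the 32GB request is rejected, while a request of 20GB or less would be accepted.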

Note that if the size of the data to download was not specified as part of the deal, boostd should do an HTTP HEAD request to get the size from the client. If it's not possible to get the size, boostd should download the data anyway.
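
A rough sketch of the size lookup, assuming the client serves the data over plain HTTP and reports a `Content-Length` header. The helper names `parse_content_length` and `remote_size` are hypothetical and are not part of boostd.

```python
import urllib.request


def parse_content_length(headers):
    """Return Content-Length as an int, or None if it is absent or invalid."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def remote_size(url, timeout=10):
    """Issue an HTTP HEAD request and return the reported size, or None."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_content_length(response.headers)
    except OSError:
        # URLError, HTTPError and timeouts are all OSError subclasses:
        # treat any of them as "size unknown".
        return None
```

A `None` result corresponds to the "not possible to get the size" case, in which the proposal says the download should go ahead without the space check.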

The command will process the deal in the same way as if it were an online deal:

  • download the data from the client
  • perform commp on the data
  • publish the deal
  • add the deal data to a sector
  • index and announce the deal

The advantage of using boostd download-data instead of doing an online deal is that the SP can control when to download the data (online deals start downloading immediately).

Related commands

SPs will likely want to use this command in scripts, so it would also be useful for us to document how to check the remaining space in the download staging area:

$ curl 'http://localhost:8080/graphql/query' \
  --data-raw '{"query":"query {  storage {    Staged    Transferred    Pending    Free    MountPoint  }}"}' | jq

{
  "data": {
    "storage": {
      "Staged": {
        "__typename": "BigInt",
        "n": "0"
      },
      "Transferred": {
        "__typename": "BigInt",
        "n": "25123893565"
      },
      "Pending": {
        "__typename": "BigInt",
        "n": "7966307422"
      },
      "Free": {
        "__typename": "BigInt",
        "n": "611154893413"
      },
      "MountPoint": "/home/nonsense/.boost/incoming"
    }
  }
}
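
For scripting, the response could be reduced to plain integers along these lines. This is a sketch that assumes the response shape shown above (BigInt fields carrying a decimal string in `n`). The helper name `storage_bytes` is hypothetical, and whether `Free` already accounts for `Pending` is not stated here, so the comparison at the end is illustrative only.

```python
import json


def storage_bytes(response_text):
    """Convert each BigInt storage field to an int; keep other fields as-is."""
    storage = json.loads(response_text)["data"]["storage"]
    result = {}
    for name, value in storage.items():
        if isinstance(value, dict) and "n" in value:
            result[name] = int(value["n"])
        else:
            result[name] = value
    return result


# Same values as the sample response above.
sample = """
{"data": {"storage": {
  "Staged": {"__typename": "BigInt", "n": "0"},
  "Transferred": {"__typename": "BigInt", "n": "25123893565"},
  "Pending": {"__typename": "BigInt", "n": "7966307422"},
  "Free": {"__typename": "BigInt", "n": "611154893413"},
  "MountPoint": "/home/nonsense/.boost/incoming"
}}}
"""

info = storage_bytes(sample)
can_fit_32gib = info["Free"] >= 32 * 2**30  # is there room for a 32GiB file?
```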

dirkmc, May 09 '23 06:05

Couple of questions:

  1. "automatically resumes downloads when boost is restarted" <-- why do we expect the download to stop when boost stops?
  2. "if the size of the data to download was not specified as part of the deal" <-- what exactly do you mean by size of the deal? The padded piece size will be part of the deal proposal, but the size of the car file won't be included in an offline deal. See https://github.com/ribasushi/spade/blob/master/internal/filtypes/types.go#L24-L35 .

anjor, May 09 '23 10:05

> why do we expect the download to stop when boost stops?

The boostd process is responsible for downloading deal data, so a download stops whenever boostd stops.

> what exactly do you mean by size of the deal

The data that is sent across the wire. Typically this is a car file.

> the size of the car file won't be included in an offline deal

It may not be, in which case boost should do an HTTP HEAD request to get the size (as noted in the description above).

dirkmc, May 09 '23 13:05

Yeah I was just flagging that afaik for offline deals there isn't a mechanism today that sends the size of the deal -- spade doesn't and boost doesn't either.

On a broader note, couple of other points: I am not quite sure if proceeding as an import deal after the data is downloaded is something the SPs would want. From what I understand the bottleneck is the publish storage deal batching and/or sealing capacity; especially when combined with the --start-epoch parameter setting a time limit.

The space budgeting + alerting on it is in my eyes a quality-of-life improvement. But tbh I reckon SPs already have monitoring/workarounds to handle quite a few of the issues there.

The main value add for me is to have this integrated in the boost UI so that the SP gets a unified view of the world. However, I don't think we should take away the control SPs have at each stage of the flow.

anjor, May 09 '23 13:05

@anjor can you expand a bit more on how this takes away control the SP has at the deal flow stage?

is it b/c in this proposal, after the download data command is run, the other steps happen automatically? and you're hearing that SPs want to separately handle each step?

brendalee, May 09 '23 18:05

quick clarification - my understanding is this doesn't directly help with the control over when to send PSD messages, it's more around helping SPs leverage built-in benefits in Boost when downloading data from clients, instead of each SP building bespoke tooling to handle it.

brendalee, May 09 '23 18:05

Correct. We can and should confirm this via an SP survey, but I suspect that having the "publish the deal" and "add the deal data to a sector" steps happen automatically is not something SPs would want; instead they would want control over them.

anjor, May 09 '23 21:05

The typical deal flow is:

  1. Accept Deal
  2. Download / import deal data
  3. Verify commp over deal data
  4. Publish deal
  5. Add deal data to sector
  6. Index and announce deal

Note that for the Publish deal stage, the Storage Provider has configuration options to control when to publish the deal:

  • The max time to wait after receiving a deal before publishing it automatically (default 1 hour)
  • The max number of deals to receive before publishing them automatically (default 8)

The Storage Provider can also click a button / run a command to publish any deals in the publish queue immediately.

If SPs want to, they can set these limits to a high number (eg limit 72 hours / limit 10,000 deals) and manually run the publish command when they see fit. So the SP already has full control of the publishing stage.
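
To make the two limits concrete, here is a minimal sketch of the trigger logic as described in this comment. It is an illustration only, not the actual boost or lotus publisher code: the function `should_publish` and its parameter names are invented for the example, and only the defaults (1 hour, 8 deals) come from the text above.

```python
def should_publish(queued_deals, oldest_wait_seconds,
                   max_deals=8, max_wait_seconds=3600):
    """Decide whether the publish queue should be flushed automatically.

    queued_deals: number of deals waiting in the publish queue.
    oldest_wait_seconds: how long the oldest queued deal has been waiting.
    """
    if queued_deals == 0:
        return False  # nothing to publish
    return queued_deals >= max_deals or oldest_wait_seconds >= max_wait_seconds


# With the defaults, a full batch or an hour-old deal triggers publishing.
full_batch = should_publish(8, 0)
timed_out = should_publish(3, 3600)

# With very high limits, publishing is effectively manual.
manual_only = should_publish(50, 7200, max_deals=10_000,
                             max_wait_seconds=72 * 3600)
```

Under the high limits in the last call nothing is published automatically, which is the situation in which the SP would run the publish command by hand.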

With respect to adding the deal data to a sector, lotus already provides back pressure: boost tells lotus that the deal is ready to be added to a sector and when there is free space in a sector, lotus adds the deal to a sector. If there is a reason that SPs want to wait to add the piece to a sector, even if the sector has free space, then we should give them this option (maybe there is a reason, I haven't heard of one).

dirkmc, May 10 '23 06:05

Ah, maybe I misunderstood "The command will process the deal in the same way as if it were an online deal" in the original post.

If the SP has control over when to publish deals and how to batch them, and over when to seal the data (within the start epoch param of course), then that's fine.

I still think it would be interesting to get some input from SPs directly to understand if they would use this feature.

anjor, May 10 '23 08:05