web3.storage
Give up trying to pin a CID after a given time threshold
If you send a pinning request for a CID, but the CID doesn't exist or the node providing it is offline, the pin stays stuck in "queued" status. We should abandon the operation after a given time threshold has passed.
Update
This can be taken care of by the Pinning API in Elastic Provider, when it takes over from Cluster.
Impact
- (Infra) Decrease load on Cluster, which translates to a decreased use of resources
- (Biz) Reduce chances of an overwhelmed cluster in the near future
- (User) If we land this, hanging requests are cleaned up automatically, which means less housekeeping for the user.
Acceptance Criteria
- [ ] After a given time threshold `giveUpThreshold`, the Cluster should stop trying to get and pin a given CID, if there are no more recent `PinningRequests` for the same CID or Uploads.
- [ ] `PinningRequests` that were created before `giveUpThreshold` should report a `failed` status if there are no more recent `PinningRequests`.
- [ ] `PinningRequests` that were created after `giveUpThreshold` should report their effective status, based on cluster state.
- [ ] Ability to clean existing Pinning Requests.
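
One way to read the status-related criteria above is sketched below. This is only an illustration, not the actual implementation: `PinStatus`, `PinningRequest`, `isExpired` and `reportedStatus` are hypothetical names, Uploads are not modelled, and it assumes a CID that is already pinned is never downgraded to `failed`.

```typescript
// Hypothetical types: none of these names come from the web3.storage codebase.
type PinStatus = 'queued' | 'pinning' | 'pinned' | 'failed'

interface PinningRequest {
  cid: string
  createdAt: Date
}

// A request is "expired" once it is at least giveUpThresholdMs old.
function isExpired (request: PinningRequest, now: Date, giveUpThresholdMs: number): boolean {
  return now.getTime() - request.createdAt.getTime() >= giveUpThresholdMs
}

// Status to report for a request, given what the cluster says about its CID.
function reportedStatus (
  request: PinningRequest,
  requestsForSameCid: PinningRequest[],
  clusterStatus: PinStatus,
  now: Date,
  giveUpThresholdMs: number
): PinStatus {
  // Assumption: content that is already pinned stays pinned.
  if (clusterStatus === 'pinned') return 'pinned'

  // Requests younger than the threshold report the effective cluster status.
  if (!isExpired(request, now, giveUpThresholdMs)) return clusterStatus

  // An expired request is kept alive only by a more recent, unexpired
  // request for the same CID.
  const hasMoreRecent = requestsForSameCid.some(
    (other) => other.cid === request.cid && !isExpired(other, now, giveUpThresholdMs)
  )
  return hasMoreRecent ? clusterStatus : 'failed'
}
```

With a 1 day threshold, a 2 day old request for `CID_A` would report `failed` on its own, but keeps reporting the cluster status while a 12 hour old request for the same CID exists.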
Notes.
- What happens if there's a pinning request for `CID_A`, which is "expired", but a chunked upload for the same `CID_A` exists? In this case, we might have 2 scenarios:
  - A chunk upload is in progress
  - A chunk upload has failed in practice
- Consider removing nonexistent CIDs from the content table.
- The suggested value for `giveUpThreshold` is 1 day. It could be even smaller, so let's parametrise it for easy updating.
- At the moment cluster could report transient failed states. I wonder whether those should be reflected in the PSA statuses at all? We should consider never sending a failed status until the threshold is reached.
To be discussed with @alanshaw
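
The last two notes (a parametrised threshold, and hiding transient failures until the threshold is reached) could look roughly like the sketch below. The names `parseGiveUpThreshold` and `maskTransientFailure` are hypothetical. The 1 day default is the value suggested above. Reporting a masked failure as `queued` rather than `pinning` is an assumption that is still open for discussion.

```typescript
// The four status values defined by the IPFS Pinning Service API.
type PsaStatus = 'queued' | 'pinning' | 'pinned' | 'failed'

// Suggested default from the notes: 1 day, in milliseconds.
const DEFAULT_GIVE_UP_THRESHOLD_MS = 24 * 60 * 60 * 1000

// Parse the threshold from a raw config value (for example an env var),
// falling back to the default when it is not set.
function parseGiveUpThreshold (raw: string | undefined): number {
  if (raw === undefined || raw.trim() === '') return DEFAULT_GIVE_UP_THRESHOLD_MS
  const value = Number(raw)
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`Invalid giveUpThreshold: ${raw}`)
  }
  return value
}

// Hide a cluster-side failure from the user until the request is old
// enough for the failure to be considered permanent.
function maskTransientFailure (
  clusterStatus: PsaStatus,
  requestAgeMs: number,
  giveUpThresholdMs: number
): PsaStatus {
  // Assumption: a masked failure is shown as 'queued'.
  if (clusterStatus === 'failed' && requestAgeMs < giveUpThresholdMs) return 'queued'
  return clusterStatus
}
```

Keeping the threshold as a single parsed value means it can be lowered later without touching the status logic.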
Discussed with @alanshaw @flea89 @francois-potato.
Everything that cannot be pinned is added to a separate queue that keeps growing. In the meantime, cluster keeps trying to pin it. This is not an immediate concern, but in the future cluster might fall over if the queue grows too much.
We need to find a way for these CIDs to be dropped from cluster.
We need to define the threshold (i.e. after how long, not how many times tried). We also need to find a way to surface this information to the user - a sort of perma-failed status.