Pinning a large CID gets "stuck" at the same node
Version information:
```
go-ipfs version: 0.8.0
Repo version: 11
System version: arm64/linux
Golang version: go1.15.8
```
I have 22 GiB+ of MFS data on a node at Vultr. The latest MFS CID is pointed to via /ipns/hpkg.haiku-os.org.
Description:
- I pin `/ipns/hpkg.haiku-os.org` on my home IPFS node (either of the below, same result):
  - `ipfs pin add --progress /ipns/hpkg.haiku-os.org`
  - `ipfs pin add --progress /ipfs/QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37`
- It downloads 20 GiB of data.
- The pin gets stuck on "Fetched/Processed 95036 nodes" for something like 2 days.
- Restart ipfs. The 20 GiB is deleted (`--enable-gc`, ugh).
- I pin `/ipns/hpkg.haiku-os.org` again.
- 24 hours later, there is 22 GiB of data in the repo.
- The pin is once again stuck on "Fetched/Processed 95036 nodes", for ~12 hours now.
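For reference, a few checks that can help show what a stuck pin is still waiting on (a sketch added for illustration, not part of the original report; these are standard go-ipfs commands):

```sh
# CIDs this node is still asking the network for
ipfs bitswap wantlist

# Overall bitswap counters: blocks/data sent and received, duplicates, partners
ipfs bitswap stat

# Whether the recursive pin has actually been recorded yet
ipfs pin ls --type=recursive
```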
sysstat from home node (while stuck pinning):

```
# sar 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local)  05/02/21  _aarch64_  (4 CPU)

13:06:38  CPU  %user  %nice  %system  %iowait  %steal  %idle
13:06:43  all   0.10  12.60     4.63     0.15    0.00  82.51
Average:  all   0.10  12.60     4.63     0.15    0.00  82.51

root@ipfsgw:~# sar -d 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local)  05/02/21  _aarch64_  (4 CPU)

13:05:56  DEV        tps  rkB/s  wkB/s  areq-sz  aqu-sz  await  svctm  %util
13:06:01  dev179-0  0.20   0.80   0.00     4.00    0.00   1.00   4.00   0.08
13:06:01  dev8-0    0.00   0.00   0.00     0.00    0.00   0.00   0.00   0.00
13:06:01  dev8-16   0.00   0.00   0.00     0.00    0.00   0.00   0.00   0.00
```
Debug from home node (while stuck pinning):

- `curl localhost:5001/debug/pprof/goroutine?debug=2 > ipfs.stacks`
- `curl localhost:5001/debug/pprof/profile > ipfs.cpuprof`
- `curl localhost:5001/debug/pprof/heap > ipfs.heap`
- `curl localhost:5001/debug/vars > ipfs.vars`
- `ipfs diag sys > ipfs.sysinfo`
@kallisti5 could you grab a little more debugging info:

- Vultr node
  - the debug info
  - `ipfs config show`
  - run `ipfs dag stat --offline QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT` to make sure it completes ok
    - If not, then it means the data isn't on the node you think it's on
  - reboot the node and see if the problem resolves on the client end
  - To avoid any debugging confusion around data discovery from the client, try doing `ipfs swarm connect /full/address/of/vultr/node` on the client node after the reboot
- home node
  - `ipfs config show`
  - `ipfs bitswap stat` should show some of the blocks you are waiting on; if those blocks are part of the `QmSS...` graph it could be easier to determine what's going on
This might be a duplicate of #7972, which appears to be fixed in master of go-bitswap (the dependency update isn't yet included in go-ipfs master).
Home node (wants):

```
ipfs@ipfsgw:~$ ipfs bitswap stat
bitswap status
    provides buffer: 0 / 256
    blocks received: 117310
    blocks sent: 12037
    data received: 23934805330
    data sent: 2561571207
    dup blocks received: 4393
    dup data received: 691026392
    wantlist [1 keys]
        Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
    partners [763]
```
Vultr node (has):

```
ipfs bitswap stat
bitswap status
    provides buffer: 0 / 256
    blocks received: 0
    blocks sent: 99009
    data received: 0
    data sent: 22941823079
    dup blocks received: 0
    dup data received: 0
    wantlist [0 keys]
    partners [785]
```
Vultr node (has), putting the wantlist entry from my home node into `dag stat` on the Vultr node:

```
# ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
Size: 0, NumBlocks: 1
```
Vultr node (has):

```
root@ipfs:~# ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 5111615603, NumBlocks: 22289
```

(still running; I'll update here if it fails)
I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. Running `ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD` on the Vultr node would be interesting.

I noticed you updated the CID you were trying to pin in your comment. Since you have garbage collection enabled, it's possible that Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD has since been cleaned up on the Vultr node, so if the home node is still looking for it, it's not going to find it.
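If that is what happened, one way for the home node to get unstuck (a sketch, not from the original comment) is to re-resolve the IPNS name and restart the pin against the current root rather than the stale CID:

```sh
# See which root CID the publisher currently points the name at
ipfs name resolve /ipns/hpkg.haiku-os.org

# Pinning the IPNS path re-resolves it first, so this targets the fresh root
ipfs pin add --progress /ipns/hpkg.haiku-os.org
```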
> I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. Running `ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD` on the Vultr node would be interesting.
Ooh.. I think this might be the root cause. I posted `ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD` above from the Vultr node, showing it did indeed have that block, though oddly empty (`Size: 0, NumBlocks: 1`).
Check this out (Vultr node):

```
root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}
```
Here's the completed `dag stat` for the whole thing on the Vultr node:

```
ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 22827570312, NumBlocks: 95529
```
So... locally it thinks it has that block, but remotely it reports not having it?
> `root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD`
> `{"data":null,"links":[]}`
This seems a bit weird to me, since I'd expect an empty DagPb node to have the hash QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n (per the DagPb spec). I wonder if there is some datastore corruption (which would cause your node to send back data, but my node not to recognize it as having the CID I was expecting).

Could you run `ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car`, then `ipfs add` the CAR file and post the CID here so I can take a look? Note: what I'm asking you to do is basically export the DAG underneath that CID (which is just a single node in this case) to a file, so we can investigate it even if it has some corruption that is preventing it from being sent over the network.

It may take a while, but it'd be helpful to run `ipfs repo verify` to check for repo corruption issues.
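A rough, untested way to check the corruption hypothesis locally on the Vultr node (my sketch; `block.bin` is just a scratch file name, and it assumes the go-ipfs 0.8 default for `ipfs block put`, which produces a dag-pb CIDv0 like `Qme5ub...`):

```sh
# Dump the raw bytes the datastore returns for the suspect CID
ipfs block get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > block.bin

# Re-hash those bytes; if the printed CID differs from Qme5ub..., the stored
# bytes no longer match the CID the datastore claims to hold.
ipfs block put block.bin
```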
```
# ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car
 0s  92 B / ?  [...]  3.22 KiB/s 0s
# ipfs add empty.car
added QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc empty.car
```
Running an `ipfs repo verify` now.
```
❯ ipfs dag import .\QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc
Pinned root    Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD    FAILED: blockstore: block not found
Error: unable to pin all roots: 1 out of 1 failed
```
Ya, something weird is going on. I'm guessing a corrupted block. The CAR file contains a single block which is the empty block with the CID QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n.
Yup.
```
# ipfs repo verify
block QmTRQhKWeLJUDvLqxDZmBkLwJSSy7B3Er9ytAhQcDiG1Lf was corrupt (block in storage has different hash than requested)
block QmZ7rUYoMm4cpnxCHS9KBJeuUPmcyZsg4Qgi3EAjUq3iTM was corrupt (block in storage has different hash than requested)
block Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD was corrupt (block in storage has different hash than requested)
```
How do you even fix that when it's MFS? Is there a way to identify which files are affected and re-upload them? I don't really understand how they got corrupted; the node is brand new and I just added the files without any real issues.
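One rough way to map a corrupt block back to MFS paths (my addition for illustration, not from the thread; it is slow on large trees because it re-walks the DAG for every entry):

```sh
#!/bin/bash
# For each entry under the MFS root, check whether its DAG references a
# corrupt CID reported by `ipfs repo verify`; print the MFS path if it does.
BAD="Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD"

walk() {
  local dir="$1" name path hash
  ipfs files ls "$dir" | while read -r name; do
    path="${dir%/}/$name"
    hash=$(ipfs files stat --hash "$path")
    if ipfs refs -r --offline "$hash" 2>/dev/null | grep -q "$BAD"; then
      echo "affected: $path"
    fi
    # Recurse into directories; `ipfs files stat` reports the entry type
    if ipfs files stat "$path" | grep -q '^Type: directory$'; then
      walk "$path"
    fi
  done
}

walk /
```

Any file found this way could then be re-added from the original source and copied back into MFS to replace the corrupt copy.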
To wrap this one up: I ended up just deleting the repo and re-adding the data using flatfs instead of badger. Things are smooth now.

Since this is the "seed" node for the MFS data, repo stability is important. It sounds like badger is better for user nodes, while flatfs is better for "critical data sources".

However... this case might be useful for some enhancement? I had CIDs that nothing saw as invalid (and the MFS filesystem was fully available) until I did a full repo verification. This seems like a potentially common issue.
I have had random stuck events, but a retry after the interruption passed. I am using go-http-api-client. I was stuck for a whole day while pinning a CID.

Is this the same problem? @kallisti5
@godcong

> I have had random stuck events, but a retry after the interruption passed.

There's a mix of issues here:

- Repo corruption isn't obvious, and IPFS doesn't "notice" it until you run `ipfs repo verify` (which is slow).
- When pinning large data sets, if IPFS can't find a 'valid' copy of the data it just hangs, 'waiting' for it.
- Pinning progress isn't really clear (https://github.com/ipfs/go-ipfs/issues/8112).
- Enabled garbage collection plus a low GC threshold means that if you 'interrupt' a pin of data larger than your GC threshold, the node will immediately garbage collect your progress (a config sketch follows this comment).
  - https://github.com/ipfs/go-ipfs/issues/3121 seems to be the "fix" for this one?
  - An alternative "nice feature" would be to wait XX hours before garbage collecting any new data chunks (giving users a chance at "resuming").
These are the "biggest" issues we encountered trying to use IPFS at scale. They represent some really frustrating user experiences.
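For the garbage-collection item above, a config-level mitigation sketch (my addition; it assumes the standard `Datastore` config keys, and the numbers need to be sized for the actual disk):

```sh
# Give the node enough headroom that an in-flight ~22 GiB pin never crosses
# the GC watermark...
ipfs config Datastore.StorageMax 100GB
ipfs config --json Datastore.StorageGCWatermark 90

# ...or simply run the daemon without --enable-gc while a large pin is in flight.
```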
For this sub-problem:

> When pinning large data sets, if IPFS can't find a 'valid' copy of the data it just hangs, 'waiting' for it.
New strategies introduced in https://github.com/ipfs/kubo/pull/10754 provide an alternative to low-level pins: partial DAGs can now be opportunistically protected from GC and provided, and you can preload as much as you can over time.
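Independent of those new strategies, a generic way to preload a large DAG incrementally before pinning (a sketch; it only "resumes" as long as GC doesn't evict the cached blocks in between):

```sh
# Walking the DAG with `ipfs refs -r` pulls its blocks into the local
# blockstore; re-running after an interruption skips blocks already present.
CID=QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
ipfs refs -r "$CID" > /dev/null

# Once the walk completes, the pin mostly just verifies local data.
ipfs pin add --progress "$CID"
```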