failed to acquire reader of mount so we can return the accessor: no such file or directory
Celestia Node version
v0.12.1
OS
Ubuntu 22.04.1 LTS
Install tools
Node is compiled from source, tag v0.12.1, using `make celestia`.
Command to run the bridge: `./celestia-node/build/celestia bridge start --node.store nodebridge`
Command to run the node: `./celestia-app/build/celestia-appd start --home node`
Others
The data is stored on a separate ZFS pool with 3 TB of free space and no compression enabled. There is around ~120 GB of free RAM. Other data on the server is fine. I reproduced the issue 3 times, each time after a full sync.
Steps to reproduce it
Start syncing the bridge data, wait a bit for some synchronization, then press CTRL+C to stop it, then run it again.
Expected result
No error message
Actual result
2024-01-08T11:01:52.211Z WARN dagstore [email protected]/dagstore_async.go:39 acquire: failed to fetch from mount upgrader {"shard": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "error": "open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}
2024-01-08T11:01:52.212Z ERROR share/eds eds/store.go:194 removing shard after failure {"key": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "err": "failed to acquire reader of mount so we can return the accessor: open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}
2024-01-08T11:01:52.213Z ERROR share/eds eds/store.go:194 removing shard after failure {"key": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "err": "failed to acquire reader of mount so we can return the accessor: open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}
Relevant log output
https://pastebin.com/ASabF4gv
Notes
No response
Hey @wesraph, do the same shard keys appear on every restart?
Disregard my question, I just saw the pastebin and see how many blocks it is. Do you press CTRL+C many times in quick succession, or only once and then wait?
I restarted the node and checked whether the same keys appeared again in the logs, and it doesn't seem to be the case. I only press CTRL+C once so as not to corrupt the DB. The logs show the paths starting with "/blocks"; is it possible that the basePath gets lost somewhere during the initialization of the node?
I deployed the bridge + node on another machine (Ubuntu 23.02, XFS as the filesystem, this time with node version v0.12.3) and I get the exact same problem.
Just to note, I had 2 or 3 other bridge node runners report this problem after they simply upgraded their nodes and restarted.
If it happens after a simple upgrade and restart, it is probably just a corrupted filestore from a bad shutdown. I am continuing to investigate this, but for now the best thing to do is resync.
What is unsettling about this issue is that files which the dagstore has marked as available (and which were fine before shutdown) somehow get removed from the filesystem. @wesraph could you try to reproduce, running `ls -al /blocks | wc -l` before and after restart? This will tell us whether blocks are really being removed, or whether they are just being registered without actually being written to disk.
Right after I stopped the node: 466026
30 seconds after I restarted it: 465915
1-2 minutes after I restarted it: 465335
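For repeated before/after checks, here is a minimal Go sketch that prints the number of entries in the blocks directory; the path is an assumption and should be adjusted to your `--node.store` location:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumed location of the blocks directory; adjust to <node.store>/blocks.
	const blocksDir = "/mnt/nodebridge/blocks"

	entries, err := os.ReadDir(blocksDir)
	if err != nil {
		fmt.Println("failed to read blocks dir:", err)
		return
	}
	fmt.Println("block files:", len(entries))
}
```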
After tracing the execution of the program, I found out that the path used to open the block is not right:
futex(0x51876a8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc009393548, FUTEX_WAKE_PRIVATE, 1) = 1
write(1, "fetch /blocks/93CC3D14C94553A8F4"..., 79) = 79
openat(AT_FDCWD, "/blocks/93CC3D14C94553A8F4D6B5D03AA5AC11606F74A80D6CD1F1D1F6E105EDD45F21", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "2024-01-13T10:41:35.029Z\t\33[33mWA"..., 318) = 318
In my case, the blocks are stored at /mnt/nodebridge/blocks/.
After applying a dirty fix to confirm my theory, it now works without any errors (dagstore/mount/file.go:17):
```go
func (f *FileMount) Fetch(_ context.Context) (Reader, error) {
	fmt.Println("fetch", f.Path)
	// Dirty workaround: paths recorded as "/blocks/..." are missing the
	// node store root, so prepend it before opening.
	if strings.HasPrefix(f.Path, "/blocks") {
		fmt.Println("Replacing")
		return os.Open("/mnt/nodebridge" + f.Path)
	}
	return os.Open(f.Path)
}
```
The question now is why the path is lost during execution. I will give it a try on Monday.
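To frame that question, here is a purely hypothetical illustration (not the actual dagstore or celestia-node code) of how a path recorded without the store root ends up being opened against the filesystem root after a restart:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	storeRoot := "/mnt/nodebridge"    // configured via --node.store
	recorded := "/blocks/<shard-key>" // what shows up in the error logs

	// If only the "/blocks/..." part survives persistence, opening it as-is
	// resolves against "/" and fails with "no such file or directory".
	fmt.Println("opened as recorded:", recorded)

	// Joining it back onto the store root yields the file that actually exists.
	fmt.Println("opened against store root:", filepath.Join(storeRoot, recorded))
}
```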
One other bridge node runner had the same problem; the issue was the ulimit setting, which was set to 4096. He said raising it also helped with the issue above (not just the open-files issue).
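For anyone who wants to verify this from within a Go process, a minimal sketch (assuming Linux; the 65536 threshold is only an example, not an official recommendation) that reads the current open-file limit:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Println("failed to read RLIMIT_NOFILE:", err)
		return
	}
	// 4096 was reported as too low for a bridge node.
	fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)
	if lim.Cur < 65536 {
		fmt.Println("consider raising the limit, e.g. `ulimit -n 65536` before starting the node")
	}
}
```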
@mindstyle85 suggested I post my log here; I hope it helps. After I set a larger file limit and resynced the data, I don't see these errors again.
After upgrading to 0.12.4 we are seeing issues:
ERROR share/eds eds/store.go:199 failed to remove shard after failure {"key": "FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812", "err": "failed to drop index for FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812: remove /root/.celestia-bridge/index/FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812.full.idx: no such file or directory"}
However, the head is increasing. Is this related to the issue described here, or should I open up another issue?
@wesraph we actually had a similar issue that we investigated last week. It would appear one of our data stores indexes full paths when originally saving. If you follow the discussion and investigation at https://github.com/celestiaorg/celestia-node/issues/3450, you'll see a temporary fix and an explanation of an imminently arriving storage improvement that should prevent this issue.