
failed to acquire reader of mount so we can return the accessor: no such file or directory

Open wesraph opened this issue 1 year ago • 11 comments

Celestia Node version

v0.12.1

OS

Ubuntu 22.04.1 LTS

Install tools

The node is compiled from source at tag v0.12.1 using `make celestia`. Command to run the bridge: `./celestia-node/build/celestia bridge start --node.store nodebridge`. Command to run the node: `./celestia-app/build/celestia-appd start --home node`.
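For reference, the whole setup as a script would look roughly like this (assuming both repos are cloned side by side; the checkout step is implied by the tag):

```Shell
# Build the bridge node from source at the reported tag
cd celestia-node && git checkout v0.12.1 && make celestia && cd ..

# Start the bridge node with its store in ./nodebridge
./celestia-node/build/celestia bridge start --node.store nodebridge

# Start the consensus node (celestia-app) with its home in ./node
./celestia-app/build/celestia-appd start --home node
```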

Others

The data is stored on a separate ZFS pool with 3 TB of free space and no compression enabled. There is around 120 GB of free RAM. Other data on the server is intact. I reproduced the issue 3 times, each time after a full sync.

Steps to reproduce it

Start the bridge sync, wait a bit for some synchronization, press CTRL+C once to stop it, then run it again.
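In shell terms, roughly (same commands as above; the wait time is arbitrary):

```Shell
# Let the bridge sync for a while, then interrupt it once with CTRL+C (SIGINT)
./celestia-node/build/celestia bridge start --node.store nodebridge

# On the second start, watch the log (written to stderr) for the warning
./celestia-node/build/celestia bridge start --node.store nodebridge 2>&1 \
    | grep "failed to acquire reader of mount"
```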

Expected result

No error message

Actual result

2024-01-08T11:01:52.211Z	WARN	dagstore	[email protected]/dagstore_async.go:39	acquire: failed to fetch from mount upgrader	{"shard": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "error": "open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}
2024-01-08T11:01:52.212Z	ERROR	share/eds	eds/store.go:194	removing shard after failure	{"key": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "err": "failed to acquire reader of mount so we can return the accessor: open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}
2024-01-08T11:01:52.213Z	ERROR	share/eds	eds/store.go:194	removing shard after failure	{"key": "CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B", "err": "failed to acquire reader of mount so we can return the accessor: open /blocks/CFF004BEFB51B86D26CBD23C013CF339B7C8EC60E5952649CF8E459C36985B9B: no such file or directory"}

Relevant log output

```Shell
https://pastebin.com/ASabF4gv
```

Notes

No response

wesraph avatar Jan 08 '24 11:01 wesraph

Hey @wesraph, do the same shard keys appear on every restart?

distractedm1nd avatar Jan 08 '24 11:01 distractedm1nd

Disregard my question, I just saw the pastebin and see how many blocks it is. Do you press CTRL+C many times in quick succession, or only once followed by a wait?

distractedm1nd avatar Jan 08 '24 11:01 distractedm1nd

I restarted the node and checked whether the same keys appeared again in the logs, and it doesn't seem to be the case. I only press CTRL+C once, to avoid corrupting the DB. The logs show the path starting with "/blocks"; is it possible that the basePath gets lost somewhere during the initialization of the node?

wesraph avatar Jan 08 '24 11:01 wesraph

I deployed the bridge + node on another machine (Ubuntu 23.02, XFS as the filesystem, this time with node version v0.12.3) and I get the exact same problem.

wesraph avatar Jan 11 '24 08:01 wesraph

Just to note, I had 2 or 3 other bridge node runners report this problem after they simply upgraded their nodes and restarted.

mindstyle85 avatar Jan 12 '24 12:01 mindstyle85

If it happens after a simple upgrade and restart, it is probably just a corrupted filestore from a bad shutdown. I am continuing to investigate this, but for now the best thing to do is resync.

What is unsettling about this issue is that files the dagstore has marked as available (and that were fine before shutdown) somehow get removed from the filesystem. @wesraph Could you try to reproduce, running `ls -al /blocks | wc -l` before and after restart? This will tell us whether blocks are really getting removed, or whether they are being registered without actually being written to disk.
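For example, something like this (the store path is an assumption based on `--node.store nodebridge`; adjust it to your setup):

```Shell
# Count block files before shutdown, then again after restart
ls -al nodebridge/blocks | wc -l

# Or keep watching the count while the node runs
watch -n 30 'ls nodebridge/blocks | wc -l'
```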

distractedm1nd avatar Jan 12 '24 12:01 distractedm1nd

Right after I stopped the node: 466026
30 seconds after I restarted it: 465915
1-2 minutes after I restarted it: 465335

wesraph avatar Jan 13 '24 10:01 wesraph

After tracing the execution of the program, I found that the path used to open the block is not right:

futex(0x51876a8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc009393548, FUTEX_WAKE_PRIVATE, 1) = 1
write(1, "fetch /blocks/93CC3D14C94553A8F4"..., 79) = 79
openat(AT_FDCWD, "/blocks/93CC3D14C94553A8F4D6B5D03AA5AC11606F74A80D6CD1F1D1F6E105EDD45F21", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "2024-01-13T10:41:35.029Z\t\33[33mWA"..., 318) = 318

In my case the blocks are stored at /mnt/nodebridge/blocks/
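For reference, the trace above can be captured with something like this (assuming the bridge is the only process named `celestia`; the store path is the one from my setup):

```Shell
# Trace the open calls of the running bridge and keep only the block fetches
strace -f -e trace=openat -p "$(pidof celestia)" 2>&1 | grep /blocks/

# If the base path is what got lost, the block the bridge fails to open under
# "/blocks/..." should still exist under the real store path
ls -l /mnt/nodebridge/blocks/93CC3D14C94553A8F4D6B5D03AA5AC11606F74A80D6CD1F1D1F6E105EDD45F21
```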

After applying a dirty fix to confirm my theory, it now works without any errors (dagstore/mount/file.go:17):

func (f *FileMount) Fetch(_ context.Context) (Reader, error) {
    fmt.Println("fetch", f.Path)
    // Dirty fix: the stored path has lost its base directory ("/mnt/nodebridge"),
    // so re-prefix it before opening the block file.
    if strings.Index(f.Path, "/blocks") == 0 {
        fmt.Println("Replacing")
        return os.Open("/mnt/nodebridge" + f.Path)
    }
    return os.Open(f.Path)
}

The question now is why the path gets lost during execution; I will look into it on Monday.

wesraph avatar Jan 13 '24 10:01 wesraph

One other bridge node runner had the same problem; the cause was the ulimit setting, which was set to 4096. He said raising it also helped with the issue above (not just the open-files issue).
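For reference, checking and raising the open-file limit looks roughly like this (65536 is only an example value, not the one he used):

```Shell
# Show the current soft limit for open file descriptors
ulimit -n

# Raise it for the current shell session before starting the bridge
ulimit -n 65536

# If the bridge runs under systemd, set the limit in the unit file instead:
#   [Service]
#   LimitNOFILE=65536
```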

mindstyle85 avatar Jan 16 '24 11:01 mindstyle85

37184-1000-f.zip

@mindstyle85 suggested I post my log here; I hope it helps. After I set a larger file limit and resynced the data, I don't see these errors again.

HoytRen avatar Jan 17 '24 06:01 HoytRen

After upgrading to 0.12.4 we are seeing issues:

ERROR share/eds eds/store.go:199 failed to remove shard after failure {"key": "FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812", "err": "failed to drop index for FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812: remove /root/.celestia-bridge/index/FFCAFB7D3FE212403B1A940C00B0B0FC36AD17A8EA3403EC6D7E183F2EE7E812.full.idx: no such file or directory"}

However, the head is increasing. Is this related to the issue described here, or should I open another issue?

Staking7pc avatar Jan 23 '24 03:01 Staking7pc

@wesraph we actually had a similar issue that we investigated last week. It appears that one of our data stores indexes full paths when originally saving. If you follow the discussion and investigation at https://github.com/celestiaorg/celestia-node/issues/3450, you'll see a temporary fix and an explanation of an imminently arriving storage improvement that should prevent this issue.

ramin avatar Jun 19 '24 11:06 ramin