Tagging 0.6.2: Problems left?

Open kakra opened this issue 5 years ago • 9 comments

@Zygo Do we need another commit? I've tagged the top v0.6.y commit and injected it into my package manager to build a 0.6.2 version, here's the result:

Nov 01 21:46:09 jupiter beesd[2393154]: bees version v0.6.2
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: Calculating transid_max...
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: ---  END  TRACE --- exception ---
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]:
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: *** EXCEPTION ***
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]:         exception type std::system_error: BTRFS_IOC_INO_LOOKUP: rv = readlink(path.c_str(), buf, size + 1): No such file or directory at fs.cc:430: No such file or directory
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: ***
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: Calculating transid_max...
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: ---  END  TRACE --- exception ---
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]:
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: *** EXCEPTION ***
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]:         exception type std::system_error: BTRFS_IOC_INO_LOOKUP: rv = readlink(path.c_str(), buf, size + 1): No such file or directory at fs.cc:430: No such file or directory
Nov 01 21:46:09 jupiter beesd[2393154]: crawl_transid[2393213]: ***

Looks strange to me...

kakra avatar Nov 01 '19 20:11 kakra

Is it an empty filesystem? It might be a duplicate of #93

Zygo avatar Nov 06 '19 03:11 Zygo

No, it's an existing file system with an existing hash table. Bees has already been running on it for months.

kakra avatar Nov 06 '19 08:11 kakra

Is there any new information on this?

Zygo avatar Nov 29 '19 01:11 Zygo

I didn't have time/resources yet to check again with that version.

kakra avatar Nov 29 '19 11:11 kakra

I've been using bees for ~1 year on a raid1 filesystem with 5 disks, quite successfully (thanks for this amazing piece of software by the way, I can't imagine the countless hours of work and the deep understanding you have to have of the btrfs data structures to be able to come up with bees, amazing).

I recently compiled the v0.6.2 version, and I'm now hitting this bug too, on the same filesystem. I'm available if you need me to git bisect this bug, or compile a weird custom version of the Linux kernel with some debug stuff added to it. By the way I'm currently running 5.5.0-rc6 with the btrfs patches from rc7.

EDIT: Hmm, this is weirder than I thought. I started to bisect for the sake of it, with cf9d1d0 working and 6e75857 (v0.6.2) not working (infinite error loop as OP), but quickly ended up having beesd versions that were segfaulting during start (!), then retried v0.6.2 on top of this, and now it works and I don't get an infinite loop. So this would seem to be related to the state of the beescrawl.dat file? (I didn't save and restore it between each git bisect step; I probably should have.)

EDIT2: v0.6.2 ended up in the error loop after some time while catching up the transactions of the last few days. So I guess I'll have to git bisect over the course of several days to pin down the problem correctly. Will report here when done.
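
To keep bisect steps comparable, the crawl state file could be saved once and restored before each step. A minimal sketch, assuming the state lives in a single file; the function names and the paths passed in are hypothetical and not part of bees:

```python
import shutil
from pathlib import Path


def save_state(state_file: Path, backup_file: Path) -> None:
    """Copy the crawl state aside once, before the first bisect step."""
    # copy2 also preserves timestamps, which keeps the backup easy to identify.
    shutil.copy2(state_file, backup_file)


def restore_state(state_file: Path, backup_file: Path) -> None:
    """Overwrite the live state with the saved copy so every step starts equal."""
    shutil.copy2(backup_file, state_file)
```

The daemon would have to be stopped before calling `restore_state`, otherwise it may rewrite the file after the restore.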

speed47 avatar Jan 22 '20 19:01 speed47

Were there any snapshot deletes just before this? It looks like it might be trying to walk over the subvols to get transids, but fails because there's no path for them...all of which is relatively normal. But then it gets stuck trying not to return a huge transid_max that will break crawls...there might be something broken there.
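
The idea described above can be sketched roughly as follows. This is an illustration only, not taken from the bees source: `max_transid`, `roots`, and `lookup` are hypothetical names, and `FileNotFoundError` stands in for the "No such file or directory" failure seen in the log.

```python
from typing import Callable, Iterable, Optional


def max_transid(roots: Iterable[int],
                lookup: Callable[[int], int]) -> Optional[int]:
    """Return the largest transid over all roots whose lookup succeeds."""
    best: Optional[int] = None
    for root in roots:
        try:
            transid = lookup(root)
        except FileNotFoundError:
            # The subvolume has no path (for example a deleted snapshot):
            # skip it instead of aborting the whole calculation.
            continue
        if best is None or transid > best:
            best = transid
    # None means "no usable value"; the caller can retry later instead of
    # receiving a made-up, possibly huge number.
    return best
```

Returning `None` rather than a sentinel value is one way to avoid handing an oversized transid to the crawlers when every lookup fails.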

Does bees eventually get out of this loop?

Does master do this? If not, it's past time to do a 0.7 release anyway... ;)

Zygo avatar Jan 22 '20 22:01 Zygo

> Were there any snapshot deletes just before this? It looks like it might be trying to walk over the subvols to get transids, but fails because there's no path for them...all of which is relatively normal. But then it gets stuck trying not to return a huge transid_max that will break crawls...there might be something broken there.

I have some automation around snapshots and they get created and deleted behind my back all the time, so this is likely.

> Does bees eventually get out of this loop?

No, once it goes there, it repeats every second till stopped.

> Does master do this? If not, it's past time to do a 0.7 release anyway... ;)

Apparently not. Before attempting to go to v0.6.2, I was following master; I think my last build of bees from some months ago (which was running fine) was 7117cb4.

speed47 avatar Jan 23 '20 08:01 speed47

> Does master do this? If not, it's past time to do a 0.7 release anyway... ;)

I didn't see master doing this. Please let's go forward to 0.7 and discard the idea of 0.6.2. Some of the commits in the 0.6 branch after 0.6.1 seem to be incompatible with each other, or some bit is missing. The 0.6.2 version I experimentally tagged in my repo always shows this crash. No automated snapshots are running, created, or deleted in the background. If even you don't find the problem with advanced tools like bisect, even knowing and understanding the code, it may not be worth the effort trying to fix a soon to be abandoned 0.6 branch.

Maybe we should reset 0.6.x to 0.6.1 and only merge the compiler compatibility fixes, if 0.7 takes some more time. Master has proven to be working just fine for me on my personal system (kernel 5.4) and a container web server with high dedup potential (kernel 4.19), 24/7, for several months now, no problems even during IO stress.

kakra avatar Jan 23 '20 09:01 kakra

I seem to be running into the below as well:

exception type std::system_error: BTRFS_IOC_INO_LOOKUP: rv = readlink(path.c_str(), buf, size + 1): No such file or directory at fs.cc:430: No such file or directory

Bees runs fine for a few minutes or so, then that message is spammed endlessly. I don't have any snapshots being created or deleted, which was suggested in this issue as a cause. I'm using 0.6.3 and will test master.

telans avatar Dec 10 '20 23:12 telans