zellij icon indicating copy to clipboard operation
zellij copied to clipboard

Panic when run from $HOME

Open alerque opened this issue 2 years ago • 11 comments

Essentially a duplicate of #2123 that was closed without resolution, but I'm running under Arch Linux, not Nix.

$ cd
$ zellij
Error occurred in server:

  × Thread 'wasm' panicked.
  ├─▶ Originating Thread(s)
  │     1. ipc_server: NewClient
  │     2. screen_thread: NewTab
  │     3. plugin_thread: NewTab
  │
  ├─▶ At library/std/src/sys/unix/time.rs:66:9
  ╰─▶ assertion failed: tv_nsec >= 0 && tv_nsec < NSEC_PER_SEC as i64

Starting anywhere other than $HOME seems to work fine.

I don't think my debug log provides any clues any more than the last issue, but it is consistently reproducible.

Trying to strace it to figure out what files get inspected didn't turn up anything two interesting; but this bit early on is weird:

readlink("/proc/self/exe", "/usr/bin/zellij", 256) = 15
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f22d4422000
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7f22d4422000, stack_size=0x9000}, 88) = 184788
munmap(0x7f22d4422000, 36864)           = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
wait4(184788, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 184788
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=184788, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
socket(AF_UNIX, SOCK_STREAM, 0)         = 5
connect(5, {sa_family=AF_UNIX, sun_path="/run/user/1000/zellij/0.35.2/adorable-bears"}, 110) = -1 ENOENT (No such file or directory)
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=50000000}, 0x7ffe40880400) = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 6
connect(6, {sa_family=AF_UNIX, sun_path="/run/user/1000/zellij/0.35.2/adorable-bears"}, 110) = 0
setsockopt(6, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
dup(6)                                  = 7
write(6, "\222\201\251NewClient\225\223\222]\315\1Y\222\334\0\23\247Default\244"..., 8190) = 8190
write(6, "MoveFocus\244Down\201\244Char\241h\221\201\251MoveFoc"..., 8180) = 8180

What is adorable-bears supposed to be?

Later on we get to the actual panic:

write(1, "\33[1mLoading Zellij\33[m\n", 22) = 22
write(1, "\r", 1)                       = 1
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x557e9f79ca58, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x557e9e8e4db0, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x557e9e8e4db0, FUTEX_WAKE_PRIVATE, 1) = 0
write(1, "Querying terminal emulator for \33"..., 89) = 89
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x557e9f79ca58, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x557e9f79ca58, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY) = 0
write(1, "done", 4)                     = 4
write(6, "\222\254ClientExited\221\226\201\246Client\244Exit\245Em"..., 59) = 59
ioctl(0, TCSETS, {c_iflag=ICRNL|IUTF8, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
write(1, "\33[?1006l\33[?1015l\33[?1002l\33[?1000l", 32) = 32
write(1, "\33[?1049l\n\33[93;1HError occurred i"..., 985) = 985

alerque avatar Apr 11 '23 13:04 alerque

Hey @alerque - thanks for opening this issue.

As I mentioned in the previous issue, we're going to need a little bit more to go on here. Because while I completely believe you that this is happening for you and 100% reproducible, it is not for me on any machine that I have access to.

I'm going to need to know what's different about your home folder that's causing this. Could you please try to troubleshoot this a little further? Namely by finding out which file in your $HOME folder is causing this? Either that or creating a new user with an empty $HOME and seeing it it happens to it? Maybe taking it from there?

What is adorable-bears supposed to be?

That's the session name and thus also the IPC pipe file.

I appreciate the strace, but I did not see anything out of the ordinary there.

imsnif avatar Apr 11 '23 14:04 imsnif

We need help pinpointing this issue. Could you test some things like outlined in https://github.com/zellij-org/zellij/issues/2369#issuecomment-1513359137 @alerque?

raphCode avatar Apr 18 '23 15:04 raphCode

We now know what is going on, see #2369 for details.

raphCode avatar Apr 27 '23 21:04 raphCode

Yes, I can confirm my issue is the same (also BTRFS):

$ stat /home/caleb
  File: /home/caleb
  Size: 4796            Blocks: 0          IO Block: 4096   directory
Device: 0,40    Inode: 1655        Links: 1
Access: (0700/drwx------)  Uid: ( 1000/   caleb)   Gid: ( 1000/   caleb)
Access: 2023-04-28 10:57:49.392684901 +0300
Modify: 2023-04-28 10:58:12.866211620 +0300
Change: 2023-04-28 10:58:12.866211620 +0300
 Birth: 8299573362414321664.1801675124
fn main() {
    dbg!(std::fs::metadata("/home/caleb").unwrap());
}
$ rustc foo.rs
$ ./foo
[foo.rs:2] std::fs::metadata("/home/caleb").unwrap() = Metadata {
    file_type: FileType(
        FileType {
            mode: 16832,
        },
    ),
    is_dir: true,
    is_file: false,
    permissions: Permissions(
        FilePermissions {
            mode: 16832,
        },
    ),
    modified: Ok(
        SystemTime {
            tv_sec: 1682668672,
            tv_nsec: 432709445,
        },
    ),
    accessed: Ok(
        SystemTime {
            tv_sec: 1682668669,
            tv_nsec: 392684901,
        },
    ),
thread 'main' panicked at 'assertion failed: tv_nsec >= 0 && tv_nsec < NSEC_PER_SEC as i64', library/std/src/sys/unix/time.rs:77:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

That being said I think there should be a tracking issue here until zellij has a way to handle this situation without panicing. Even though the actual bug(s) are upstream, this project should find a way to not panic on something that isn't critical to it's function. Since the other issue was closed I suggest keeping this one open until there is a fix in place.

alerque avatar Apr 28 '23 08:04 alerque

Having just run across the recent RustLab talk on zellij I was reminded about how much I'm missing out on still stuck in tmux. But panicking when run from my $HOME is kind of a rough pill to swallow to get started.

Would it be possible to work around this upstream Rust problem similarly to this: https://github.com/sxyazi/yazi/pull/357. It seems like they are intercepting the Panic somehow.

alerque avatar Feb 12 '24 21:02 alerque

I'm pretty sure I fixed this in https://github.com/zellij-org/zellij/pull/3126

imsnif avatar Feb 12 '24 21:02 imsnif

@imsnif Unfortunately I just built and tested from Git HEAD (v0.39.2.r41.gb677ffe), a commit at which that PR should be included and sadly it does not fix this issue.

alerque avatar Feb 13 '24 08:02 alerque

Ah, that's a pity.

Just to recap: this particular bug happens due to an upstream Rust std bug that does not handle corrupted file birthtimes well (which appear to happen every now and then with btrfs for unknown reasons), and does not give the application any reasonable way to mitigate. While I agree we should do everything in our power to get around this - the solution from our side is not trivial.

In order to do this we would have to catch this panic and replace the offending corrupted timestamp. We have to build internal infrastructure for this (seeing as we do different things with our panic hooks).

Since we are really only talking about a single file here (the home folder) - up until then I'd advise users to work around this by adjusting the file's metadata in some manual way (if you find one, please post it here).

Sorry I don't have a better answer for this.

imsnif avatar Feb 13 '24 09:02 imsnif

I do understand this is a nasty upstream bug and that Rust doesn't give you an easy way to "catch" it. Unfortunately fixing the file system seems like a non-starter because the troublesome node is the root of the file system, not some file or directory we can re-created. It's the top level in the mount point. And the birth time seems to be immutable in BTRFS, at least no tooling I've found is able to change it. I'm not the only one out there with a block device with this problem.

I wonder if some provision could be made for this with having to catch the panic by just avoiding even asking for the information in the first place. I don't know what the condition would look like (env var? check for btrfs? system config file with a node blacklist?) but for some special condition perhaps zellij could just avoid asking for the birth time in the first place. What purpose is it even serving?

alerque avatar Feb 13 '24 10:02 alerque

Not really, I'm afraid. This is a metadata request made by wasmer on plugin load, we cannot configure our way out of it, we can only elect not to mount a filesystem for plugins at all. In the basic case this will solve the problem (Zellij will not crash on startup when loading from a btrfs filesystem with a corrupt birthtimestamp in the home folder), but it will be kicking the problem down the road to whenever plugins that require a filesystem are loaded.

It will be doing so in a non-trivial-to-understand way that will require both further explanation and maintenance from maintainers. It will be doing so for a very small fraction of users in order to mitigate a bug in an upstream dependency caused by another bug in the user's environment.

I am truly very sorry you are one of those small fraction of users experiencing this bug - but please remember: Zellij is pre 1.0 and maintained by a small group of overworked and mostly unpaid maintainers. I have to make hard decisions about what we're going to fix (and later maintain).

imsnif avatar Feb 13 '24 11:02 imsnif

As an overworked and underpaid prolific contributor to FOSS project I understand you have to make hard choices.

The closest I've gotten to a way to fix this is this set of proposed kernel patches which seem to enable writing btime on BTRFS. I haven't even figured out what kernel version these should be applied to much less tried to work out what a C program that would run on such a kernel and could be used to fix corrupted timestamp entries would look like—at least not yet. Maybe if somebody else out there has this issue with zellij or any other affected Rust app they can get a bit farther.

alerque avatar Feb 13 '24 12:02 alerque