Darling wiped my external drive (twice...)

Open HinTak opened this issue 5 years ago • 20 comments

I don't know how it happened, but I found that darling has wiped my external drive (mounted by udisks2 under /run/media/<myid>/diskname) twice, on launchctl shutdown.

The first time, about 6 weeks ago, was quite painful: 350GB on a 1TB USB drive formatted as ext4 was gone, and I am still suffering from it. Surviving files are either owned by root (not my user account) or in directories to which I happened not to have write access (i.e. mode 444).

The 2nd time, just now, it was only a 32 GB drive about 2/3 full... All the files were mine, so all of it is gone.

The first time it happened was late at night; I was doing too many things in too many terminals, and I thought I might have accidentally cut-and-pasted some combination of rm -rf (and I have been kicking myself for a month). The 2nd time I was not doing anything else, other than just using darling for about 39 minutes and then quitting.

What exactly does darling shell do to clean up the working directory overlay and private mounts?

This bug is too scary :-( .

HinTak avatar Jul 22 '20 01:07 HinTak

This is why we always tell you to never run Darling on the host, always in a VM. (Not this in particular, but the possibility of bugs like this.)

What exactly does darling shell do to clean up the working directory overlay and private mounts?

Nothing. When the container exits, Linux cleans up the mounts automatically. You'll have to look into what launchd does, if this happens when you tell launchd to shut down.

bugaevc avatar Jul 22 '20 04:07 bugaevc

I am guessing some bad interaction between darling and udisks2 mounted disks.

Oh, I was also using DPREFIX this time, and probably was last time too (last time I had more than one darling session, one in the default ~/.darling and the other in ~/d1). This time I was continuing with ~/d1. I had also used darling briefly between the two events, but I don't remember which prefix I was using.

HinTak avatar Jul 22 '20 10:07 HinTak

My "/" and "/home" are both LVM volumes with ext4 .

Obviously /home is visible as /SystemRoot/home (and /run/media/myid/diskname similarly), and at least I am glad it did not touch /home.

Hmm, udisks2 has a message about cleaning up /run/media/myid/diskname, which usually just means removing the empty mount point directory that it created earlier, when disks are unmounted.

This bug is way too scary to have seen once, let alone twice...

HinTak avatar Jul 22 '20 11:07 HinTak

One other anomaly: right after the most recent wipe, I obviously wanted to reboot soon (even after unloading the kernel module). It appears that some of /run/user/1000/, like the SSH_AUTH_SOCK and the gnome/dbus files, might have gone too. I was surprised to be asked for the password when I git push'ed, twice (I pushed two things and git asked both times), which ssh-agent should cache. After I closed all my gnome terminals I could no longer start new ones; apparently something to do with the gnome terminal server process was also broken. So there was really not much to do other than reboot...

Anyway, /run/user/1000/ is on tmpfs and stores many of gnome's state files. It is called the "XDG_RUNTIME_DIR". SSH_AUTH_SOCK is keyring/ssh under that.

My guess at this point is that darling has a tendency to destroy other per-user mounts...

HinTak avatar Jul 22 '20 18:07 HinTak

I don't know if it is relevant, but launchd often crashes at shutdown ("launchctl shutdown" also gives a few delayed messages), and systemd-coredump often tries to collect the coredump and then gives up, realizing launchd is not even an ELF executable nor one installed by the system's package management system (fedora).

It might be systemd doing too much. But as I wrote, it wipes only files owned by my user account, and only on per-user mounts like udisks2-managed external drives and the gnome session tmpfs.

HinTak avatar Jul 23 '20 15:07 HinTak

and systemd-coredump often tries to collect the coredump and then gives up, realizing launchd is not even an ELF executable nor one installed by the system's package management system (fedora).

This is weird, because it's not supposed to limit coredumps to either system executables or ELFs. It has always worked for me.

It might be systemd doing too much.

It's extremely unlikely that systemd is involved in this at all.

What this could be is that either launchd is doing something weird, or this misbehaves with your custom DPREFIX:

https://github.com/darlinghq/darling/blob/fa5348c8a9b338746b91a53a801343a8d046a66d/src/startup/darling.c#L948-L962

bugaevc avatar Jul 23 '20 15:07 bugaevc

  1. First of all, launchctl shutdown is not a supported / recommended way of shutting down with Darling. I don't think we've ever tested this.

  2. I'd look for clues inside launchd's source code. It is quite possible that it attempts to clean certain locations during shutdown and something doesn't work as expected.

LubosD avatar Jul 23 '20 15:07 LubosD

:-( . /var/run is a symlink to ../run, i.e. /run and the symlink is part of fedora's filesystem package, and therefore on every fedora system, at least. :-(

HinTak avatar Jul 23 '20 15:07 HinTak

/var/run is a symlink to ../run, i.e. /run

I know, this is why I'm saying it is a possibility that this wiping of directories on startup may be to blame. But note that we're wiping $DPREFIX/var/run and $DPREFIX/var/tmp, not /var/run and /var/tmp on the host.
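To illustrate the distinction between $DPREFIX/var/run and the host's /var/run, here is a minimal sketch (not Darling's actual code) of a check that a path, once symlinks are resolved, still lies under the prefix. The function name `resolves_inside` and the throwaway directories in `demo` are hypothetical; whether any such escape actually happened in this report is not established.

```python
import os
import tempfile


def resolves_inside(prefix: str, path: str) -> bool:
    """Return True if `path`, with all symlinks resolved, is still under `prefix`."""
    real_prefix = os.path.realpath(prefix)
    real_path = os.path.realpath(path)
    # commonpath of the two equals the prefix only when real_path is the
    # prefix itself or one of its descendants.
    return os.path.commonpath([real_prefix, real_path]) == real_prefix


def demo():
    """Build a fake prefix whose var/run is a symlink pointing outside of it."""
    with tempfile.TemporaryDirectory() as host_run, \
            tempfile.TemporaryDirectory() as prefix:
        os.makedirs(os.path.join(prefix, "var", "tmp"))
        os.symlink(host_run, os.path.join(prefix, "var", "run"))
        return (
            resolves_inside(prefix, os.path.join(prefix, "var", "tmp")),
            resolves_inside(prefix, os.path.join(prefix, "var", "run")),
        )
```

A wipe routine could refuse to recurse into any directory for which such a check fails. Note that a check followed by a later delete is still open to races if something changes the tree in between.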

bugaevc avatar Jul 23 '20 15:07 bugaevc

Argh - https://github.com/darlinghq/darling/issues/632 - at least earlier, it was useful to be able to unload the kernel module (which is kept in use by launchd, even after darling shell exits).

HinTak avatar Jul 23 '20 15:07 HinTak

That code is in start-up. I suppose my question is: does launchd also have a tendency to re-exec itself in an emergency and try to relaunch itself, like systemd and other init daemons?

HinTak avatar Jul 23 '20 15:07 HinTak

I have created https://github.com/darlinghq/darling/pull/852 to limit recursive file-wiping to known files (from my two DPREFIX's) created by darling.

Regardless of whether that's the cause, I think a recursive wipe is too dangerous. It is tedious but better to explicitly list all known files that need to go.
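The idea of deleting only an explicit list of known files can be sketched roughly as follows. This is not the code from the pull request: the file names in `KNOWN_RUNTIME_FILES` are made up for illustration, and the real list would have to come from what darling actually creates.

```python
import os
import tempfile

# Hypothetical names, for illustration only.
KNOWN_RUNTIME_FILES = ("example.pid", "example.sock")


def remove_known_files(directory, known_names=KNOWN_RUNTIME_FILES):
    """Unlink only the listed names directly inside `directory`; never recurse."""
    removed = []
    for name in known_names:
        target = os.path.join(directory, name)
        # Skip names that do not exist, and skip symlinks entirely.
        if os.path.islink(target) or not os.path.lexists(target):
            continue
        # Never descend into (or remove) directories.
        if os.path.isdir(target):
            continue
        os.unlink(target)
        removed.append(name)
    return removed


def demo():
    """One listed file and one unrelated file: only the listed one is removed."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("example.pid", "unrelated.txt"):
            with open(os.path.join(d, name), "w") as f:
                f.write("x")
        removed = remove_known_files(d)
        return removed, sorted(os.listdir(d))
```

Because nothing here walks a directory tree, a stray symlink or mount point under the directory cannot turn the cleanup into a wipe of something else.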

HinTak avatar Jul 23 '20 17:07 HinTak

"/var/run" is on the bottom of the overlay so it does not need to go and likely won't go. I haven't figured out what creates "/var/tmp" itself, but I'd prefer not to touch it if unsure.

HinTak avatar Jul 23 '20 17:07 HinTak

@LubosD - I see in the startup code that there is an undocumented darling shutdown command which kills the background init process. I see that this might be what I was trying to do with launchctl shutdown. In any case, it would be useful to document what you do for shutting down darling, to the extent that the kernel module can be unloaded too.

HinTak avatar Jul 23 '20 18:07 HinTak

https://github.com/darlinghq/darling/blob/fa5348c8a9b338746b91a53a801343a8d046a66d/src/launchd/support/launchctl.c#L2438 @LubosD @bugaevc here launchctl has very similar code to wipe /var/run and /var/tmp as darling startup. It seems to be run when you do launchctl load -S System -D all .

But I am inclined to think the native code in darling start up is the culprit, rather than launchctl which is mach-o in container, given both try to wipe /var/run and /var/tmp .

HinTak avatar Jul 23 '20 23:07 HinTak

There is always the risk of either of the two recursive wipes escaping DPREFIX and recursing down the host's /run. I would actually make the same change as in https://github.com/darlinghq/darling/pull/852 in launchctl as well, just to be on the safe side.

HinTak avatar Jul 23 '20 23:07 HinTak

The one in launchd was found by grep'ing for readdir. I would suggest a systematic review of all uses of readdir elsewhere too. Mostly they are just recursive searches and somewhat harmless if they escape to the host filesystem, but a recursive delete is definitely dangerous. I don't care how much trouble it is to maintain a different version and a permanent set of patches compared to apple's open-source code, but I'd like to see every such use examined and possibly modified to "not delete".

I am inclined to remove / cripple both of the recursive deletes for my next usage of darling, if I do any time soon again.

HinTak avatar Jul 25 '20 17:07 HinTak

Besides /var/run, it appears that one or two semi-permanent cache files that KDE puts in /var/tmp are gone too (one or two KDE apps seem to use /var/tmp for per-user cache files). On my system, /var/tmp is just a plain subdirectory of /. /tmp is tmpfs and separate.

HinTak avatar Jul 25 '20 17:07 HinTak

There are two other ways of guarding against this sort of issue:

  • actually create two tmpfs for those two directories, and mount at start up.

  • change the wipe logic not to cross file system boundaries. This does not help /var/tmp, but would help /run/media/myname/myexternaldrive
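The second option can be sketched as a recursive delete that compares device numbers and never follows symlinks. This is an illustrative sketch under stated assumptions, not darling's or launchctl's code; the function names are hypothetical. It relies on `st_dev` differing across mount points, so it would not notice a bind mount of the same filesystem, and as noted above it would not protect /var/tmp when that is a plain directory on the root filesystem.

```python
import os
import stat
import tempfile


def wipe_same_filesystem(root):
    """Empty `root`, staying on its filesystem and never following symlinks.

    Returns True if everything below `root` could be removed.
    """
    return _wipe(root, os.lstat(root).st_dev)


def _wipe(directory, root_dev):
    emptied = True
    # Snapshot the entries first so deletion does not disturb the iteration.
    with os.scandir(directory) as it:
        entries = list(it)
    for entry in entries:
        st = entry.stat(follow_symlinks=False)  # lstat: symlinks are not followed
        if stat.S_ISDIR(st.st_mode):
            if st.st_dev != root_dev:
                emptied = False  # another filesystem is mounted here: skip it
                continue
            if _wipe(entry.path, root_dev):
                os.rmdir(entry.path)
            else:
                emptied = False
        else:
            # Regular files, sockets, and symlinks (the link itself, not its target).
            os.unlink(entry.path)
    return emptied


def demo():
    """A symlink inside the wiped tree must not cause its target to be emptied."""
    with tempfile.TemporaryDirectory() as outside, \
            tempfile.TemporaryDirectory() as root:
        with open(os.path.join(outside, "keep.txt"), "w") as f:
            f.write("x")
        os.mkdir(os.path.join(root, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt")):
            with open(os.path.join(root, rel), "w") as f:
                f.write("x")
        os.symlink(outside, os.path.join(root, "link"))
        wipe_same_filesystem(root)
        return os.listdir(root), os.listdir(outside)
```

The first option (mounting a fresh tmpfs over each of the two directories) avoids the need for any wipe at all, since the contents disappear with the mount.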

HinTak avatar Jul 25 '20 21:07 HinTak

Hello there. I've been trying to replicate this issue but simply can't. Not on a VM, not on my work computer. From what I understand, such an issue can sometimes be caused by extreme coincidences and sometimes by human error. I don't believe this has anything to do with darling. I came across this issue because I just wiped my whole drive while running darling, but later realized it was because of something else I was running along with it. It is quite easy to break things on Linux considering the degree of freedom root has, whereas macOS prevents users from doing anything dumb accidentally.

acheong08 avatar Jan 26 '22 13:01 acheong08