Slave has Died issue

Open 5angjun opened this issue 2 years ago • 5 comments

Hello, I'm sangjun, and I'm very interested in this project.

I would like to know how to fix an error in the Manager & Workers communication.

When a slave has died, I want the dead process to restart a new Qemu and reconnect to the fuzzing process.

Dying slaves are very critical when I fuzz for long periods, e.g., over 6 hours.

So I think what needs to be improved is to re-engage dead workers in the fuzzing process.

Is there any idea for this? The relevant manager code is below, with a rough sketch of the idea after it.

    def wait(self, timeout=None):
        results = []
        # wait until the listener or any worker connection is readable
        r, w, e = select.select(self.clients, (), (), timeout)
        for sock_ready in r:
            if sock_ready == self.listener:
                # a new worker connected -- add it to the client list
                c = self.listener.accept()
                self.clients.append(c)
                self.clients_seen += 1
            else:
                try:
                    msg = sock_ready.recv_bytes()
                    msg = msgpack.unpackb(msg, strict_map_key=False)
                    results.append((sock_ready, msg))
                except (EOFError, IOError):
                    # the worker side closed the connection -- drop it
                    sock_ready.close()
                    self.clients.remove(sock_ready)
                    self.logger.info("Worker disconnected (remaining %d/%d)." % (len(self.clients)-1, self.clients_seen))
                    # only the listener itself is left -> no workers remain
                    if len(self.clients) == 1:
                        raise SystemExit("All Workers exited.")
        return results
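For what it's worth, the accept() branch above already re-adds any worker that reconnects to the listener, so the missing piece seems to be respawning the dead worker itself. A rough sketch of the idea (respawn_worker() is hypothetical, not an existing kAFL API):

    # Hypothetical sketch only: instead of exiting when workers drop,
    # the manager could request a replacement; the respawned worker's
    # connection then comes back in through the accept() branch above.
    def handle_worker_loss(self, conn):
        conn.close()
        self.clients.remove(conn)
        self.logger.info("Worker disconnected, requesting respawn...")
        self.respawn_worker(conn)  # hypothetical helper, not kAFL API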

5angjun avatar Oct 11 '23 09:10 5angjun

Hi @5angjun,

I think we need to understand why the slaves (or Workers) are dying in the first place. That shouldn't happen.

You can get more logging information with --log and combine it with --debug to extract useful debug output.

cc @il-steffen, can we expect dying workers during a fuzzing campaign? Something I'm missing?

Wenzel avatar Oct 11 '23 09:10 Wenzel

Worker exit can happen on a Qemu segfault or an unhandled exception in the worker / mutation logic. The above logic only handles the loss of the socket connection; you need to look at why the worker exited.

il-steffen avatar Oct 11 '23 09:10 il-steffen

I think the error occurred when Qemu died.

This is the last code that ran before it died:

    def run_qemu(self):
        # send the 'run' command, then block until Qemu answers with one byte
        self.control.send(b'x')
        self.control.recv(1)

So I think it would be good to restart the fuzzing campaign when Qemu dies.
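For example, a retry wrapper around that handshake could look like this (a sketch only; restart_qemu() is a hypothetical helper standing in for tearing down and re-launching the Qemu subprocess):

    # Sketch only: retry the Qemu handshake a few times before giving up.
    def run_qemu_retrying(self, max_retries=3):
        for attempt in range(1, max_retries + 1):
            try:
                self.control.send(b'x')
                res = self.control.recv(1)
                if res:
                    return
                # an empty read means Qemu closed the control socket
            except OSError:
                pass
            self.logger.warning("Qemu died (attempt %d/%d), restarting...",
                                attempt, max_retries)
            self.restart_qemu()  # hypothetical helper, not a kAFL API
        raise RuntimeError("Qemu kept dying after %d attempts" % max_retries)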

5angjun avatar Oct 11 '23 10:10 5angjun

Please have a look at why this is happening. In general, we want to fix anything that causes workers to die during a fuzzing campaign. There are cases where restarting won't help; for instance, if the disk is full, Qemu will just exit again on the next file/log write.

In some cases there may be a Qemu segfault that is not easy to fix; for instance, we had bugs related to specific virtio fuzzing harnesses where fixing Qemu did not make much sense. In this case it would make sense to catch the failure and restart the worker. This should be possible from the manager, and then the fuzzing campaign can just continue running.

The manager main loop is here: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/manager.py#L85

We enter this just after launching the workers: https://github.com/IntelLabs/kafl.fuzzer/blob/master/kafl_fuzzer/manager/core.py#L104

The workers are Python threads which in turn launch Qemu sub-processes. The threads should abort normally on a Qemu communication error or uncaught exceptions, so you should be able to detect this and restart the thread with the same settings.

With some luck, the socket connection code you referenced above should detect the new worker and the main loop will start dispatching jobs again.
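A minimal sketch of such a detect-and-restart loop (worker_fn and worker_args are placeholders standing in for kAFL's actual worker entry point and its per-worker settings; the real names differ):

    import threading
    import time

    def supervise(worker_fn, worker_args, poll_interval=5.0):
        # Launch one thread per worker, then respawn any thread that dies
        # (e.g. on a Qemu communication error or an uncaught exception).
        threads = [threading.Thread(target=worker_fn, args=args, daemon=True)
                   for args in worker_args]
        for t in threads:
            t.start()
        while True:
            for i, t in enumerate(threads):
                if not t.is_alive():
                    # restart the worker with the same settings as before
                    threads[i] = threading.Thread(target=worker_fn,
                                                  args=worker_args[i],
                                                  daemon=True)
                    threads[i].start()
            time.sleep(poll_interval)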

il-steffen avatar Oct 11 '23 11:10 il-steffen

This situation appears when allocating a lot of RAM to each VM image and running parallel fuzzing. In my case, the problem appeared while fuzzing a Windows built-in driver for a long time.

For example, my host computer has 84G of RAM, but when I allocated 10G of RAM to each VM and fuzzed with 8 workers (using almost 82G of 84G), Qemu or a worker died (most likely Qemu).

But the manager process stays alive. I am thinking about how to modify the code so that the manager can revive dead workers.
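A side note on the RAM math above: 8 workers × 10G is ~80G of an 84G host, so the Linux OOM killer is a likely culprit for the dying Qemu processes (dmesg should confirm). A pre-launch sanity check could catch this early; a sketch, assuming psutil as an extra dependency:

    import psutil

    def check_memory_budget(vm_ram_gb, num_workers, headroom_gb=4):
        # Refuse to start if the combined VM RAM leaves too little
        # for the host itself (Qemu overhead, manager, page cache, ...).
        total_gb = psutil.virtual_memory().total / 2**30
        needed_gb = vm_ram_gb * num_workers + headroom_gb
        if needed_gb > total_gb:
            raise RuntimeError(
                "%dG for %d VMs (+%dG headroom) exceeds the host's %.0fG; "
                "expect the OOM killer to start killing Qemu processes."
                % (vm_ram_gb * num_workers, num_workers, headroom_gb, total_gb))

    check_memory_budget(vm_ram_gb=10, num_workers=8)  # the setup above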

As a person who loves kAFL, I will also think about how to modify kAFL to make it a masterpiece.😀😀😀

Thanks!

5angjun avatar Oct 11 '23 13:10 5angjun