unicorn icon indicating copy to clipboard operation
unicorn copied to clipboard

Segfault in tb_target_set_jmp_target_arm

Open futhewo opened this issue 1 year ago • 13 comments

In a very long emulation, I have a segfault (dereferencing null pointer) in qemu/tcg/aarch64/tcg-target.inc.c:tb_target_set_jmp_target.

In qemu/accel/tcg/cpu-exec.c, the function tb_set_jmp_target calls the previous function tb_target_set_jmp_target with parameters tc_ptr = 0 and tc_ptr + offset = 0.

Then tb_target_set_jmp_target calls atomic_set((uint64_t*)jmp_addr_, pair) with jmp_addr = tc_ptr + offset = 0, which segfaults.

I patched it (dirty) by adding the following code in tb_set_jmp_target: if (TCG_TARGET_HAS_direct_jump && tb->tc.ptr) { […] }

I do not know what this patch may break, but it solves the crash. I am on commit 6ae0c97.

Feel free to ask me anything.

futhewo avatar Oct 31 '24 09:10 futhewo

Thanks for the report and investigation. Your version seems pretty old, how about current dev branch?

wtdcode avatar Oct 31 '24 10:10 wtdcode

I have also encountered the same problem, and this issue has been very helpful to me

1144822034 avatar Nov 08 '24 07:11 1144822034

Unfortunately, I cannot try on a newer unicorn version, sorry. I am using unicorn as part of unicornafl++ and it uses the version I pointed out.

futhewo avatar Nov 11 '24 10:11 futhewo

unicornafl++

What is that? Are you mentioning unicornafl?

wtdcode avatar Nov 11 '24 11:11 wtdcode

Yes, the one that is part of afl++, that you may find here: https://github.com/AFLplusplus/unicornafl

futhewo avatar Nov 11 '24 17:11 futhewo

Yes, the one that is part of afl++, that you may find here: https://github.com/AFLplusplus/unicornafl

Okay, unicornafl will bump to 2.1.2 once I fixed it.

wtdcode avatar Nov 11 '24 17:11 wtdcode

That would be great! Thanks.

futhewo avatar Nov 11 '24 17:11 futhewo

Tagging https://github.com/unicorn-engine/unicorn/issues/1923 for visibility. My temporary workaround is to destroy and reconstruct the emulator once every few million times.

sadeli413 avatar Dec 09 '24 14:12 sadeli413

I assume https://github.com/unicorn-engine/unicorn/commit/2c688bae734b1e63b2e2fd5d715c1070af722651 shall solve this. Could you have a try @futhewo ?

wtdcode avatar Jan 04 '25 10:01 wtdcode

Hi,

Experienced same segfault on aarch64 target. After investigation the null tc.ptr comes from the last_tb in tb_find.

Analysis

The TCG highwater has been lowered.

void tcg_prologue_init(TCGContext *s)
{
...
    s->code_gen_highwater = (char *)s->code_gen_buffer + (total_size - TCG_HIGHWATER) - s->uc->qemu_real_host_page_size;
}

Thus tcg_gen_code called from tb_find will fail tb_alloc because of a lower highwater. However the last_tb is still set within tb_find.

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
}

Back from tb_gen_code in tb_find, the call to tb_add_jump/tb_set_jmp_target will trigger the fault:

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
    if (last_tb) {
        tb_add_jump(last_tb, tb_exit, tb);
	    	tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
			SEGV
    }
}

While my baseline is d568885d64c89db5b9a722f0c1bef05aa92f84ca. I'm using Sai Cao patch:

git show d01904d3f33bd58c54271a7e28c402dbaddda847
commit d01904d3f33bd58c54271a7e28c402dbaddda847
Author: Sai Cao <[email protected]>
Date:   Thu Dec 19 11:29:20 2024 +0800

    Fix bug that tcg_region_alloc tbs were not flush
    this cause uc_ctl_remove_cache remove a invalid tbs.

diff --git a/qemu/tcg/tcg.c b/qemu/tcg/tcg.c
index dcacec7c..2b643b49 100644
--- a/qemu/tcg/tcg.c
+++ b/qemu/tcg/tcg.c
@@ -808,6 +808,7 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
         if (tcg_region_alloc(s)) {
             return NULL;
         }
+        tb_flush(s->uc->cpu);
         goto retry;
     }
     s->code_gen_ptr = next;

The tb_flush done during the failed alloc, because of a lower highwater, is reponsible for invalidating the last_tb in tb_find whithout letting it know about.

Candidate fix

We could eventually check last_tb.tc.ptr before calling set_jmp_target. But it is safer to detect a tb_flush earlier and reset last_tb such as:

diff --git a/qemu/accel/tcg/cpu-exec.c b/qemu/accel/tcg/cpu-exec.c
index fc47eecb..bf9a8c3b 100644
--- a/qemu/accel/tcg/cpu-exec.c
+++ b/qemu/accel/tcg/cpu-exec.c
@@ -259,11 +259,20 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
     if (tb == NULL||tb->cflags&CF_NOCACHE) {
         mmap_lock();
+
+        unsigned tb_flush_count = uc->tcg_ctx->tb_ctx.tb_flush_count;
+
         tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         cpu->tb_jmp_cache[tb_jmp_cache_hash_func(cpu->uc, pc)] = tb;
 
+        if (uc->tcg_ctx->tb_ctx.tb_flush_count != tb_flush_count) {
+            uc->last_tb = last_tb = NULL;
+        }
+
         if (uc->last_tb) {
             UC_TB_COPY(&cur_tb, tb);
             UC_TB_COPY(&prev_tb, uc->last_tb);

samuel-beckett avatar Mar 04 '25 08:03 samuel-beckett

Hi,

Experienced same segfault on aarch64 target. After investigation the null tc.ptr comes from the last_tb in tb_find.

Analysis

The TCG highwater has been lowered.

void tcg_prologue_init(TCGContext *s)
{
...
    s->code_gen_highwater = (char *)s->code_gen_buffer + (total_size - TCG_HIGHWATER) - s->uc->qemu_real_host_page_size;
}

Thus tcg_gen_code called from tb_find will fail tb_alloc because of a lower highwater. However the last_tb is still set within tb_find.

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
}

Back from tb_gen_code in tb_find, the call to tb_add_jump/tb_set_jmp_target will trigger the fault:

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
    if (last_tb) {
        tb_add_jump(last_tb, tb_exit, tb);
	    	tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
			SEGV
    }
}

While my baseline is d568885d64c89db5b9a722f0c1bef05aa92f84ca. I'm using Sai Cao patch:

git show d01904d3f33bd58c54271a7e28c402dbaddda847
commit d01904d3f33bd58c54271a7e28c402dbaddda847
Author: Sai Cao <[email protected]>
Date:   Thu Dec 19 11:29:20 2024 +0800

    Fix bug that tcg_region_alloc tbs were not flush
    this cause uc_ctl_remove_cache remove a invalid tbs.

diff --git a/qemu/tcg/tcg.c b/qemu/tcg/tcg.c
index dcacec7c..2b643b49 100644
--- a/qemu/tcg/tcg.c
+++ b/qemu/tcg/tcg.c
@@ -808,6 +808,7 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
         if (tcg_region_alloc(s)) {
             return NULL;
         }
+        tb_flush(s->uc->cpu);
         goto retry;
     }
     s->code_gen_ptr = next;

The tb_flush done during the failed alloc, because of a lower highwater, is reponsible for invalidating the last_tb in tb_find whithout letting it know about.

Candidate fix

We could eventually check last_tb.tc.ptr before calling set_jmp_target. But it is safer to detect a tb_flush earlier and reset last_tb such as:

diff --git a/qemu/accel/tcg/cpu-exec.c b/qemu/accel/tcg/cpu-exec.c
index fc47eecb..bf9a8c3b 100644
--- a/qemu/accel/tcg/cpu-exec.c
+++ b/qemu/accel/tcg/cpu-exec.c
@@ -259,11 +259,20 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
     if (tb == NULL||tb->cflags&CF_NOCACHE) {
         mmap_lock();
+
+        unsigned tb_flush_count = uc->tcg_ctx->tb_ctx.tb_flush_count;
+
         tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         cpu->tb_jmp_cache[tb_jmp_cache_hash_func(cpu->uc, pc)] = tb;
 
+        if (uc->tcg_ctx->tb_ctx.tb_flush_count != tb_flush_count) {
+            uc->last_tb = last_tb = NULL;
+        }
+
         if (uc->last_tb) {
             UC_TB_COPY(&cur_tb, tb);
             UC_TB_COPY(&prev_tb, uc->last_tb);

Could you try with commit 2128e01efc81404fecfcdfe1c7bb282ebd3e87d3? This includes a relevant fix to it.

This ensures we have correct tb_flush.

wtdcode avatar Mar 04 '25 08:03 wtdcode

Also I would appreciate it if you could also have a reproduction, even though it might be complex, not minimal. @samuel-beckett

wtdcode avatar Mar 04 '25 08:03 wtdcode

I assume 2c688ba shall solve this. Could you have a try @futhewo ?

Thanks a lot. I'll try that and let you know if the bug still occurs.

futhewo avatar Apr 17 '25 08:04 futhewo