Performance difference of Unicorn 1 and 2

Open boborjan2 opened this issue 1 year ago • 10 comments

We have been using Unicorn 1 for a while and are switching to Unicorn 2 because of bugs that are already fixed there. I ran a simple benchmark (QEMU's test-i386.c with the printfs removed, run a few thousand times in a loop). Unicorn 2 (dev branch) is built with 'cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" .' to enable every optimization we can get.

Interestingly, Unicorn 1 is more than an order of magnitude faster: ~3.6s vs ~99s on my setup. I checked the milestones and applied PR #1838, which brings the time down to 49.15s. Also applying #1839 (I am not sure whether it is going to be merged) brings it to 4.8s.

That last figure is comparable to v1. I assume the difference is caused by using QEMU 5 instead of 2.x. What features does v5 have that justify this? (By the way, I also tried uc_ctl_tlb_mode(uc, UC_TLB_VIRTUAL), but that only makes execution slower(?))

The benchmark uses UC_ARCH_X86, UC_MODE_32.

Any comment is welcome, Thanks, Viktor

boborjan2 avatar Jul 03 '24 13:07 boborjan2

What’s your benchmark code and did you try dev branch?

wtdcode avatar Jul 03 '24 15:07 wtdcode

Hi, I use this code: https://github.com/qemu/qemu/blob/master/tests/tcg/i386/test-i386.c in a loop of 100000 iterations, embedded in a 32-bit Windows exe that is loaded into Unicorn. It is compiled with -O0, with the printfs omitted. I guess a simpler example would be more welcome. Yes, I use today's tip of the dev branch. I also profiled it with gprof; this is the top of the flat profile (with all the PRs mentioned above applied):

% time  cumulative s  self s  name
 29.34      0.49       0.49   helper_lookup_tb_ptr_x86_64
 17.37      0.78       0.29   qht_lookup_custom
 12.57      0.99       0.21   tb_htable_lookup_x86_64
  9.58      1.15       0.16   cpu_exec_x86_64
  8.38      1.29       0.14   tb_lookup_cmp

boborjan2 avatar Jul 03 '24 15:07 boborjan2

Unicorn doesn't have system emulation, how do you deal with syscalls?

wtdcode avatar Jul 03 '24 15:07 wtdcode

The syscalls needed for these simple executables are implemented using int3 hooks. There are no hooks during the benchmark, by the way. The printfs are compiled out with a macro.

boborjan2 avatar Jul 03 '24 16:07 boborjan2

I will try to extract the test loop into a stand-alone .c file to make it easier to reproduce.

boborjan2 avatar Jul 03 '24 16:07 boborjan2

I don’t have any specific clue without the concrete benchmark code. Maybe you can profile the slowest version and look for bottlenecks.


wtdcode avatar Jul 03 '24 16:07 wtdcode

I extracted a subset of the test suite and injected it into the shellcode sample. I reduced the test to this simple case:

static inline void test_bsx(void) __attribute__((always_inline));
static inline void test_bsx(void)
{
    TEST_BSX(bsrw, "w", 0);
    TEST_BSX(bsrw, "w", 0x12340128);
    TEST_BSX(bsfw, "w", 0);
    TEST_BSX(bsfw, "w", 0x12340128);
    TEST_BSX(bsrl, "k", 0);
    TEST_BSX(bsrl, "k", 0x00340128);
    TEST_BSX(bsfl, "k", 0);
    TEST_BSX(bsfl, "k", 0x00340128);
}
void test2(void)
{
    for(int i = 0; i < 20000000; i++) {
        test_bsx();
    }
}

I compiled it with -O0, extracted the code of test2 into a C array, and loaded it into the shellcode sample:

#include <unicorn/unicorn.h>
#include <stdio.h>
#include <string.h>
const uint8_t test_code[276] = {
    0x55, 0x89, 0xE5, 0x83, 0xEC, 0x70, 0xC7, 0x45, 0xFC, 0x00, 0x00, 0x00, 0x00, 0xE9, 0xF1, 0x00,
    0x00, 0x00, 0xC7, 0x45, 0xF8, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xF8, 0x31, 0xC0, 0xBA, 0x78,
    0x56, 0x34, 0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xF4, 0x89, 0x45, 0xF0,
    0xC7, 0x45, 0xEC, 0x28, 0x01, 0x34, 0x12, 0x8B, 0x4D, 0xEC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34,
    0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xE8, 0x89, 0x45, 0xE4, 0xC7, 0x45,
    0xE0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xE0, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66,
    0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xDC, 0x89, 0x45, 0xD8, 0xC7, 0x45, 0xD4, 0x28,
    0x01, 0x34, 0x12, 0x8B, 0x4D, 0xD4, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66, 0x0F, 0xBC,
    0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xD0, 0x89, 0x45, 0xCC, 0xC7, 0x45, 0xC8, 0x00, 0x00, 0x00,
    0x00, 0x8B, 0x4D, 0xC8, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94,
    0xC0, 0x89, 0x55, 0xC4, 0x89, 0x45, 0xC0, 0xC7, 0x45, 0xBC, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D,
    0xBC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55,
    0xB8, 0x89, 0x45, 0xB4, 0xC7, 0x45, 0xB0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xB0, 0x31, 0xC0,
    0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xAC, 0x89, 0x45,
    0xA8, 0xC7, 0x45, 0xA4, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D, 0xA4, 0x31, 0xC0, 0xBA, 0x78, 0x56,
    0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xA0, 0x89, 0x45, 0x9C, 0x90, 0x83,
    0x45, 0xFC, 0x01, 0x81, 0x7D, 0xFC, 0xFF, 0x2C, 0x31, 0x01, 0x0F, 0x8E, 0x02, 0xFF, 0xFF, 0xFF,
    0x90, 0x90, 0xC9, 0xC3,
};

// memory address where emulation starts
#define ADDRESS 0x1000000

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void test_i386(void)
{
    uc_engine *uc;
    uc_err err;

    int r_esp = ADDRESS + 0x200000; // ESP register

    printf("Emulate i386 code\n");

    // Initialize emulator in X86-32bit mode
    err = uc_open(UC_ARCH_X86, UC_MODE_32, &uc);
    if (err) {
        printf("Failed on uc_open() with error returned: %u\n", err);
        return;
    }

    // map 2MB memory for this emulation
    uc_mem_map(uc, ADDRESS, 2 * 1024 * 1024, UC_PROT_ALL);

    // write machine code to be emulated to memory
    if (uc_mem_write(uc, ADDRESS, test_code,
                     sizeof(test_code) - 1)) {
        printf("Failed to write emulation code to memory, quit!\n");
        return;
    }

    // initialize machine registers
    uc_reg_write(uc, UC_X86_REG_ESP, &r_esp);

    // emulate with no timeout and no instruction limit; execution stops when
    // it reaches the final `leave` (the second-to-last byte of test_code)
    err = uc_emu_start(uc, ADDRESS, ADDRESS + sizeof(test_code) - 2, 0, 0);
    if (err) {
        printf("Failed on uc_emu_start() with error returned %u: %s\n", err,
               uc_strerror(err));
    }

    printf("\n>>> Emulation done.\n");

    uc_close(uc);
}

int main(int argc, char **argv, char **envp)
{
    test_i386();

    return 0;
}

The performance differences are approximately the same as above, or even worse. I benchmark it with "time ./shellcode". Unicorn 2 is compiled with "cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" ." as above.

boborjan2 avatar Jul 04 '24 09:07 boborjan2

The difference comes in when there are memory writes. Consider the following:

sub     esp, 0x10
mov     eax, 0x10000000
mov     dword [esp + 0xc], 1	 // local var on stack
sub     eax, 1
jne     0x405cc9
add     esp, 0x10
ret

And this:

sub     esp, 0x10
mov     eax, 0x10000000
nop
[...] 8 times 
nop
sub     eax, 1
jne     0x405cc9
add     esp, 0x10
ret

The latter sample emulates in approximately the same time (within 5%) on Unicorn 1 and 2. The former takes ~50x(!) longer on Unicorn 2 (current dev branch). Here is the benchmark code:

#include <unicorn/unicorn.h>
#include <stdio.h>
#include <string.h>

/*
0x00405cc1      83ec10             sub     esp, 0x10
0x00405cc4      b800000010         mov     eax, 0x10000000
0x00405cc9      c744240c01000000   mov     dword [esp + 0xc], 1
0x00405cd1      83e801             sub     eax, 1
0x00405cd4      75f3               jne     0x405cc9
0x00405cd6      83c410             add     esp, 0x10
0x00405cd9      c3                 ret
*/

const uint8_t simple_loop_memwrite[] = {
    0x83, 0xec, 0x10,
    0xb8, 0x00, 0x00, 0x00, 0x10, /* loop count 0x10000000 */
    0xc7, 0x44, 0x24, 0x0c, 0x01, 0x00, 0x00, 0x00,
    0x83, 0xe8, 0x01,
    0x75, 0xf3,
    0x83, 0xc4, 0x10,
    0xc3
};

const uint8_t simple_loop_nops[] = {
    0x83, 0xec, 0x10,
    0xb8, 0x00, 0x00, 0x00, 0x80, /* loop count 0x80000000 */
    0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
    0x83, 0xe8, 0x01,
    0x75, 0xf3,
    0x83, 0xc4, 0x10,
    0xc3
};

// memory address where emulation starts
#define ADDRESS 0x1000000
static void test_i386(const uint8_t *code, unsigned code_size)
{
    uc_engine *uc;
    uc_err err;

    int r_esp = ADDRESS + 0x200000; // ESP register

    printf("Emulate i386 code\n");

    // Initialize emulator in X86-32bit mode
    err = uc_open(UC_ARCH_X86, UC_MODE_32, &uc);
    if (err) {
        printf("Failed on uc_open() with error returned: %u\n", err);
        return;
    }

    // map 2MB memory for this emulation
    uc_mem_map(uc, ADDRESS, 2 * 1024 * 1024, UC_PROT_ALL);

    // write machine code to be emulated to memory
    if (uc_mem_write(uc, ADDRESS, code, code_size)) {
        printf("Failed to write emulation code to memory, quit!\n");
        return;
    }
    // initialize machine registers
    uc_reg_write(uc, UC_X86_REG_ESP, &r_esp);

    err = uc_emu_start(uc, ADDRESS, ADDRESS + code_size - 1, 0, 0);
    if (err) {
        printf("Failed on uc_emu_start() with error returned %u: %s\n", err,
               uc_strerror(err));
    }

    printf("\n>>> Emulation done.\n");

    uc_close(uc);
}

int main(int argc, char **argv, char **envp)
{
    //test_i386(simple_loop_nops, sizeof(simple_loop_nops));
    test_i386(simple_loop_memwrite, sizeof(simple_loop_memwrite));

    return 0;
}

boborjan2 avatar Jul 11 '24 09:07 boborjan2

Please check this commit of unicorn-for-efi: https://github.com/intel/unicorn-for-efi/commit/6be64eb2d3a30095782d55daac951ef7cb59dd37 Applying this solves the performance issues described above. In some cases unicorn 2 with this patch is even faster than unicorn 1.

boborjan2 avatar Jul 11 '24 11:07 boborjan2

> Please check this commit of unicorn-for-efi: intel/unicorn-for-efi@6be64eb Applying this solves the performance issues described above. In some cases unicorn 2 with this patch is even faster than unicorn 1.

Thanks for trying it out. Looks like we should backport that.

Link to #1924

wtdcode avatar Jul 11 '24 12:07 wtdcode

Ported in d01035767e47f69b4e545d1843dbaf08e6a74752

wtdcode avatar Sep 21 '24 13:09 wtdcode