
Unicorn on Windows takes 1GB of RAM when just instantiating an Emulator and registering a hook


Hello! So, I just create an instance and register a hook, without actually mapping any memory or executing a single opcode, and it's already +1 GB of RAM on Windows. ~~With roughly the same code in C it's only 11 MB, so I think the problem could be somewhere in the Rust code.~~ UPD: it's actually not related to the Rust bindings.

use unicorn_engine::unicorn_const as ucc;
use unicorn_engine::Unicorn;

fn eat_1gb() {
    let mut emu = match Unicorn::new(ucc::Arch::X86, ucc::Mode::MODE_64) {
        Ok(emu) => emu,
        Err(_) => {
            println!("Unable to create unicorn instance");
            return;
        }
    };
    // Register a hook over the whole address space; no memory is mapped
    // and not a single opcode is executed.
    let hook = emu
        .add_mem_hook(
            ucc::HookType::MEM_UNMAPPED,
            0,
            u64::MAX,
            |_uc, _access, _addr, _size, _value| true,
        )
        .unwrap();
    std::thread::sleep(std::time::Duration::from_secs(1));
    println!("1GB allocated");

    emu.remove_hook(hook).unwrap();
}

fn main() {
    for i in 0..30 {
        // The emulator is dropped when eat_1gb() returns, freeing the memory.
        eat_1gb();
        println!("Iteration {}, check ram usage...", i);
        std::thread::sleep(std::time::Duration::from_secs(1));
        println!("1GB freed");
    }
}

The latest version is used:

[dependencies]
unicorn-engine = "2.0.0"

(attached: screen recording vmconnect_0aEYJovtJI showing the memory usage)

expend20 · Sep 11 '22

Tested with the C API and it reproduces on 2.0.0 (commit hash 6c1cbef6ac505d355033aef1176b684d02e1eb3a). It looks like there is a gigantic 1 GB RWX region allocated.

mrexodia · Sep 11 '22

Oh, sorry for that, actually not a bindings issue. Let me rename the issue then.

expend20 · Sep 11 '22

This is the TCG buffer. Look at qemu/accel/tcg/translate-all.c

Not sure if this is a real issue, because the memory is only allocated and not used (not sure how Windows behaves in this case).

PhilippTakacs · Sep 23 '22
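
For context, the allocation in question is the TCG code generation buffer; on Windows it is created roughly like this (a simplified sketch based on qemu/accel/tcg/translate-all.c, not the literal source):

/* Simplified sketch of the Windows TCG buffer allocation
 * (see qemu/accel/tcg/translate-all.c; illustrative only). */
#include <windows.h>

static void *alloc_code_gen_buffer(size_t size)
{
    /* MEM_RESERVE | MEM_COMMIT charges the whole buffer up front,
     * which is the ~1 GB jump reported in this issue. */
    return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT,
                        PAGE_EXECUTE_READWRITE);
}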

Yes, this is expected since it's the TCG buffer. On Windows, IIRC, the pages are allocated on demand. Meaning, even if you start several unicorn instances and allocate a few GB of memory, your machine won't really run out of physical memory.

wtdcode · Sep 25 '22

This is kind of true, but not exactly. You can reserve pages and then it’s guaranteed to not use memory.

mrexodia · Sep 25 '22

> This is kind of true, but not exactly. You can reserve pages and then it's guaranteed to not use memory.

I haven't played with VirtualAlloc for a very long time, but we indeed pass MEM_RESERVE, which I think should be enough?

wtdcode · Sep 25 '22

I’ll confirm, but that’s not what it looked like in Process Hacker…

mrexodia · Sep 25 '22

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] · Nov 25 '22

Not stale

mrexodia · Nov 25 '22

I also ran into this and looked into it a bit. The assumption that Windows will only reserve and not allocate is not true: the flags passed to VirtualAlloc are MEM_RESERVE and MEM_COMMIT, so the memory is definitely allocated. I hit this because I wanted to emulate/simulate multiple threads by having multiple instances, and 32 threads means it's eating 32 GiB. It might be a good idea to let the user specify the buffer size. I'd be willing to contribute this change, but I'm uncertain which code I can modify safely without diverging too much from Qemu.

ZehMatt · Dec 06 '22
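
To make the reserve/commit distinction concrete, a minimal standalone sketch (editor's illustration, not code from Unicorn or QEMU):

/* Minimal illustration of MEM_RESERVE vs MEM_COMMIT (not Unicorn code). */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reserving only claims address space: nothing is charged against
     * the commit limit, and touching this range would access-violate. */
    char *reserved = VirtualAlloc(NULL, 1ull << 30, MEM_RESERVE, PAGE_NOACCESS);

    /* Reserve + commit charges the full 1 GB against the commit limit
     * immediately (visible as private bytes in Process Hacker), even
     * though physical pages are only faulted in on first access. */
    char *committed = VirtualAlloc(NULL, 1ull << 30,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    printf("reserved=%p committed=%p\n", (void *)reserved, (void *)committed);
    return 0;
}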

> I also ran into this and looked into it a bit. The assumption that Windows will only reserve and not allocate is not true: the flags passed to VirtualAlloc are MEM_RESERVE and MEM_COMMIT, so the memory is definitely allocated. I hit this because I wanted to emulate/simulate multiple threads by having multiple instances, and 32 threads means it's eating 32 GiB. It might be a good idea to let the user specify the buffer size. I'd be willing to contribute this change, but I'm uncertain which code I can modify safely without diverging too much from Qemu.

If so, what are the correct flags here?

wtdcode · Dec 06 '22

There isn’t really a flag that does this. You could basically MEM_RESERVE a range and then register a vectored exception handler that MEM_COMMITs the ranges that you access.

This obviously only works if you don’t do stuff like memset the whole range though…

mrexodia · Dec 06 '22

> There isn't really a flag that does this. You could basically MEM_RESERVE a range and then register a vectored exception handler that MEM_COMMITs the ranges that you access.

> This obviously only works if you don't do stuff like memset the whole range though…

Oh I see, I could get a fix for that.

wtdcode · Dec 06 '22

I got a fix for this; see this for some explanation and caveats.

With this fix, each instance will initially take 512 KB of memory and grow its usage on demand. I will keep this issue open until the next release for possible feedback.

wtdcode · Jan 28 '23

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] · Mar 30 '23

Not stale 😊

vrubleg · Mar 30 '23

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] · May 30 '23

Still worth keeping it 😊

vrubleg · May 30 '23

Still not receiving any more feedback, xd

wtdcode · May 30 '23

> There isn't really a flag that does this. You could basically MEM_RESERVE a range and then register a vectored exception handler that MEM_COMMITs the ranges that you access.

This would be a solution. I don't have much time to look into the code at the moment, but let me know if you have any questions about how to do this…

mrexodia · May 30 '23

> There isn't really a flag that does this. You could basically MEM_RESERVE a range and then register a vectored exception handler that MEM_COMMITs the ranges that you access.

> This would be a solution. I don't have much time to look into the code at the moment, but let me know if you have any questions about how to do this…

I once tried this but eventually gave up. IIRC, it was because we don't have a good place to put the big try-catch.

wtdcode · May 30 '23

You could use AddVectoredExceptionHandler to register an exception handler, something like this:

#include <windows.h>

// TODO: these have to be set during initialization
char* jitSectionPtr;
ULONG_PTR jitSectionSize;

static LONG CALLBACK MyHandler(EXCEPTION_POINTERS *ExceptionInfo) {
  auto record = ExceptionInfo->ExceptionRecord;
  if(record->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
    auto address = (char*)record->ExceptionInformation[1];
    if(address >= jitSectionPtr && address < jitSectionPtr + jitSectionSize) {
      // Commit the faulting page; VirtualAlloc rounds the base down to a
      // page boundary, so passing the raw address is fine.
      VirtualAlloc(address, 1, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
      return EXCEPTION_CONTINUE_EXECUTION;
    }
  }
  return EXCEPTION_CONTINUE_SEARCH;
}

void initialize() {
  AddVectoredExceptionHandler(0, MyHandler);
}

On 64-bit targets you can reserve an arbitrary size; on 32-bit the address space is limited to 2-4 GB, so this solution wouldn't improve anything there.

mrexodia · May 30 '23
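
To fill in the first TODO above: initialization would reserve (but not commit) the whole range before initialize() registers the handler. A sketch using the placeholder names from the snippet:

// Illustrative only: reserve the full buffer up front, commit nothing yet;
// MyHandler then commits pages lazily as they are touched.
void initializeJitSection(ULONG_PTR size) {
  jitSectionSize = size;
  jitSectionPtr = (char*)VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
}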

I see and I will have a look.

wtdcode · May 30 '23

> AddVectoredExceptionHandler

I finally recall why I gave up on this approach: we need some mechanism to generate a separate handler for every unicorn instance, i.e. we need closures, because we need to wrap every distinct uc object, or we might wrongly commit another instance's memory.

A possible workaround is to share the same handler across all instances and commit the memory anyway, but that might make things worse(?)

wtdcode · Jun 10 '23

I would say you either share the whole RWX section between all instances, in which case you can just commit on access whenever the address is in the range.

Alternatively, you would have a range per instance, so it's a matter of saving them in a global and iterating over all instances to check the range.

mrexodia · Jun 10 '23
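
The per-instance-range variant could look something like the sketch below (editor's illustration; the global table and names are not Unicorn's actual code):

/* One shared vectored handler plus a global table recording each
 * instance's reserved buffer. Registration would need locking in
 * real code; omitted here for brevity. */
#include <windows.h>

typedef struct { char *base; SIZE_T size; } JitRange;

static JitRange g_ranges[64]; /* one slot per live instance */
static LONG g_count;

static LONG CALLBACK SharedHandler(EXCEPTION_POINTERS *info)
{
    EXCEPTION_RECORD *rec = info->ExceptionRecord;
    if (rec->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return EXCEPTION_CONTINUE_SEARCH;

    char *addr = (char *)rec->ExceptionInformation[1];
    for (LONG i = 0; i < g_count; i++) {
        JitRange *r = &g_ranges[i];
        if (addr >= r->base && addr < r->base + r->size) {
            /* Commit just the page containing the fault and retry. */
            VirtualAlloc(addr, 1, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}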

> I would say you either share the whole RWX section between all instances, in which case you can just commit on access whenever the address is in the range.

> Alternatively, you would have a range per instance, so it's a matter of saving them in a global and iterating over all instances to check the range.

Both your solutions require a place to record global information across all instances, which breaks a few of our assumptions, especially ones some bindings rely on. Another option is a simple closure implementation, either by introducing libffi, which is ubiquitous, or by implementing a simple one ourselves. I will investigate a bit more; thanks for your help!

wtdcode · Jun 10 '23

I don't see how this relates to the bindings. You cannot register an exception handler with state (e.g. a closure). They are process-wide, so if you want to use them you will need to store some global state to get back to the uc instance for that memory range. The alternative would be to properly implement this in qemu, but that is unlikely to be easier.

mrexodia · Jun 10 '23

> you will need to store some global state to get back to the uc instance

That's one of the ways closures work, no?

wtdcode · Jun 10 '23

I implemented demand paging via SEH handlers and a naive closure trampoline here: https://github.com/unicorn-engine/unicorn/commit/3d5b2643f0af742d9b90b4511d0ee137775c8526#diff-842456abe9564ae1e7d75ab8f322be6c27ca3c512e445a18e5898dea68ad9799R872 Let's see what CI says, though everything works on my machine.

Looking forward to your feedback!

wtdcode · Jun 10 '23
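
For readers wondering what a "naive closure trampoline" is: the general trick is to emit a tiny executable stub per instance that bakes in a context pointer and forwards to a shared function. A rough x64 sketch of the idea (not the code from the linked commit):

/* A per-instance stub that loads a context pointer into rdx (the second
 * Win64 argument) and tail-jumps to a shared two-argument handler. */
#include <windows.h>
#include <string.h>

typedef LONG (CALLBACK *VehFn)(EXCEPTION_POINTERS *);

static LONG CALLBACK handler_with_ctx(EXCEPTION_POINTERS *info, void *ctx)
{
    (void)info; (void)ctx;
    return EXCEPTION_CONTINUE_SEARCH; /* per-instance logic goes here */
}

static VehFn make_trampoline(void *ctx)
{
    unsigned char stub[] = {
        0x48, 0xBA, 0,0,0,0,0,0,0,0, /* mov rdx, ctx              */
        0x48, 0xB8, 0,0,0,0,0,0,0,0, /* mov rax, handler_with_ctx */
        0xFF, 0xE0                   /* jmp rax                   */
    };
    void *target = (void *)handler_with_ctx;
    memcpy(stub + 2, &ctx, sizeof(ctx));
    memcpy(stub + 12, &target, sizeof(target));

    /* rcx (the EXCEPTION_POINTERS argument) passes through untouched,
     * so the stub itself is usable as a vectored exception handler. */
    void *buf = VirtualAlloc(NULL, sizeof(stub), MEM_RESERVE | MEM_COMMIT,
                             PAGE_EXECUTE_READWRITE);
    memcpy(buf, stub, sizeof(stub));
    return (VehFn)buf;
}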

All Windows CI passed, and this solution doesn't involve any bad hacks, so I think this issue can be closed.

Ping me if there is any bug.

wtdcode · Jun 10 '23