ModUtils icon indicating copy to clipboard operation
ModUtils copied to clipboard

Fixed: Edge cases in Trampoline::FindAndAllocateMem

Open Sewer56 opened this issue 3 years ago • 27 comments

  • Allow allocation at end of free pages, such that regions over 2GB in size that are in sufficient proximity can be allocated inside.

  • Allow allocation / start search BEFORE the start of our executable in Virtual Memory, so mapping pages that come earlier is also possible.

This should fix some cases where e.g. loading the .NET Runtime (which reserves >2GB right after image/exe so it can efficiently allocate sequentially) would prevent the Trampoline allocator from functioning.

Sewer56 avatar Oct 25 '22 05:10 Sewer56

It's a solid idea, but I think it can potentially be done easier - I figure it should be possible to split scanning in two steps:

  1. First step stays as-is, scans from the address until it finds an address or hits a 2gb mark. For that, you don't need to scan the end of the free pages.
  2. If the first step fails to find a viable address, start from 0 or addr - 2GB if addr is big enough, and scan only ends of the free pages.

This should do what you want it to do, without the potential hit of a lot of useless scans when always scanning for 0 - if the executable is located gigabytes into the address space, it'll just be scanning all this memory for nothing since it's further away than 2gb anyway.

I'll think of a few tests for this and report back - I never needed them before but I think cases like this are a good excuse to write some.

CookiePLMonster avatar Oct 25 '22 08:10 CookiePLMonster

No complaints there, that would work alright; it would be a little bit more code [a few lines anyway] but would be faster performance wise; especially since there are usually much few page regions after the executable in memory in practice than what is there before it; the idea is pretty sound.

In retrospect I should have set initial address being -2GB in this PR, rather than 0; I guess I didn't think of it as this was the first thing I done after waking up, but I'll leave it to your discretion with what you want to try.

One thing of note of course is that this realistically is a one time cost, subsequent allocation requests will use the already allocated buffers in almost all scenarios in practice; and that will run fast.


Misc:

Allocation placement is also a fun consideration; it's technically possible for there to be a >2GB space of free pages before the EXE. Actually placing it as far as possible from the jump source so it gets allocated in the middle of that free region might not be a bad idea. While it does technically produce fragmentation of address space, it would make it less likely for big allocations to fill up that address area, forcing them instead to go after our binary in virtual memory.

Realistically it's a moot point though [like really, really irrelevant], we're talking really small odds within an edge case here, but the thought randomly came to mind so I typed it out.

Sewer56 avatar Oct 25 '22 09:10 Sewer56

Allocation placement is also a fun consideration; it's technically possible for there to be a >2GB space of free pages before the EXE. Actually placing it as far as possible from the jump source so it gets allocated in the middle of that free region might not be a bad idea. While it does technically produce fragmentation of address space, it would make it less likely for big allocations to fill up that address area, forcing them instead to go after our binary in virtual memory.

That's not something for us to worry about though, right? At least right now I can't imagine any scenarios where fragmenting on purpose is a good thing, especially since as it stands trampolines cannot grow in size by reserving memory that comes directly after the current trampoline; they can only create a new one elsewhere.

EDIT: Be wary of this one caveat of VirtualQuery:

The function returns the attributes and the size of the region of pages with matching attributes, in bytes. For example, if there is a 40 megabyte (MB) region of free memory, and VirtualQuery is called on a page that is 10 MB into the region, the function will obtain a state of MEM_FREE and a size of 30 MB.

If it wasn't for this exact quote, I would have recommended scanning backwards (start from addr and go back by size until you find a match, or exceed 2GB). But with this quote in mind I feel like starting at the -2GB mark might be safer. Again, that's something I need to verify locally on some tests.

CookiePLMonster avatar Oct 25 '22 09:10 CookiePLMonster

That's not something for us to worry about though, right?

Not really, it's too unrealistic, and just a fun thought I had in my head. I've worked on Reloaded.Hooks for quite a long while and have already experimented plenty in this area as I do a very similar thing of allocating 64K buffers and using them for ASM Hooks/Trampolines/Pointers etc. I do have a failsafe of using absolute jumps when it's not possible to jump in range but I've never seen it trigger once in a real program.

In Reloaded-II I find that there is never even a second allocation in practice; and that's with all mods reusing the same through a shared hooking library [it's more efficient, especially with startup time]. We're talking an excess of 2000 trampolines to even get close to a second allocation.

Sewer56 avatar Oct 25 '22 09:10 Sewer56

I do have a failsafe of using absolute jumps when it's not possible to jump in range but I've never seen it trigger once in a real program.

This is a good idea when you're rerouting an entire function, I don't think I ever considered that. However, it's not an option if you replace only a single call of some function.

In Reloaded-II I find that there is never even a second allocation in practice; and that's with all mods reusing the same through a shared hooking library [it's more efficient, especially with startup time]. We're talking an excess of 2000 trampolines to even get close to a second allocation.

One register-independent trampoline is 14 bytes (would've been 12 but I don't want to invalidate rax), trampoline's structure is set up at the very start of the allocation so that comes to (64KB - 12B) / 14 trampolines, for sure enough for any mod but you can theoretically exhaust it if you need to relocate some huge array/data structure to the trampoline area (that's why MakeTrampoline has an optional size parameter).

CookiePLMonster avatar Oct 25 '22 09:10 CookiePLMonster

Any update on this?

CookiePLMonster avatar Nov 23 '22 12:11 CookiePLMonster

Should I make changes based on the conversation in this thread?

I was originally under the impression you wanted to experiment a little while writing a few tests for this.

It's a relatively small change in the end, so whoever does it doesn't really matter.

Sewer56 avatar Nov 23 '22 16:11 Sewer56

Changes as described sound good, I didn't get to writing a dedicated test for this unfortunately.

CookiePLMonster avatar Nov 23 '22 19:11 CookiePLMonster

Any update?

CookiePLMonster avatar Mar 17 '23 18:03 CookiePLMonster

Is this still in limbo?

CookiePLMonster avatar Oct 01 '23 10:10 CookiePLMonster

Seems like it. Every once in a while I think about this PR, and I feel bad about leaving it in limbo. I just have a large backlog of my own, that it'll last me at least another year; and I spend most of my free time doing own code; so it's pretty rough.

I don't actually use ModUtils myself; which means that I usually need to set up project from scratch to test any change (when I submit this PR, I just found an existing random project and played around with that). I originally stumbled across this bug while debugging a closed source mod from @zolika1351 that happened to be using ModUtils. Using the debug symbols they gave me; I narrowed down the issue and wrote a quick hotfix (which is what was PR'd originally here, but it can be better).

I did work on this problem recently-ish though; so I could probably come back to this. I daily drive Linux these days, so Windows install is just for debugging games, very barebones. On a fun note, I actually wrote a cross-platform solution for this purpose that provides concurrent JIT friendly buffers that can be shared by multiple mods at a time (rather than on Windows having every mod allocate 64K just for itself). I don't mind that much; just some setup would be required.

Anyway, gimme a moment to look at this again. Just need to update VS, download some build tools, setup game, etc.

Sewer56 avatar Oct 01 '23 14:10 Sewer56

After reviewing the text here; I simply changed the starting address to max(0, addr - 2GB).

While I can actually scan forward, then backwards if needed (even if VirtualQuery returns an entire region, not just starting from queried address); the catch is, most likely it'll not lead to a shorter interval between starting a scan and finding a free section.

Two reasons come to mind:

  • Additional modules are typically loaded AFTER the main module within the address space, fragmenting that region. (Even in presence of ASLR)

    • It's possible for these modules to be loaded sequentially in the virtual address space; meaning there's no gaps between them, and you may run into multiple regions that are used before you can find a free one. (Note: Not very common to be directly after main module; but is possible)
    • These module loads can be made by the PE Loader, i.e. DLLs from the Import table. Therefore, happen before mod logic typically runs.
  • When you start at -2GB, you are statistically more likely to start at a page that's already free.

    • There's typically less used memory regions below the main module (mostly thread stacks, and memory mapped files).
    • Chances are the address at -2GB is already free, while one at +0 is guaranteed used.

Sewer56 avatar Oct 01 '23 15:10 Sewer56

Looks good! With this in mind, do you think the checks on alignedAddrEnd still neded?

I don't have a test suite for ModUtils so I'll have to verify this by recompiling one of my 64-bit patches with this PR. I'll get back to you shortly.

CookiePLMonster avatar Oct 01 '23 16:10 CookiePLMonster

do you think the checks on alignedAddrEnd still needed?

It's needed because it's theoretically possible for a memory region that is

  • Below addr
  • Starts further away than -2GiB
  • Ends closer than 2GiB

In that specific edge case, a valid, free region may be omitted otherwise.

Very often during startup, such a region exists right below the main module (though it may end up fragmented/split by the time a mod runs), and is bigger than 2GiB, so this is pretty possible.

Sewer56 avatar Oct 01 '23 16:10 Sewer56

Thinking about it, code can be changed to do 2 scans from -2GB to +0 and +0 to +2GB. It would be faster, because we only in theory need to check start of buffer for addresses above, and end of buffer for addresses below. This saves a single branch and roundup/rounddown operation per page, so I'm guessing somewhere in the range of 20 instructions per typical loop iteration.

The performance difference would probably be negligible in practice though, while the entire loop would have to be cloned, increasing code size. Normally I'd do it anyway; because I have a perfectionist approach, but with lack of automated tests, it's a bit more scary to do these things in general, as doing these sorts of things can be error prone. This one's pretty trivial so I don't mind doing it though if you think it'd be beneficial to save a few nanoseconds.

Sewer56 avatar Oct 01 '23 16:10 Sewer56

I would be willing to accept whatever makes the code easier to reason about - regardless of which produces bigger or smaller assembly.

CookiePLMonster avatar Oct 01 '23 17:10 CookiePLMonster

I added documentation into the source explaining why both start and end of a region are checked. I think it should be good to merge now.

Sewer56 avatar Oct 01 '23 18:10 Sewer56

Looks good, just one nitpick: in my code, rounding down is already done via alignedAddr = (alignedAddr + granularity - 1) & ~uintptr_t(granularity - 1); while you introduced another method to do the same. It would be good to unify both.

I'll merge this once I've verified that my 64-bit patches don't break. Ideally I'd have tests to verify that but it'll be a while before I can spend time on something like this.

CookiePLMonster avatar Oct 01 '23 18:10 CookiePLMonster