SIMDe

Open AndreyTykhonov opened this issue 5 years ago • 42 comments

https://github.com/simd-everywhere/simde

This project looks promising. I tried to add mm_dp_ps support (to fix SSE 4.1 in Cyberpunk), but it failed to compile after that. Maybe you will be interested.

AndreyTykhonov avatar Dec 15 '20 14:12 AndreyTykhonov

DPPS should already be implemented, maybe you are hitting https://github.com/simd-everywhere/simde/issues/648. As for popcnt specifically, they chose not to include it in the project (since I guess it's not technically a SIMD instruction?)

Anyway, duh, shit. I can't believe this flew under my radar this spring when I was looking for alternatives to SSEPlus.

mirh avatar Dec 15 '20 16:12 mirh

@AndreyTykhonov I am interested in hearing more about your SIMDe failure to compile; can you share more details?

mr-c avatar Dec 15 '20 17:12 mr-c

I was at the very beginning. I primarily use C#, so C++ is something crazy for me. I compiled the sample Pin 3.14 project and added an instruction watch like in this code, but when I try to include the SSE 4.1 header I get a crazy amount of errors related to Pin modules (even without any simde calls or variables, just from the include!). I tested simde in a console app and it works perfectly, but with Pin something crazy is going on. So I decided that I shouldn't waste my time learning C++ to fix Cyberpunk instead of the developers, and removed the solution along with the game, lol. But I thought it could be handy for someone, so I posted the information about simde here.

Before this I compiled a project with simde mm_dp_ps, looked at the asm code in the debugger, and injected into the game a jmp to new memory holding the asm code from the compiled exe, lol. I even got it to the Cyberpunk logos, but it was too crazy, so I stopped this research.

AndreyTykhonov avatar Dec 15 '20 18:12 AndreyTykhonov

Pin can only include 3 very specific headers. ... and it already emulates everything up to even AVX512 I think.

If you are trying to extend the new icudt.dll that's a completely different approach.

mirh avatar Dec 15 '20 20:12 mirh

@AndreyTykhonov thanks! Looks very promising! Actually I've been looking for implementations since I crashed into those AVX instructions after the prologue. Pintool does implement everything, but it doesn't allow using its implementations freely.

If you wish to add support for it, you will have to include SIMDe headers, add another HOTFIX macro like this:

#define HOTFIX_DPPS(offset, a, b, imm8, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		(a) = simde_mm_dp_ps((a), (b), (imm8)); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

Refactor the calls like:

HOTFIX_POP(0x045AD8D, ctx->Rax, ctx->Rcx, 5);
HOTFIX_DPPS(0xsomething, a, b, imm8, size);

And the trickiest part - find all those usages of dpps in the game and describe them with HOTFIX_DPPS. You can actually use hotpatch.log, which is being written by this library after every "unknown instruction" crash, but depending on the number of calls to these instructions it can be too tedious.

I used a modified version of instruction_hook tool to automate that for popcnt, will share it later.
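
With hundreds of sites, one option is to keep them in a table looked up by offset rather than in one macro line per site. The following is only a sketch of that idea: PatchSite, Kind and find_site are hypothetical names that are not part of popcnt_hotpatch, and the second table entry is an invented offset used purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one instruction that has to be emulated.
enum class Kind { Popcnt, Dpps };

struct PatchSite {
    std::uint64_t offset;  // offset from the image base
    Kind kind;             // which emulation routine to run
    std::uint8_t length;   // instruction size in bytes, added to RIP afterwards
};

// The table must stay sorted by offset so that binary search works.
// The first entry mirrors the HOTFIX_POP call above; the second is invented.
inline const std::vector<PatchSite>& patch_sites() {
    static const std::vector<PatchSite> sites = {
        {0x045AD8D, Kind::Popcnt, 5},
        {0x1000000, Kind::Dpps, 6},
    };
    return sites;
}

// Returns the site for a faulting RIP, or nullptr if the address is unknown.
inline const PatchSite* find_site(std::uint64_t rip, std::uint64_t image_base) {
    const std::vector<PatchSite>& sites = patch_sites();
    assert(std::is_sorted(sites.begin(), sites.end(),
                          [](const PatchSite& a, const PatchSite& b) {
                              return a.offset < b.offset;
                          }));
    const std::uint64_t rva = rip - image_base;
    const auto it = std::lower_bound(
        sites.begin(), sites.end(), rva,
        [](const PatchSite& s, std::uint64_t value) { return s.offset < value; });
    return (it != sites.end() && it->offset == rva) ? &*it : nullptr;
}
```

The exception handler would then switch on the returned kind and advance RIP by length, which keeps the lookup cost logarithmic in the number of sites.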

ogurets avatar Dec 16 '20 09:12 ogurets

Wow! Thanks! Did you fix project compilation with the simde headers? I'm not too good at C++; after including the header I got hundreds of errors :D If you can attach a project with the simde header connected that compiles, I would be grateful!

And about offsets: I already got all SSE 4.1 & SSE 4.2 instruction offsets for version 1.04. Here are my results, maybe you will find a use for them (beware - there are starting offsets like Cyberpunk2077.AK::WriteBytesMem::Count, but I can recreate the file with only the Cyberpunk2077 reference as the start point): sseInstructions.json.txt

Actually, I tried to fix some instructions with assembler, so this is what I fixed:

  • No fixes: Game crashed at start
  • dpps: Game crash at first logo
  • pminuw: Game crash at second logo
  • pmaxuw: Still at second logo crash
  • ptest: Game actually got to intro video, but after that part it stuck with high CPU usage and watchdog timeout

My asm code contained errors, so some of the methods returned wrong values; I think that is the cause of the freeze and the crash. But it looks like these are all the methods needed to get to the menu. Anyway, here is the full list of SSE functions that the game used (if I didn't forget something):

  • dpps
  • popcnt
  • pmulld
  • pminsd
  • pmaxuw
  • pmuldq
  • blendvps
  • pcmpistri
  • blendps
  • pminuw
  • pmovsxwd
  • packusdw
  • pabsd
  • roundps
  • ptest
  • pinsrb
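
When a hand-written replacement returns wrong values, comparing it against a plain scalar reference can help narrow the problem down. Below is a minimal sketch for pminuw and pmaxuw, which compare eight 16-bit lanes as unsigned values; treating the lanes as signed is an easy mistake to make. The function names are hypothetical, and the registers are passed as raw 16-byte buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Applies op lane by lane to two registers given as raw 16-byte buffers and
// stores the result in dst, matching the two-operand SSE form "op dst, src".
template <typename Op>
void lanewise_u16(void* dst, const void* src, Op op) {
    assert(dst != nullptr && src != nullptr);
    std::uint16_t a[8];
    std::uint16_t b[8];
    std::memcpy(a, dst, sizeof a);
    std::memcpy(b, src, sizeof b);
    for (int i = 0; i < 8; ++i) a[i] = op(a[i], b[i]);
    std::memcpy(dst, a, sizeof a);
}

// Scalar reference for pmaxuw: unsigned maximum of each 16-bit lane.
inline void ref_pmaxuw(void* dst, const void* src) {
    lanewise_u16(dst, src, [](std::uint16_t x, std::uint16_t y) {
        return std::max(x, y);
    });
}

// Scalar reference for pminuw: unsigned minimum of each 16-bit lane.
inline void ref_pminuw(void* dst, const void* src) {
    lanewise_u16(dst, src, [](std::uint16_t x, std::uint16_t y) {
        return std::min(x, y);
    });
}
```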

AndreyTykhonov avatar Dec 16 '20 11:12 AndreyTykhonov

So glad to see this thread, because I was doing exactly what is described here: I've found all SSE 4.1 and SSE 4.2 function calls and was emulating them with SIMDe. I'm stuck earlier on the path though: I'm experimenting with DPPS emulation, and after the first emulated DPPS call I get "Access violation exception when trying to access 0x00000000000". It seems that DPPS returns wrong values to the registers.

My best guess is that after calling ExceptionHandler values of the registers are restored to the stacked ones and I was trying to resolve that. However according to comments here - I might be wrong.

SIMDe was successfully included into the project without any errors (few warnings), however I was using only popcnt_hotpatch project.

@AndreyTykhonov if this is not the case for you - let's investigate. For me it was very easy - I've extracted source of SIMDe into a subfolder near the popcnt_hotpatch project and used relative paths to make inclusions.


#define HOTFIX_DPPS(offset, a, b, imm8, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		(a) = simde_mm_dp_ps((a), (b), (imm8)); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

As for this code - the issue is that variables a and b should be XMM registers. They come within the exception context as _M128A structure, so appropriate casting needs to be made. Unless I'm missing anything.

I've ended up with something like:

DPPS(0x03A2C33, ctx->Xmm0, ctx->Xmm4, 0x7F, 6);

#define DPPS(offset, dest, src, mask, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		__m128 register1 = _mm_load_ps((float*) &dest); \
		__m128 register2 = _mm_load_ps((float*) &src); \
		dest = simde_mm_dp_ps(register1, register2, mask); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}

And after the first call I'm getting "Access violation exception".

With SDE Cyberpunk works. I'm working with 1.06 binary.

Would be glad to get deeper into this. Suggestions?

UPD. Actual code for DPPS emulation is a bit different than above:

/* dest and src are M128A fields of the exception context (e.g. ctx->Xmm0); mask must be a compile-time constant. */
#define DPPS(offset, dest, src, mask, instr_size) \
	if (rip == g_imageBase + (offset)) { \
		simde__m128 register1 = simde_mm_load_ps((simde_float32 *) &(dest)); \
		simde__m128 register2 = simde_mm_load_ps((simde_float32 *) &(src)); \
		simde__m128 register_dest = simde_mm_dp_ps(register1, register2, (mask)); \
		simde_mm_store_ps((simde_float32 *) &(dest), register_dest); \
		ctx->Rip += (instr_size); \
		return EXCEPTION_CONTINUE_EXECUTION; \
	}
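
To check what such a macro should leave in the destination register, a scalar version of dpps can serve as a reference. This is a sketch under assumptions, not the code used in this thread: Xmm is a stand-in for the Windows M128A type so that the example compiles anywhere, and make_xmm, lane and ref_dpps are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the Windows M128A type (two 64-bit halves).
struct alignas(16) Xmm {
    std::uint64_t low;
    std::int64_t high;
};

// Packs four floats into a register image; lane 0 is the lowest 32 bits.
inline Xmm make_xmm(float f0, float f1, float f2, float f3) {
    const float lanes[4] = {f0, f1, f2, f3};
    Xmm x;
    static_assert(sizeof lanes == sizeof x, "register must hold four floats");
    std::memcpy(&x, lanes, sizeof x);
    return x;
}

// Reads one float lane back out of a register image.
inline float lane(const Xmm& x, int index) {
    assert(index >= 0 && index < 4);
    float lanes[4];
    std::memcpy(lanes, &x, sizeof lanes);
    return lanes[index];
}

// Scalar reference for "dpps dest, src, imm8".
// imm8 bits 4..7 select which lane products enter the sum,
// imm8 bits 0..3 select which destination lanes receive it (others become 0).
inline void ref_dpps(Xmm& dest, const Xmm& src, std::uint8_t imm8) {
    float p[4];
    for (int i = 0; i < 4; ++i) {
        p[i] = (imm8 & (0x10 << i)) ? lane(dest, i) * lane(src, i) : 0.0f;
    }
    // The products are added pairwise, as in the instruction's description.
    const float sum = (p[0] + p[1]) + (p[2] + p[3]);
    float r[4];
    for (int i = 0; i < 4; ++i) r[i] = (imm8 & (1 << i)) ? sum : 0.0f;
    std::memcpy(&dest, r, sizeof r);
}
```

With the mask 0x7F used above, the first three lanes are multiplied and summed, and the sum is written to all four lanes of the destination.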

EvgeniySpinov avatar Jan 25 '21 00:01 EvgeniySpinov

I've made some progress with SIMDe, no more "Access violation issue".

SIMDe seems pretty effective and generates like 2-7 ASM lines instead of single DPPS call.

Now I need to build sseInstructions.json.txt, which is kind of tricky. I have the full list of instructions and their offsets, but I do not have the lengths of the instructions. I've used IDA Pro to get those, and I can get instruction lengths one by one. But there are 1739 matches for SSE 4.1 and SSE 4.2 instructions, so going through hotpatch.log manually is not an option. Neither is manually going through the IDA search results.

The list of instructions provided by @AndreyTykhonov is different for Cyberpunk 1.06. It also has: lea, pminsd, pmaxsd, vpmovsxwd, vpmulld, vpblendw, vroundps

But it does not have: pminsd, pcmpistri, pabsd, pinsrb

Can someone help me either with parsing the IDA results from the search occurrences window, or with offsets and instructions plus their lengths?

EvgeniySpinov avatar Jan 26 '21 23:01 EvgeniySpinov

@EvgeniySpinov glad to see progress on this!

My list didn't contain AVX instructions since I'm not parsing them. 1.1 contains AVX instructions too, even though the developer said they removed them (maybe unused code?). Here are the instructions from version 1.1 (AVX not parsed), hope it will help you.

1851 SSE 4.1 / 4.2 instructions:

  • dpps (1498)
  • popcnt (123)
  • pmulld (76)
  • pminsd (30)
  • pmaxsd (30)
  • pmaxuw (14)
  • pmuldq (12)
  • blendvps (11)
  • pcmpistri (10)
  • blendps (8)
  • pminuw (8)
  • pmovsxwd (8)
  • packusdw (7)
  • pabsd (7)
  • roundps (6)
  • ptest (1)
  • pinsrb (1)
  • pshufb (1)

11_sseInstructions.json.txt

AndreyTykhonov avatar Jan 27 '21 06:01 AndreyTykhonov

Looks like exactly what is needed! Thank you for sharing. I do not have 1.1 though, but will get an update.

Question meanwhile: could you please share how you generate this? I'm really curious about the approach and would like to use it for other projects as well.

Also a question: in the call dpps xmm1,[rsp+20],7F, is that "20" a hexadecimal or a decimal value? My assumption is that it is hexadecimal.

And one more question about file contents. Some of the offsets are calculated from functions like "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF". Is there a way to get absolute offset for all instructions?

EvgeniySpinov avatar Jan 27 '21 08:01 EvgeniySpinov

@EvgeniySpinov I can regenerate the list without "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF" if you need, with just offsets from the exe base position. All values are hexadecimal.

My steps to generate list:

  1. Creating suspended game with PHacker
  2. Using Cheat Engine (CE next) looking for memory regions with executable flag
  3. In CE disassembler saving asm output as txt file, setting length based on memory regions
  4. Using my own parser to translate CE txt output to json
  5. Parsing json to remove all non-SSE instructions
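
The last step can be sketched as a filter on the mnemonic of each disassembled line. This is not the parser used above, only an illustration: it assumes each entry is the bare instruction text as it appears in the "Asm" field of the JSON (for example popcnt rax,rcx), and is_listed_mnemonic and keep_listed are hypothetical names. The mnemonic set is the one listed for version 1.1 in this thread.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// True if the first token of the instruction text is one of the mnemonics
// listed in this thread.
inline bool is_listed_mnemonic(const std::string& asm_text) {
    static const std::set<std::string> mnemonics = {
        "dpps",     "popcnt",   "pmulld",  "pminsd",   "pmaxsd",   "pmaxuw",
        "pmuldq",   "blendvps", "pcmpistri", "blendps", "pminuw",  "pmovsxwd",
        "packusdw", "pabsd",    "roundps", "ptest",    "pinsrb",   "pshufb"};
    const std::string::size_type start = asm_text.find_first_not_of(" \t");
    if (start == std::string::npos) return false;
    const std::string::size_type end = asm_text.find_first_of(" \t", start);
    const std::string mnemonic = asm_text.substr(
        start, end == std::string::npos ? std::string::npos : end - start);
    return mnemonics.count(mnemonic) != 0;
}

// Keeps only the lines whose mnemonic is in the set, preserving their order.
inline std::vector<std::string> keep_listed(const std::vector<std::string>& lines) {
    std::vector<std::string> kept;
    for (const std::string& line : lines) {
        if (is_listed_mnemonic(line)) kept.push_back(line);
    }
    assert(kept.size() <= lines.size());
    return kept;
}
```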

AndreyTykhonov avatar Jan 27 '21 10:01 AndreyTykhonov

Right, that doesn't look like something I'll be able to quickly reproduce :)

I have some progress with IDA script, but I propose to unite our effort. Could you please regenerate file with absolute offset positions?

Meanwhile I'll try to write a wrapper for JSON to translate those instructions into HOTFIX calls in C++ and implementing them with SIMDe. If that would work - then we can look into details of getting list of calls+offsets in more automated way.

EvgeniySpinov avatar Jan 27 '21 11:01 EvgeniySpinov

Not that a general fix for AVX would hurt, but anyway my dudes wasn't that already fixed in patch 1.05 for cyberpunk?

mirh avatar Jan 27 '21 13:01 mirh

@EvgeniySpinov sseInstructions.json.txt

AndreyTykhonov avatar Jan 27 '21 13:01 AndreyTykhonov

Not that a general fix for AVX would hurt, but anyway my dudes wasn't that already fixed in patch 1.05 for cyberpunk?

My understanding is that it was - I was able to play on my Phenom II X6 1090T, which doesn't have AVX, with only the SSE 4.x patches. AVX was removed after a shitstorm on the CDPR forums from people with server Xeons, which do not have AVX either.

EvgeniySpinov avatar Jan 27 '21 18:01 EvgeniySpinov

Spent some time today moving forward on this one.

Some of the instructions are represented in weird way. For example: {"Offset":"Cyberpunk2077.exe+2BF36D3","Asm":"dpps xmm4,[7FF624208440],7F","Length":10}

Is that the address from which the dpps float operand should be read?

IDA reports on this address: dpps xmm4, cs:xmmword_142EF8440, 7F

Also this one: pmaxuw xmm2,[r8+rax*2]

In IDA: pmaxuw xmm2, xmmword ptr [r8+rax*2]

Does anyone know how to fetch the second operand's value in C++ code from the exception?

(without these instruction calls - I can get through few logos, apparently while game is loading the rest of the stuff. Emulated only DPPS for now)
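
One way to read these operands: the Cheat Engine listing appears to show the absolute address in the running process, while IDA shows the same location relative to its own load base. Assuming IDA loaded the image at 0x140000000 (the usual default for 64-bit executables, which is an assumption here), the operand sits at offset 0x2EF8440 from the image base. For an operand like [r8+rax*2], the address would be computed from the register values saved in the exception context, and 16 bytes read from there. The helpers below are a hypothetical sketch of both calculations.

```cpp
#include <cassert>
#include <cstdint>

// Converts an address from one view of the image (for example a disassembler
// database) to the same location in another (for example the live process).
inline std::uint64_t rebase(std::uint64_t address, std::uint64_t from_base,
                            std::uint64_t to_base) {
    assert(address >= from_base);
    return address - from_base + to_base;
}

// Effective address of a memory operand of the form [base + index*scale + disp].
// base and index are the register values taken from the saved context.
inline std::uint64_t effective_address(std::uint64_t base, std::uint64_t index,
                                       unsigned scale, std::int32_t disp) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const std::uint64_t offset =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
    return base + index * scale + offset;
}
```

For pmaxuw xmm2,[r8+rax*2] this would be effective_address(ctx->R8, ctx->Rax, 2, 0), with ctx being the Windows CONTEXT record of the exception.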

EvgeniySpinov avatar Jan 31 '21 22:01 EvgeniySpinov

Ok, I've progressed through:

  • Emulated all SSE4.1, 4.2 calls with SIMDe (some of the calls in the file are SSE3 actually, like pabsd or pshufb, I've skipped them)
  • Created parser for JSON provided by @AndreyTykhonov. As result I'm getting list of calls to be put into the source code
  • The instructions mentioned above are jumped over - I haven't found any way to process them yet. Please share if you have any ideas
  • When I start Cyberpunk2077 in debug mode of MSVS - game launches and I can get to the menu (there is some issue with the fonts though). Due to debug mode - I get super low performance (0-1 fps with 20-25% CPU load), so I didn't even try to load the save game.
  • When I start game normally - it just silently crashes. No error messages, nothing, just instant crash without appearing in Task manager even.
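
As an illustration of the second point, a generator for the register-to-register dpps case could look like the sketch below. It is not the parser used here: Record, xmm_field and make_dpps_call are hypothetical names, the record fields follow the JSON entries quoted earlier, and operands that are not plain XMM registers (such as memory operands) simply produce an empty string.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// One parsed entry of the JSON list, for example
// {"Offset":"Cyberpunk2077.exe+2BF36D3","Asm":"dpps xmm4,[7FF624208440],7F","Length":10}
struct Record {
    std::string offset;    // "Module+HEX"
    std::string asm_text;  // instruction text
    int length;            // instruction size in bytes
};

// "xmm4" -> "ctx->Xmm4"; returns "" for anything that is not an XMM register.
inline std::string xmm_field(const std::string& operand) {
    if (operand.size() < 4 || operand.compare(0, 3, "xmm") != 0) return "";
    const std::string digits = operand.substr(3);
    if (digits.find_first_not_of("0123456789") != std::string::npos) return "";
    return "ctx->Xmm" + digits;
}

// Builds a "DPPS(offset, dest, src, mask, size);" line, or "" if unsupported.
inline std::string make_dpps_call(const Record& rec) {
    assert(rec.length > 0);
    const std::string prefix = "dpps ";
    if (rec.asm_text.compare(0, prefix.size(), prefix) != 0) return "";
    const std::string::size_type plus = rec.offset.rfind('+');
    if (plus == std::string::npos) return "";

    std::string operands = rec.asm_text.substr(prefix.size());
    operands.erase(std::remove(operands.begin(), operands.end(), ' '),
                   operands.end());
    const std::string::size_type c1 = operands.find(',');
    const std::string::size_type c2 = operands.rfind(',');
    if (c1 == std::string::npos || c1 == c2) return "";

    const std::string dest = xmm_field(operands.substr(0, c1));
    const std::string src = xmm_field(operands.substr(c1 + 1, c2 - c1 - 1));
    if (dest.empty() || src.empty()) return "";

    return "DPPS(0x" + rec.offset.substr(plus + 1) + ", " + dest + ", " + src +
           ", 0x" + operands.substr(c2 + 1) + ", " +
           std::to_string(rec.length) + ");";
}
```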

After experimenting with commenting out instruction calls - the game starts and crashes as it should (with an illegal instruction call).

My guess is that it is somehow due to a number of instruction calls, heap size, etc, cause commenting different sets of calls allows to launch the game, so the problem is not with the calls themselves. IDA can also start the game in debug mode.

Resulting dll with all the calls is 1.5M. When I comment like 10% of instruction calls (even the same call, like DPPS for instance) dll might reduce in size to 300-400Kb and then game launches.

So currently observation is: big dll - process crashes instantly. small dll - process starts.

@mirh @AndreyTykhonov Have you seen such behavior before? Do you know which direction I should dig into?

EvgeniySpinov avatar Feb 02 '21 09:02 EvgeniySpinov

Ok, guys, I've made it. Everything works. 1727 lines with various instruction calls.

The problem is ... I get 3 fps. Same as with Intel SDE. Completely unplayable as you may guess.

How the hell, this guy makes it: https://cs.rin.ru/forum/viewtopic.php?f=10&t=71329

Look for "SSE 4.x". His patch works perfectly - I get 30-40 fps hitting my GPU.

EvgeniySpinov avatar Feb 04 '21 23:02 EvgeniySpinov

Yeah, luther_d is one sick fella. Is your fix still just expanding on popcnt_emulator though? While the current version is supposedly better than the old one, it reverts back to some form of trap-and-emulate. State of the art sounds way neater.

mirh avatar Feb 05 '21 01:02 mirh

I've based updates on popcnt_hotfix.

Do you think a better idea is to use PIN to intercept instruction calls before they happen and emulate those calls with SIMDe, instead of exception handling?

EvgeniySpinov avatar Feb 05 '21 01:02 EvgeniySpinov

I'm not really the sharpest tool in the shed, to be honest. Still, I know that "handling broken eggs after they happened" is orders of magnitude slower.

My uneducated guess without any kind of actual profiling is indeed that exception handling is the biggest performance offender.

mirh avatar Feb 05 '21 03:02 mirh

Great article, which means that SDE already using PIN tool JIT compilation in order to intercept instruction calls before any exception. And performance is equal to our solution - which surprises me, tbh, I would expect SDE to be faster, since we're working with exceptions.

As a POPCNT emulator - idea of this tool is great - emulating only 1 instruction instead of whole CPU architecture allows to launch the game and have minimal impact. However whole SSE 4.x stack is heavy. BTW, Intel SDE is developing, so probably there would be a way to emulate selected set of instructions only. Haven't checked for popcnt, but probably there is a switch by now. There is definitely for SSE 4.1, 4.2, 4.3, etc.

Need to get in touch with luther_d and understand how this could be tackled. My best guess is that luther_d is not emulating all of the instruction calls required. Likely he operates on a subfunction level, jumping over functions which contain SSE 4.x where possible and emulating their output when not.

EvgeniySpinov avatar Feb 05 '21 09:02 EvgeniySpinov

Hello from the SIMDe project! When using SIMDe to cope with SSE4.1 instructions not available on the running processor, do you compile using the highest SIMD level available (like SSE3, SSE2, etc..) or are you using the unoptimized fallback implementations?

mr-c avatar Feb 05 '21 09:02 mr-c

Hey @mr-c, thank you for coming to our bonfire :) You've got a great project and great fellows who help people like me to use it.

If you mean compiler and linker options, then highest supported SIMD level is SSE2 for Phenom X6 1090T, which is default for MSVC 2019. I didn't change anything there. SSE3 is partially supported as figured later on: had to emulate pabsd and pshufb calls, cause they were causing invalid code exceptions.

If you mean using SSE1,2 within SIMDe calls, then in here: https://github.com/simd-everywhere/simde/issues/694 I was told that I should not mix them, i.e. either all native or SIMDe. So did I.

EvgeniySpinov avatar Feb 05 '21 12:02 EvgeniySpinov

:-) I'm the SIMDe cheerleader, all the credit goes to our amazing contributors!

Yep, I meant compiler options. The MSVC equivalent of gcc's '-msse2', which seems to be /arch:SSE2 and the default

According to https://www.cpu-world.com/CPUs/K10/AMD-Phenom%20II%20X6%201090T%20Black%20Edition%20-%20HDT90ZFBK6DGR%20(HDT90ZFBGRBOX).html I see that SSE3 is supported, but I don't see a MSVC command line option for that. Does MSVC automatically set __SSE3__ ? If not, you may benefit from defining SIMDE_ARCH_X86_SSE3 to 1.

mr-c avatar Feb 05 '21 12:02 mr-c

Phenom X6 1090T seems to have incomplete SSE3 support. It supports IA SSE3, but not IA Supplemental SSE3. I do not know what the difference is, but the instructions mentioned above are SSE3 instructions and I still had to emulate them.

But I've added SIMDE_ARCH_X86_SSE3 1 - that didn't trigger full rebuild. Seems like functions I'm using from SIMDe mostly using SSE2 functions.
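
For reference, SSE3 and Supplemental SSE3 (SSSE3) are separate extensions with separate CPUID feature bits, and pabsd and pshufb belong to the latter. The sketch below decodes the relevant bits from the ECX value returned by CPUID leaf 1; obtaining that value is left out (on MSVC the __cpuid intrinsic from intrin.h is one way), and SimdLevels and decode_cpuid1_ecx are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// SIMD-related feature flags reported in ECX by CPUID leaf 1.
struct SimdLevels {
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
    bool popcnt;
};

// Tests a single bit of a 32-bit register value.
inline bool bit_set(std::uint32_t value, unsigned bit) {
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// Decodes the feature bits: SSE3 = bit 0, SSSE3 = bit 9, SSE4.1 = bit 19,
// SSE4.2 = bit 20, POPCNT = bit 23.
inline SimdLevels decode_cpuid1_ecx(std::uint32_t ecx) {
    SimdLevels f;
    f.sse3 = bit_set(ecx, 0);
    f.ssse3 = bit_set(ecx, 9);
    f.sse41 = bit_set(ecx, 19);
    f.sse42 = bit_set(ecx, 20);
    f.popcnt = bit_set(ecx, 23);
    return f;
}
```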

EvgeniySpinov avatar Feb 05 '21 12:02 EvgeniySpinov

@EvgeniySpinov Interesting! I wasn't aware of the SSE3 sub-levels.

Can you remind me (maybe with a link) how SIMDe is being compiled/used?

mr-c avatar Feb 05 '21 13:02 mr-c

SSE3 is SSE3, SSSE3 is SSSE3. MSVC has way less automatic granularity than, say, gcc but still I think intrinsics should do it.

Great article, which means that SDE already using PIN tool JIT compilation in order to intercept instruction calls before any exception.

No, because like ogurets said, he's using probe and not JIT.

mirh avatar Feb 05 '21 13:02 mirh

But I've added SIMDE_ARCH_X86_SSE3 1 - that didn't trigger full rebuild. Seems like functions I'm using from SIMDe mostly using SSE2 functions.

Oops, I was wrong, you should use SIMDE_X86_SSE3_NATIVE instead

mr-c avatar Feb 05 '21 16:02 mr-c

How the hell, this guy makes it: https://cs.rin.ru/forum/viewtopic.php?f=10&t=71329

Luther_d's solution is not based on exception handling. It requests new memory on game start and writes ASM code to that memory, which will be executed instead of the unsupported instructions (for example, dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F are DIFFERENT CODE).

After the new memory is created, he injects jmps to it: for example, dpps xmm0, xmm1, 7F becomes jmp OFFSET_IN_MEMORY plus nops, so this solution is very fast.
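
The byte-level side of such a replacement can be sketched as follows. This is an illustration of the general jmp-and-nop technique, not luther_d's actual code: make_jmp_patch is a hypothetical name, and the function only builds the bytes without writing them into any process. A near jmp with a 32-bit displacement takes 5 bytes, so the replaced instruction must be at least that long, and the new memory must lie within about 2 GB of the patched site.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Builds the bytes that replace an instruction of instr_len bytes located at
// address site: "jmp rel32" to target, followed by nop padding.
// Returns an empty vector if the instruction is too short for a 5-byte jmp
// or if target is out of range for a 32-bit displacement.
inline std::vector<std::uint8_t> make_jmp_patch(std::uint64_t site,
                                                std::uint64_t target,
                                                std::size_t instr_len) {
    const std::size_t jmp_len = 5;
    if (instr_len < jmp_len) return {};

    // The displacement is relative to the address right after the jmp.
    const std::int64_t rel = static_cast<std::int64_t>(target - (site + jmp_len));
    if (rel < std::numeric_limits<std::int32_t>::min() ||
        rel > std::numeric_limits<std::int32_t>::max()) {
        return {};
    }

    std::vector<std::uint8_t> bytes(instr_len, 0x90);  // 0x90 = nop
    bytes[0] = 0xE9;                                   // 0xE9 = jmp rel32
    const std::uint32_t rel32 =
        static_cast<std::uint32_t>(static_cast<std::int32_t>(rel));
    for (int i = 0; i < 4; ++i) {
        bytes[1 + i] = static_cast<std::uint8_t>(rel32 >> (8 * i));  // little-endian
    }
    assert(bytes.size() == instr_len);
    return bytes;
}
```

The code placed at the target would then have to end with a jmp back to site + instr_len so that execution continues after the replaced instruction.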

As I understand it, luther_d's solution is not automated: since he releases fixes with new overloads, I think he restarts the game and fixes things until it works. This is the reason why it is not going to be updated.

We can possibly port the fix to new game versions in a few steps:

  1. Get unsupported instructions at 105 version
  2. Install fix, save all ASM code after jmp
  3. Get unsupported instructions at 12 version
  4. Try to replace same instructions with same ASM code

But there can be new instructions, so without good ASM knowledge we can't do much. I tried to use the same method at the beginning.

AndreyTykhonov avatar Feb 06 '21 08:02 AndreyTykhonov