Grim Dawn Performance Issues (Was [Index] buffer hash not supported by texture filtering)


The index buffer filter support is now storing the hash with the resource via SetPrivateData instead of our G->mResources map, so it doesn't work for texture filtering or checktextureoverride. Currently we are wasting CPU time hashing buffers and we have no way to use that hash. This should not be considered limited to index buffers either - constant buffers, structured buffers, etc. can all be hashed as well.

SetPrivateData may turn out to be a superior option, but we need consistency here for everything to work together, and how we choose to store the hash should be handled in ResourceHash.cpp.

Advantages/Disadvantages of various options:

mResources

  • Adv: All our code already uses this, proven to work, path of least resistance
  • Adv: We track all our state - easy to update on ini reload
  • Dis: This is yet another map we look up by resource address, and we have no way to know if the game has freed a resource, so our map can only grow.

SetPrivateData

  • Adv: We have less state to track
  • Adv: DirectX handles releasing anything we place here, limiting the amount of memory we can grow by for long running sessions
  • Adv: This API is intended for pretty much our exact use case
  • Dis(?): That GUID is awfully long - for the odd use here and there I'm sure it won't matter, but what's the performance of this if we need to look it up a lot?
  • Dis: We still have global state that is not associated with a specific texture object (associated with the hash) that we need to change on ini reload (so no pointers!).

Wrapping

  • Adv: All our data structures are right there in one place
  • Dis: Have to locate (or construct a new) wrapper whenever the game gets an object from DirectX, have to unwrap whenever the game passes an object to DirectX
  • Adv: We know when Release() is called and can release our own data structures
  • Dis: We still have global state that is not associated with a specific texture object (associated with the hash) that we need to change on ini reload (so no pointers!).

Inheriting

  • Adv: All our data structures are right there in one place
  • Adv: We know when Release() is called and can release our own data structures
  • Dis(?): May not work at all because DX will be expecting the interface to an object they created, not an interface to our object. If the object was self contained this might work, but it's called an interface for a reason, so I have my doubts.
  • Dis: We still have global state that is not associated with a specific texture object (associated with the hash) that we need to change on ini reload (so no pointers!).
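
To make the trade-off concrete, here is a minimal sketch of how the first two options store and retrieve a hash. The GUID and function names are invented for illustration - this is not 3DMigoto's actual code:

    #include <d3d11.h>
    #include <cstdint>
    #include <unordered_map>

    // Option "mResources": a global map keyed by the resource address
    static std::unordered_map<ID3D11Resource*, uint32_t> mResources;

    // Option "SetPrivateData": attach the hash to the resource itself,
    // filed under a GUID of our choosing (this one is made up):
    static const GUID GUID_ResourceHash =
        { 0x11111111, 0x2222, 0x3333,
          { 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 } };

    void StoreHash(ID3D11Resource *resource, uint32_t hash)
    {
        mResources[resource] = hash;                       // option 1
        resource->SetPrivateData(GUID_ResourceHash,        // option 2:
                                 sizeof(hash), &hash);     // DX keeps a copy
    }

    uint32_t LookupHashViaPrivateData(ID3D11Resource *resource)
    {
        uint32_t hash = 0;
        UINT size = sizeof(hash);
        resource->GetPrivateData(GUID_ResourceHash, &size, &hash);
        return hash;  // the copy is freed when DX releases the resource
    }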

DarkStarSword avatar Sep 22 '17 18:09 DarkStarSword

https://github.com/masterotaku noted a performance regression in 3DMigoto 1.2.63:

https://forums.geforce.com/default/topic/685657/3d-vision/3dmigoto-now-open-source-/post/5230549/#5230549

I believe this is likely due to the re-enabling of the index buffer hash in https://github.com/bo3b/3Dmigoto/commit/bfa2622f209f331679036a2ac455b8c56d3399d1

The problem here is that while (most) games will (mostly) limit texture creation to loading times, they are far less likely to do the same for buffers, given that these can include vertex buffers, index buffers, geometry instance buffers, constant buffers, structured buffers, raw buffers, typed buffers and UAV variants of these. While many games will update these with Map/Unmap or UpdateSubresource and will only pay the price of hashing them once, some games may continually create new buffers and discard the old ones, and will now pay the price of hashing these every time.

I don't think we want to just disable the buffer hashes again - they can be useful (at least, once they are hooked up to the same infrastructure as texture hashes they will be), but we do need to try to avoid paying this price if we aren't using them.

We don't necessarily know in advance if we will need them or not - there's nothing that implies this in the d3dx.ini except for the existence or lack thereof of any [TextureOverride] section at all.

A few ideas:

  1. Add a global option to select what kind of resources we hash. This would default to Texture2D + Texture3D for backwards compatibility, but this would then allow us to enable buffer hashes if we want them (and may as well add Texture1D while we're at it, which I think was still missing). It would be worthwhile adding options to check the bind flags to distinguish between constant buffers, vertex buffers, index buffers, etc because it is unlikely we would ever want to hash all types of buffers. That also could possibly be expanded to check the bind flags of textures we hash, though a quick observation is that we will almost always want SRVs, RTVs and DSVs to be hashed so that may not be worth the complexity.

  2. Add an option to [TextureOverride] sections to indicate that they match a buffer (and optionally which type of buffer) instead of a texture, so that we can imply what types of resources we need to hash rather than needing an explicit option. It might be worthwhile considering the possible future expansion of adding more explicit matching to the description later on instead of hashing it, as this would be the starting point for that since it specifies the resource type, and optionally bind flags and I'd hate to have to change the syntax again later (what I'm getting at is that we should probably keep the type and bind flags separate, because that's what we will want in the future). For backwards compatibility type should default to "Texture2D | Texture3D", and bind flags should default to "any". Note that [Resource] sections already include these types of options - we should make sure we are consistent with it where possible.

  3. Experiment with the possibility of delaying the texture hash until we first need it - which would be significantly more expensive when we do due to needing to pull the data back from the GPU, and may be problematic if the data has been overwritten or is in any of the block compressed formats, but might still save us overall since it can mean there is a lot of resources we don't need to hash. This is not possible for the creation mode overrides in [TextureOverride], and probably should be reserved for some day when we might completely overhaul the hashing infrastructure.

Right now, option 2 is my preferred choice.
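
To illustrate option 2, a hypothetical d3dx.ini fragment might look something like this - none of these keys existed at the time of writing, and the names are invented purely to show the type and bind flags being kept separate:

    ; Hypothetical syntax, for illustration only:
    [TextureOverrideSomeInstanceBuffer]
    hash = deadbeef
    match_type = Buffer                 ; default: Texture2D | Texture3D
    match_bind_flags = vertex_buffer    ; default: any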

DarkStarSword avatar Oct 22 '17 15:10 DarkStarSword

Did a quick profile of the code, comparing 1.2.56 to 1.2.65 on Alien Isolation. Alien does not use any of the advanced texture features. x32 in case that matters. I used this because it was installed, and it's a game where we'd want no impact from unused features.

Module use, Exclusive 1.2.56: [image]

Module use, Exclusive 1.2.65: [image]

So we are using an extra 0.4% of the CPU in this case - 3.47% total. That's actually quite a lot higher than the 0.8% I've previously measured, so we might have other problems. But in this case it's 3.09% vs. 3.47%.


This is where the extra time is spent: [image]

Unless I'm misreading this, that means that GetPrivateData is actually using a std::hash internally, which would make it a poor choice for our use.

bo3b avatar Oct 23 '17 02:10 bo3b

I suspected it might - the GUID is too long to be used directly as a key so if they were storing it in a hash map they would have to be doing some type of transformation on it. Thanks for that - for now I'll switch it to use mResources as the path of least resistance to get things working and see if I can optimise some of this stuff to skip any hashes we know we don't need.

DarkStarSword avatar Oct 23 '17 03:10 DarkStarSword

This GetPrivateData is definitely where the slowdown was introduced.

On the profiles, our code at 1.2.65 uses 0.92% of the CPU now, and in the 1.2.56 case it used 0.81%. (The 3.47% above is for the real d3d11 - no symbols on the old version confused me. The snaps above are for the system d3d11.dll.)


This is using the Exclusive Samples %. As: [image]

[image]

That suggests that the lookup is indeed the most expensive part. If that is true, then to circle back to the original question here - my suggestion would be to use Wrapping to avoid this problem, with the caveat that I'm unlikely to be thinking about this in a big enough picture. Also, this is just one test example.


If we create Hacker subclasses of the different buffer objects, they can store their hash internally as a reference. Or possibly we can just use the object pointer itself and then call a specific method for operations we are interested in.

In this example, the HackerDevice::CreateBuffer would instead create a HackerBuffer as a wrapped object, and return that. In the HackerContext::IASetIndexBuffer, it would just fetch the hash from the HackerBuffer, and not have to pay the lookup price.

It's possible we run into problems with games that care that we pass back a HackerBuffer instead of an ID3D11Buffer, but it's a subclass, and the vtables match. As a general rule, this works throughout all of our other objects, with only occasional weirdness like MGSV.

Not at all sure if this makes sense, I defer to your judgment. I would just like to see us get out of the business of driving up and down hash lists if at all possible, and this seems like a venue for an architectural solution.
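
A rough sketch of what a HackerBuffer along these lines could look like - hypothetical, nothing like this exists in the tree - forwarding every interface method to the real buffer while keeping the hash inline:

    #include <d3d11.h>
    #include <cstdint>

    class HackerBuffer : public ID3D11Buffer {
        ID3D11Buffer *mOrig;   // the real buffer DirectX gave us
        uint32_t mHash;        // computed once, in HackerDevice::CreateBuffer
        ULONG mRefCount = 1;
    public:
        HackerBuffer(ID3D11Buffer *orig, uint32_t hash) :
            mOrig(orig), mHash(hash) {}

        uint32_t GetResourceHash() const { return mHash; }  // no map lookup

        // IUnknown. NB: forwarding QueryInterface hands back the *real*
        // buffer, which would defeat the wrapper - a real implementation
        // would have to handle that case:
        HRESULT STDMETHODCALLTYPE QueryInterface(REFIID riid, void **ppv) override
            { return mOrig->QueryInterface(riid, ppv); }
        ULONG STDMETHODCALLTYPE AddRef() override { return ++mRefCount; }
        ULONG STDMETHODCALLTYPE Release() override
        {
            ULONG ret = --mRefCount;
            // This is the big win: we know when the game frees the resource,
            // so our own state can be freed along with it:
            if (!ret) { mOrig->Release(); delete this; }
            return ret;
        }

        // ID3D11DeviceChild
        void STDMETHODCALLTYPE GetDevice(ID3D11Device **ppDevice) override
            { mOrig->GetDevice(ppDevice); }
        HRESULT STDMETHODCALLTYPE GetPrivateData(REFGUID guid, UINT *pSize, void *pData) override
            { return mOrig->GetPrivateData(guid, pSize, pData); }
        HRESULT STDMETHODCALLTYPE SetPrivateData(REFGUID guid, UINT size, const void *pData) override
            { return mOrig->SetPrivateData(guid, size, pData); }
        HRESULT STDMETHODCALLTYPE SetPrivateDataInterface(REFGUID guid, const IUnknown *pData) override
            { return mOrig->SetPrivateDataInterface(guid, pData); }

        // ID3D11Resource
        void STDMETHODCALLTYPE GetType(D3D11_RESOURCE_DIMENSION *pDim) override
            { mOrig->GetType(pDim); }
        void STDMETHODCALLTYPE SetEvictionPriority(UINT p) override
            { mOrig->SetEvictionPriority(p); }
        UINT STDMETHODCALLTYPE GetEvictionPriority() override
            { return mOrig->GetEvictionPriority(); }

        // ID3D11Buffer
        void STDMETHODCALLTYPE GetDesc(D3D11_BUFFER_DESC *pDesc) override
            { mOrig->GetDesc(pDesc); }
    };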

bo3b avatar Oct 23 '17 04:10 bo3b

Any change to wrapping would have to be a major update - like, a 1.3 version bump type of thing. I can certainly see some advantages, but it is a major change and introduces a lot of potential for crashes if we don't plug every call that passes a resource to/from DirectX. It's also not clear to me how well it would perform for games (or overlays) that call things like PSGetShaderResourceView a lot where we might end up spending a lot of time looking up maps to find our wrapper for the resource being obtained - I don't think we can pass our wrapper to DirectX, and if we're storing information in the wrapper object then we can't just create a new one as we do for the device & context wrappers.

DarkStarSword avatar Oct 23 '17 05:10 DarkStarSword

That sounds right. It would be an experimental branch to begin with, and might not work. I was thinking this would just be a straight subclass of the DX11 objects, like the ID3D11Texture2D and ID3D11Buffer type objects. Just like we do for the top level objects of Device, SwapChain, Context.

Why would you not expect to be able to just pass the wrapped objects to DX11? If that doesn't work, there is no point in going this route, but my expectation is that DX11 won't care.

COM objects are fake in the sense that there is no language support for them, and doing manual OO programming is silly, but they seem to be sufficiently like C++ objects to be passed around in DX11. We seem to be able to subclass without any real problems, just cannot do normal things like call the superclass.

bo3b avatar Oct 23 '17 09:10 bo3b

Why would you not expect to be able to just pass the wrapped objects to DX11? If that doesn't work, there is no point in going this route, but my expectation is that DX11 won't care.

Well, we're not wrapping a full object - we're only wrapping an interface to a larger object, and I don't know how DirectX finds the larger object from the interface we pass it. If this happens through a public method such as QueryInterface or GetPrivateData we'll be fine, but it is quite likely that isn't the case. Essentially, the "this" pointer won't be what DirectX is expecting, so when it goes looking for the private members in the concrete implementation it will instead find the ones that belong to HackerResource in their place and crash. This is essentially the same reason why we have to find the original device in CreateSwapChain and can't pass the HackerDevice to the original call.

DarkStarSword avatar Oct 23 '17 10:10 DarkStarSword

COM objects are fake in the sense that there is no language support for them, and doing manual OO programming is silly

I don't call that fake, nor is it inherently silly. It is something that a lot of projects have tried to do, but most have done quite poorly, and compared to those - COM actually seems quite decent.

If you want to talk about silly OO, C++ objects are silly - they don't have a stable ABI, so you can't put them in a library, and if you do then anything that uses that library has to be compiled with the exact same toolchain as the library or it may not work - hence why extern "C" is a must for most libraries to get a stable ABI, but that isn't OO (technically C doesn't define a stable ABI either, and the fact that there are something like three competing calling conventions in 32bit Windows is an artefact of that, but those calling conventions do define a stable ABI. In Linux the ABI is defined in the ELF specs for each CPU architecture). COM also has a stable ABI, so if one did want to create an OO library such as DirectX, that would be attractive for that reason.

but they seem to be sufficiently like C++ objects to be passed around in DX11

I'm not positive, but I get the impression they were modelled on C++ objects since they seem to implement the same vtable, but with a stable ABI in mind making them suitable for use in a library.

just cannot do normal things like call the superclass.

Not positive what you mean by that - the headers that DirectX gives us only define empty interface definitions, and it doesn't really make sense to talk about a superclass of a definition, because they have to be empty as well - if anything, by calling the original object we're calling a private subclass that we don't know any details of beyond that it implements that interface, not a superclass. Interfaces aren't specific to COM, it just happens that COM enforces their use.

DarkStarSword avatar Oct 23 '17 10:10 DarkStarSword

Why would you not expect to be able to just pass the wrapped objects to DX11? If that doesn't work, there is no point in going this route, but my expectation is that DX11 won't care.

Well, we're not wrapping a full object - we're only wrapping an interface to a larger object, and I don't know how DirectX finds the larger object from the interface we pass it. If this happens through a public method such as QueryInterface or GetPrivateData we'll be fine, but it is quite likely that isn't the case. Essentially, the "this" pointer won't be what DirectX is expecting, so when it goes looking for the private members in the concrete implementation it will instead find the ones that belong to HackerResource in their place and crash. This is essentially the same reason why we have to find the original device in CreateSwapChain and can't pass the HackerDevice to the original call.

Your point is well taken that the objects passed around follow some abnormal/unusual technique, since they are not true C++ objects. If they were true C++ objects, we would not have this problem, because the subclass rules would work even for private interfaces. If they were sure to always use their defined interface of QueryInterface, we'd expect it to work. However, we cannot be sure they are not cheating the object model behind the scenes.

The COM objects are fairly close to standard C++ objects, with the added weirdness of using QueryInterface instead of superclass rules. The vTable is a direct copy of C++, but I'd not expect it to handle pure virtual or multiple inheritance. But if they were to use normal syntax, it should still be possible for them to get to private functions in the vtables. As near as I can tell, there are no private functions declared as part of the objects (the objects are directly callable from a C interface), so if they exist that might very well use a different mechanism altogether.


The CreateSwapChain in DXGI is an interesting case. We do in fact always expect to receive the incoming Device as a HackerDevice, not a raw device. And passing in our HackerDevice instead of the raw device does in fact work, but only if platform_update is installed. Bug? No longer cheating? Very hard to say.

For the most part, our wrapped/subclassed objects work just like the originals, with a couple of nagging hard to understand exceptions.

In general, I've been trying to wrap full objects, not just the interface we are interested in. DXGI is nearly completely wrapped, including the object hierarchy. This has generated a lot of boilerplate of course, but gives us easy logging.

This could be the problem with platform_update support as well. In general, I've been trying to completely wrap them all to make a pseudo-subclass C++ object. Could easily be why it crashes on a couple of games, but I have not fully debugged this to understand why.

bo3b avatar Oct 24 '17 06:10 bo3b

COM objects are fake in the sense that there is no language support for them, and doing manual OO programming is silly

I don't call that fake, nor is it inherently silly. It is something that a lot of projects have tried to do, but most have done quite poorly, and compared to those - COM actually seems quite decent.

If you want to talk about silly OO, C++ objects are silly - they don't have a stable ABI, so you can't put them in a library, and if you do then anything that uses that library has to be compiled with the exact same toolchain as the library or it may not work - hence why extern "C" is a must for most libraries to get a stable ABI, but that isn't OO (technically C doesn't define a stable ABI either, and the fact that there are something like three competing calling conventions in 32bit Windows is an artefact of that, but those calling conventions do define a stable ABI. In Linux the ABI is defined in the ELF specs for each CPU architecture). COM also has a stable ABI, so if one did want to create an OO library such as DirectX, that would be attractive for that reason.

I'm certainly not the guy to argue that C++ did things very well - I find it extremely hard to use by today's standards. But standards are generally better than not, even if the standard is crufty. This is why COM is still valuable also. It's even a little more standard than most things Microsoft does.

In my book this is still silly though, and I'll lump C++ in there too, even if it is the big standard.

The reason these are silly is because they make more work for the programmer, not less. The programmer is always the weak link in the chain for programming. The weak link for quality, for bugs, and for schedule. Anything that puts more burden on the programmer is a mistake. Again, in my opinion. Making us keep track of all these millions of niggling little details is a waste of time, and leads to a raft of mistakes that should never be possible.

The entire point of OO originally was to try to lower the programmer burden, and make (at least some) stupid mistakes not possible. Any implementation that requires me to keep track of the runtime details blew it as far as I'm concerned. That's why I'm not at all a fan of hand-built object junk. If we don't have language support to help us manage the details, I'm not sure OO makes any sense.

bo3b avatar Oct 24 '17 06:10 bo3b

but they seem to be sufficiently like C++ objects to be passed around in DX11

I'm not positive, but I get the impression they were modelled on C++ objects since they seem to implement the same vtable, but with a stable ABI in mind making them suitable for use in a library.

That's the best explanation for COM I've ever run across- that they could have used C++ as the clear choice, but the lack of a consistent ABI made it a poor choice for sharing. I'm not sure about the timelines involved, all of this stuff is pretty ancient.

just cannot do normal things like call the superclass.

Not positive what you mean by that - the headers that DirectX gives us only define empty interface definitions, and it doesn't really make sense to talk about a superclass of a definition, because they have to be empty as well - if anything, by calling the original object we're calling a private subclass that we don't know any details of beyond that it implements that interface, not a superclass. Interfaces aren't specific to COM, it just happens that COM enforces their use.

They do have the concept of subclasses as part of DX11. The objects inherit from different levels, all the way down to the IUnknown interface.

They also implement the superclass function; it's done through QueryInterface. You can't directly use a given superclass function for constructors, but you can just call this->QueryInterface(uuid(type)) to get a class reference, and then call it directly. This is how they allow access to other related objects, and do surface sharing for example. This is all a bit tedious, manual, and error prone, but this is how they expect people to use it.

Once you have an object reference it's a real object, not just an interface. This is why we can subclass their objects as a sort-of wrapper, but still just directly call their methods if we keep a pointer reference. It would be superior if, in an overridden class with a wrapped/overridden function, we could call the superclass directly. They follow a lot of the C++ rules, which is why it's tantalizingly close, but strangely different.

Always possible I am not thinking about this correctly.

bo3b avatar Oct 24 '17 06:10 bo3b

This should be fixed by https://github.com/bo3b/3Dmigoto/commit/e70dc8ec3ad1ce9dddfece31c0235d01e15fada3

This commit does the minimum necessary to switch to mResources, which aside from making index buffer filtering, checktextureoverrides, presets and so on work with index buffer hashes, allows us to compare the performance of GetPrivateData with the mResources map lookup - additional optimisations are certainly possible, but I wanted a way to do a direct comparison.

My initial results from a 40 second profile of Alien Isolation show:

GetPrivateData
--------------
Name                            | Inclusive Samples | Exclusive Samples | Inclusive Samples % | Exclusive Samples %
HackerContext::IASetIndexBuffer | 483               | 67                | 0.99                | 0.14
std::Hash<ID3D11Resource*...>   | 17                | 17                | 0.03                | 0.03

mResources
----------
Name                            | Inclusive Samples | Exclusive Samples | Inclusive Samples % | Exclusive Samples %
HackerContext::IASetIndexBuffer | 330               | 19                | 0.71                | 0.04
std::Hash<ID3D11Resource*...>   | 218               | 218               | 0.47                | 0.47

We need to compare inclusive samples of IASetIndexBuffer to count the samples that are inside the GetPrivateData and hash calls, where we see 153 fewer samples in the same time period. There are more samples in std::Hash now, but that is to be expected since it is now us performing the hash rather than DirectX, and the overall reduction in the number of samples does seem to confirm that mResources is indeed faster than GetPrivateData.

Further optimisations should be possible - there isn't much reason to run the code in IASetIndexBuffer at all when hunting is disabled since the command list code doesn't rely on it. It also might be worthwhile looking into optimising the hash function we use in our maps - since we are using a memory address as a key we should take a lesson from CPU caches and see how it compares if we just use part of the address as the key (ignore the lowest bits which are not unique due to memory alignment constraints, and take the next n bits depending on the number of buckets in the hash map).
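
A quick sketch of that address-based hash idea - illustrative only, and the shift amount assumes allocations are aligned to at least 16 bytes, which would need to be verified:

    #include <cstddef>
    #include <cstdint>

    // Discard the low, always-zero bits of the address and let the map's
    // bucket mask pick up the next n bits:
    struct AlignedPointerHash {
        size_t operator()(const void *ptr) const noexcept {
            return reinterpret_cast<uintptr_t>(ptr) >> 4;
        }
    };

    // Usage: std::unordered_map<ID3D11Resource*, uint32_t, AlignedPointerHash>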

DarkStarSword avatar Oct 24 '17 08:10 DarkStarSword

https://github.com/bo3b/3Dmigoto/commit/85f4cf32ecb9977de02554ca95438bcb819ce295 gets rid of the performance cost in IASetIndexBuffer entirely when not hunting - since we retired the old indexbufferfilter and the command list gets the hash on demand if/when it needs it we only need mCurrentIndexBuffer for hunting, so there is no point paying that cost otherwise.

I'm inclined to close this issue now - we could look at optimising CreateBuffer to skip hashing buffers we don't care about, but we haven't seen that show up in profiling, so it might be better not to complicate that code until we see it as a problem in practice. However, I would only expect that to show up as a problem in games that continually destroy and recreate buffers rather than updating them, and Alien, which we've been using for profiling, is not one of those games.

It might be worthwhile checking the performance in Grim Dawn to see if we need to do any further heroics since that's the game the performance issue was reported in, but otherwise I think this is done.

DarkStarSword avatar Oct 24 '17 14:10 DarkStarSword

If they were true C++ objects, we would not have this problem, because the subclass rules would work even for private interfaces.

I'm not sure there would be any difference (other than QueryInterface) so long as they still exposed an interface definition rather than an implementation definition - we can't mutate their concrete object into a Hacker object (if it was Python we could, but that would be ugly - been there, done that, "elegant hack", horrible to maintain - don't do it), so we still have to wrap it and call its methods through the public interface, which is what we are doing now. What exactly do you see that would be different?

If they were sure to always use their defined interface of QueryInterface, we'd expect it to work. However, we cannot be sure they are not cheating the object model behind the scenes.

Agreed.

The COM objects are fairly close to standard C++ objects, with the added weirdness of using QueryInterface instead of superclass rules. The vTable is a direct copy of C++, but I'd not expect it to handle pure virtual or multiple inheritance. But if they were to use normal syntax, it should still be possible for them to get to private functions in the vtables. As near as I can tell, there are no private functions declared as part of the objects (the objects are directly callable from a C interface), so if they exist that might very well use a different mechanism altogether.

I'm not sure if you quite get the point of interfaces - you are correct that there are no private members or methods defined in the interface definition, but that's because these will be defined in an implementation definition that we aren't privy to. This is actually exactly the same thing that we're doing with HackerDevice and HackerContext - we are creating objects that implement the ID3D11Device interface, which allows the game to work with those because the interface defines the public API that the game uses, but the game has no way to find any private members/methods that we have defined in our implementation of HackerContext/HackerDevice because it is not aware of our implementation definition - yet we can find those quite easily. In the same way we are not aware of DirectX's implementation definition, yet it can find its private members very easily - I am pretty certain that DirectX will have private members and methods that they use internally, but we just can't see them because they are part of the implementation, not the interface.
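
A tiny illustration of that distinction, with generic names that have nothing to do with the real headers:

    #include <cstdint>

    struct IWidget {              // the empty "interface definition" we publish
        virtual void Frob() = 0;  // pure virtual - no implementation here
    };

    class HackerWidget : public IWidget {  // our "implementation definition"
    public:
        void Frob() override {}   // callers can reach this through IWidget...
    private:
        uint32_t mHash = 0;       // ...but this is invisible to them, while our
                                  // own methods can find it trivially
    };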

The CreateSwapChain in DXGI is an interesting case. We do in fact always expect to receive the incoming Device as a HackerDevice, not a raw device.

We receive that device from the game though, not DirectX - we handed the HackerDevice to the game, so we expect to get that back.

And passing in our HackerDevice instead of the raw device does in fact work, but only if platform_update is installed. Bug? No longer cheating? Very hard to say.

Now, that is interesting... and I'm not sure what to make of it. I would not recommend relying on that without understanding why it works.

In general, I've been trying to wrap full objects, not just the interface we are interested in.

I don't think so, but without seeing the implementation details of DirectX I can't be sure. I suspect if you dig into it you would actually find that DXGIDevice, D3D11Device, DXGIDevice1, D3D11Device1, etc. are all the exact same object under the hood (depending on the platform level - in fact, I strongly suspect that D3D10Device* and D3D12Device* will also be part of the exact same object - didn't we see one game that tried to use QueryInterface to switch between DX10 and DX11? Or was that just feature levels?), that implement several different interfaces - otherwise QueryInterface wouldn't work to navigate between them. ~~If I'm understanding the documentation correctly we can probably check this by using QueryInterface to get the IUnknown interface of suspected identical objects - COM's one requirement of IUnknown is that the same handle be returned for the same objects (this is not a requirement of any other interface in COM).~~ No, the doco says that's only for multiple inheritance and might even be optional even then, and definitely not the case for nested classes, so seeing different values won't necessarily mean anything - they could still be using composition to achieve this, which would kind of make more sense anyway.

DXGI is nearly completely wrapped, including the object hierarchy. This has generated a lot of boilerplate of course, but gives us easy logging.

Don't forget that there is a distinction between interface inheritance and implementation inheritance - they are in fact completely separate concepts (same machinery), we just happen to mirror the IDXGIDevice interface hierarchy in our HackerDXGIDevice implementation hierarchy, but it would work without that as well. In fact, we could have a massive object that implements each interface (DXGIDevice + D3D11Device) via multiple inheritance, or even use composition and scrap multi-level implementation inheritance altogether. Not saying that would be better - just another possibility.

They do have the concept of subclasses as part of DX11. The objects inherit from different levels, all the way down to the IUnknown interface.

Not necessarily - that's interface inheritance you're referring to, which is a totally different kettle of fish to implementation inheritance (IUnknown is just as much an empty interface definition as any of the others - it's up to DirectX to implement the reference counting - that isn't provided by COM, and this is part of the reason that COM is language neutral).

We don't know what the internal hierarchy of the true DirectX objects are - they might be using implementation inheritance that happens to mirror that as well, or they might be using composition, or they could just be flat objects. They could even just be static functions and a couple of C structures for all we know - those details are entirely hidden away from us. Hell, technically they could be implemented in Python and use duck typing for this - they aren't, but we don't know that, so we're really just guessing.

Once you have an object reference it's a real object, not just an interface.

Yes, that's true - but the only thing we know about that object is that it implements that interface. We don't know anything more about it, but DirectX does.

This is why we can subclass their objects as a sort-of wrapper

No no no - we're subclassing an interface definition, we aren't subclassing the DirectX object. We're creating an entirely different object that implements the same public API as the DirectX object, but we are not the DirectX object.

but still just directly call their methods if we keep a pointer reference.

Well yeah, but we're just calling the object that DirectX handed us - that's not a part of our object hierarchy at all, it just happens to implement the same interface that we do but that is where the similarities end.

It would be superior if, in an overridden class with a wrapped/overridden function, we could call the superclass directly.

For that to work we would have to be privy to their implementation definition (including all the protected and private members that you don't want in a public header file) and we would have to be the one allocating the object ourselves so that our object is the DirectX object, not just one that implements the same interface. By design that's not possible in COM, and that's a good thing because that leads to all sorts of nasty problems down the line. QueryInterface is something that I'm not so sure was a good idea, but even though it has caused us no end of problems I can see some advantages to it and we're not exactly a typical user of this stuff, so I'm reserving judgement on that one.

DarkStarSword avatar Oct 24 '17 17:10 DarkStarSword

It might be worthwhile checking the performance in Grim Dawn to see if we need to do any further heroics since that's the game the performance issue was reported in, but otherwise I think this is done.

Let's go ahead and leave this open, until we can double check the results in Grim Dawn. When we make a new build, we can have masterotaku try it to be sure.

bo3b avatar Oct 25 '17 06:10 bo3b

Thanks for the discussion on the differences between these DX11 COM interfaces and actual objects. That helps clarify to me why these feel like fake objects. In fact, the pseudo-object nature of things is actually a bit of a red herring; we might be better suited with a different, less misleading structure. We are straight up wrapping, but it misleadingly looks like a subclass. We have had such good real-world success with wrapping that I'm hesitant to change, but hooking should work better, in theory.

The crux of the problem is that we don't have access to their constructor. That's because this is just an interface, not access to their implementation. If we could call their constructor as part of our constructor, we could make genuine objects that could also include their private object details. We wouldn't see those details, or need to, but they'd work. All the DX objects are created by API calls which we intercept, so there are no constructors.

But, the interfaces are exactly as you've described. All the DX header files define their interfaces using the PURE macro, which winds up being __purecall, and ultimately is explicitly describing that this is a pure virtual interface, there is no implementation.


But... In the header files, we can enable the C implementation, which I do for the DX9 interfaces in my upcoming SBS support for DX9 games. That C implementation directly accesses the vtable pointer to get access to the routines requested. It's clearly just a partial peek into their implementation, not the full deal, because there are no constructors or other object references. It says it right there, I just didn't want to believe. :->
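
For reference, this is the shape of those C bindings - paraphrased from the pattern the DX headers use when CINTERFACE is defined, so check the actual header before relying on the exact macros:

    /* The "object" is a struct whose first member, lpVtbl, points at a table
       of function pointers; each "method" is a macro that dereferences it: */
    #define IDirect3DDevice9_Release(p)         (p)->lpVtbl->Release(p)
    #define IDirect3DDevice9_Present(p,a,b,c,d) (p)->lpVtbl->Present(p,a,b,c,d)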

This level of vtable is suggestive, but is a red herring.


I don't think so, but without seeing the implementation details of DirectX I can't be sure. I suspect if you dig into it you would actually find that DXGIDevice, D3D11Device, DXGIDevice1, D3D11Device1, etc. are all the exact same object under the hood (depending on the platform level - in fact, I strongly suspect that D3D10Device* and D3D12Device* will also be part of the exact same object - didn't we see one game that tried to use QueryInterface to switch between DX10 and DX11? Or was that just feature levels?), that implement several different interfaces - otherwise QueryInterface wouldn't work to navigate between them.

I'm not sure of course. I like your idea, and I think you have a better handle on this than I do, but some of my experiments make it less likely.

One that I just did recently while debugging a wicked DX9 problem was to try to use QueryInterface to go from an Ex object to a base object and back. The DX9Ex objects are essentially the same as the DX11 *1, *2 subclasses. (<geez>Microsoft, always be inconsistent from one gen to the next.</geez>)

The case in point was creating a IDirect3DDevice9 object, then comparing and trying to upconvert to IDirect3DDevice9Ex object.

If I create the object with IDirect3D9Ex->CreateDeviceEx I can use it and pass it like a IDirect3DDevice9 object, including simple type-casting. If I QueryInterface(IDirect3DDevice9) instead of type casting, that also works, and returns the exact same memory location.

However, if I create using the standard IDirect3D9->CreateDevice, then try to upcast to what I'd expect with QueryInterface(IDirect3DDevice9Ex), I get an E_NOINTERFACE error. This is where it seems odd, because really at this point I'd expect them all to be the higher level class, as these were Vista-level changes.
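
That experiment, roughly, in code - a sketch with the error handling trimmed down, not the actual debugging code:

    #include <d3d9.h>

    // Probe whether a device exposes the Ex interface:
    bool CanReachEx(IDirect3DDevice9 *dev)
    {
        IDirect3DDevice9Ex *devEx = nullptr;
        HRESULT hr = dev->QueryInterface(IID_IDirect3DDevice9Ex,
                                         reinterpret_cast<void**>(&devEx));
        if (SUCCEEDED(hr)) {
            devEx->Release();  // QueryInterface took a reference on success
            return true;       // created via CreateDeviceEx: same object
        }
        return false;          // E_NOINTERFACE: created via plain CreateDevice
    }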


That type of experiment makes it hard for me to get a handle on their implementation. We kinda-sorta need some deeper knowledge like this for our stuff, because we are so deeply embedded. I'm thinking at present that the reason our platform_update=1 doesn't work is because these objects are getting passed around internally to DX and are not right.

When I check return results for D3D11Device and D3D11Device1 objects, if it's a Device1, and we get the interface to the Device, it is the exact same address. I have not checked to see if DXGIDevice might be the same, but that's a good idea.


For platform_update=1, it still seems like the right plan to return the highest level interface supported at that moment, as the interfaces suggest they will be the same objects.

However... I can say with assurance that in the DX9 case, that if you do this, it subtly breaks the game. To do surface sharing, I must create a IDirect3DDevice9Ex object, and pass that back to the game. And I must create the IDirect3D9Ex factory instead.

This has knock-on effects though, because the internal implementation changed, and it's not a strict subclass. So for example the game will merrily call IDirect3DDevice9->CreateTexture (as it doesn't know anything about the object being IDirect3DDevice9Ex). And the debug layer will cause an exception with the complaint that the pool specified cannot be D3DPOOL_MANAGED.

That means that I then need to also hook that call, and force the pool to D3DPOOL_DEFAULT. But... this clearly indicates that the IDirect3DDevice9Ex is not a clean subclass.

bo3b avatar Oct 25 '17 07:10 bo3b

Yep, agree with everything you just said, and that is some very interesting insights on the DX9 side of things. I wish we had a bit more insight into these things, and of course part of the problem is that they can potentially change the implementation details with platform updates and may well have done so, and while that should work if the API is consistent, it does risk introducing subtle differences in behaviour especially when converting between different interfaces and I wonder if that's one of the reasons the evil update and so on has caused us grief in the past.

DarkStarSword avatar Oct 25 '17 07:10 DarkStarSword

I gave @masterotaku a test build to try with the hash calculation in CreateBuffer() disabled to see if there will be any benefit to skipping that, but there was no measurable difference. I did identify a performance regression due to the hash contamination detection interacting with the index buffers (only impacts while hunting is enabled), which is fixed in 5e17f2798849f59f0c76c7274546afaf8ff3b9b2

DarkStarSword avatar Oct 26 '17 19:10 DarkStarSword

Hmmm... Seems like it's probably worth a shared game purchase for us to profile this directly.

bo3b avatar Oct 26 '17 19:10 bo3b

Yep, pirateguybrush will pick it up for us from the Steam sale when he wakes up.

DarkStarSword avatar Oct 26 '17 20:10 DarkStarSword

Grim Dawn is now on the shared accounts... downloading now

DarkStarSword avatar Oct 27 '17 09:10 DarkStarSword

Some initial tests, and I can only get it down to 57fps on a slower CPU than masterotaku's (4GHz vs 4.7GHz), regardless of hunting=0/1/2. I don't have the DLC, which means no necromancer class and fewer pets on screen at once (I also can't load masterotaku's save file - I just used "GD Defiler" to edit my character so I could unlock Occultist and Shaman skills to get some pets), so that might be part of the reason I'm seeing better performance. I've also got hyperthreading off and SLI on, and built 3DMigoto from the vs2015 branch, so there are some other possible differences there.

The profiler is acting up a bit (at least it is in the free edition of VS2015), but I've got these so far:

With hunting=0: [image]

With hunting=2: [image]

ProcessScissorRects is taking up 1.84% CPU time with hunting disabled. I'm looking at inclusive for that one, since anything in that function is extra work we've added, regardless of which function/DLL it takes place in. AfterDraw() also has a related hotspot (0.3%) restoring the original rasterizer state. Do you know if anyone has started using this? We can probably optimise this (getting rid of the PrivateData would be a start), but part of me would just rather pull it out entirely because the command list can do the exact same thing, but only pays the cost for shaders that are actually overriding the state (that was part of the design goal of the command list - pay the cost once for the ShaderOverrides to fire them off, then only pay additional costs for what we are actually using). Maybe I could remove these but add an alias or something to make these trigger the required command list magic to do the same thing for backwards compatibility...

The frame analysis logging adds up to about 0.4% CPU (I expect each call should be fairly lightweight, but there are a lot of calls, plus each call also checks if LogDebug is enabled, so they are really doing two things), so I might look into restructuring things to make that optional after all (probably not through inheritance, but an optional extra layer in the wrapping might work and still keep things fairly clean. I might try inlining first though and just see how that compares).

We also see BindStereoResources adding up to about 2.5% CPU (inclusive again, since this isn't a wrapped call) - we need them in the draw call for games that constantly unbind everything, but we could experiment with piggybacking them on the bind calls from the game to enforce that our slots remain ours even if it tries to unbind them, which might save us from having to do this every draw call.

TrackAndDivertMap/Unmap add up to about 0.18%. I don't want to touch them too much at the moment because they are in flux with the UE4 support / extension DLL.

With hunting enabled we also see the hash contamination detection is taking up about 1.8% CPU. That's to be expected, but nowhere near as bad as it was while checking buffers. I might still add an option to disable that for cases where we know it isn't needed (but it's a nice warning so I'd like to leave it on by default).

Edit: The software mouse cursor is also costing a little (biggest single cost in present & command lists - a little over 1%). This can be optimised by caching the cursor texture, but we need to make sure that we reliably know when this needs to be updated (changing between cursor icons, or animated cursors).

DarkStarSword avatar Oct 27 '17 13:10 DarkStarSword

https://github.com/bo3b/3Dmigoto/commit/ca963dab5ae4ad35e7776f3fd1a679c84944a9ce removes the conditional disable_scissor and restores the original code, which masterotaku has tested and confirmed works, restoring some of the lost framerate.

We need to decide whether to implement a backwards compatibility alias or not before the next release.

DarkStarSword avatar Oct 27 '17 18:10 DarkStarSword

https://github.com/bo3b/3Dmigoto/commit/d3fcb8c81fba365759369a03896bdb3ac7158add and https://github.com/bo3b/3Dmigoto/commit/bfa64cdc105efb70507bf487c28bb4be40f6d9f5 re-implement support for disable_scissor using the custom shader method. I ended up deciding that, since I want the upcoming UE4 extension DLL to be able to write ini sections on the fly, this could be done in a similar way, so the groundwork here should be usable when that goes in.

DarkStarSword avatar Oct 29 '17 16:10 DarkStarSword

https://github.com/bo3b/3Dmigoto/commit/b7148dec821e5bf8a28c621cc8227af851c0a11f uses a new strategy for binding StereoParams and IniParams - instead of making sure that they are bound on every draw call this now binds them when the HackerContext objects are first created and makes sure that the game doesn't unbind them via any of the XXSetShaderResource calls (tested in Akiba's Trip to make sure that was working). On my system this saves around 2% CPU in Grim Dawn - waiting to hear back from masterotaku on how that affects framerate.
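
A sketch of the guarding idea - the slot number and names here are assumptions for illustration, not necessarily what the commit does:

    #include <d3d11.h>

    // Called from a wrapped *SetShaderResources hook after forwarding the
    // game's call; we only pay to rebind when our slot was clobbered,
    // instead of rebinding on every draw call:
    static void GuardStereoSlot(ID3D11DeviceContext *origContext,
                                ID3D11ShaderResourceView *stereoView,
                                UINT StartSlot, UINT NumViews)
    {
        const UINT STEREO_PARAMS_SLOT = 125;  // assumed reserved slot
        if (StartSlot <= STEREO_PARAMS_SLOT &&
            StartSlot + NumViews > STEREO_PARAMS_SLOT)
            origContext->PSSetShaderResources(STEREO_PARAMS_SLOT, 1,
                                              &stereoView);
    }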

This is a profile taken from the current top of tree with hunting disabled: [image]

Looking a lot better than it was - DrawIndexed() now only takes 0.66% CPU inclusive (of which the original DirectX call is 0.4%), compared to the 5.25% inclusive it was consuming before.

Currently the software mouse appears to be the biggest remaining impact, primarily because UpdateCursorResources is consuming 1% CPU since it doesn't cache anything. The only other blips that are even really showing up are frame analysis (0.45%), NvAPI_IsStereoActive (0.11%) and the resource hash (0.09%), but they are fairly small, so we will be getting into the realm of diminishing returns pretty soon (I can think of a few ways to blow out the resource hash though, like checking it from a [ShaderRegex] that matches every shader). Hash contamination detection still shows up with hunting on, of course.

DarkStarSword avatar Oct 30 '17 00:10 DarkStarSword

I'll be ready to try the new test build when I'm home. Around 9 hours from now. You can send it to me through Steam or in a PM in the Geforce forums. A 2% difference will be hard to notice at low fps, but I'll try to see it in any way I can (maybe in a higher fps location with vsync disabled).

masterotaku avatar Oct 30 '17 07:10 masterotaku

A 2% difference will be hard to notice at low fps

I'm predicting you could easily see another 3-4fps improvement from this, based on the differences you have already reported - the scissor clipping issue was a little under 2% CPU as well and eliminating that gave us a 3fps improvement.

You can send it to me through Steam or in a PM in the Geforce forums

I sent it to you via Steam before I went to bed. Alternatively, it is here: https://darkstarsword.net/3Dmigoto-1.2.67+stereoparams-optimisation.zip

DarkStarSword avatar Oct 30 '17 07:10 DarkStarSword

Thanks. I'll try it just as I arrive home. I hope the difference is as big as you say. It would put us closer to the original performance without 3Dmigoto (at around 5-8% less fps). Then I should try it with the software mouse cursor disabled too, to see how many fps it eats.

masterotaku avatar Oct 30 '17 07:10 masterotaku

Tested!

Before (no pets): 53fps
Before (all pets): 48fps

After (no pets): 59-60fps (even without vsync - it wasn't hitting a vsync or fps limit, because in other areas I get more fps)
After (all pets): 51fps

This is a good improvement, great job :). Disabling the software mouse cursor didn't make it run at any more fps in those "After" scenarios. It's curious how the first example shows a ~13% improvement while the second example shows a ~6% difference. Different CPU needs, I guess.

I didn't see any bugs in the few minutes I tried it (hotkeys worked, effects were fixed, etc).

Edit: in the "all pets" case, without 3Dmigoto I get 57fps, for reference.

masterotaku avatar Oct 30 '17 17:10 masterotaku