Hardware accelerated Motion Search (DX12 ME)
It would be good to add a new search method to MAnalyse: DirectX 12 Motion Estimation search. It will hopefully be standardized via the Microsoft DirectX API across different hardware vendors (not NVIDIA only). Currently it looks like it is limited to Windows 10 builds somewhere above 19041(?).
Description: https://docs.microsoft.com/en-us/windows/win32/medfound/direct3d-video-motion-estimation
doom9 thread: https://forum.doom9.org/showthread.php?t=183517
Example of checking support and initializing the motion estimator:
```cpp
CreateDXGIFactory2(dxgiFactoryFlags, IID_PPV_ARGS(&factory));

ComPtr<IDXGIAdapter1> hardwareAdapter;
GetHardwareAdapter(factory.Get(), &hardwareAdapter);

HRESULT hr = D3D12CreateDevice(
  hardwareAdapter.Get(),
  D3D_FEATURE_LEVEL_11_0,
  IID_PPV_ARGS(&m_device)
);

ComPtr<ID3D12VideoDevice> vid_dev;
HRESULT query_device1_result = m_device->QueryInterface(IID_PPV_ARGS(&vid_dev));

D3D12_FEATURE_DATA_VIDEO_MOTION_ESTIMATOR MotionEstimatorSupport = { 0u, DXGI_FORMAT_NV12 };
HRESULT feature_support = vid_dev->CheckFeatureSupport(D3D12_FEATURE_VIDEO_MOTION_ESTIMATOR, &MotionEstimatorSupport, sizeof(MotionEstimatorSupport));

ComPtr<ID3D12VideoDevice1> vid_dev1;
HRESULT query_vid_device1_result = m_device->QueryInterface(IID_PPV_ARGS(&vid_dev1));

D3D12_VIDEO_MOTION_ESTIMATOR_DESC motionEstimatorDesc = {
  0,                                                          // NodeIndex
  DXGI_FORMAT_NV12,
  D3D12_VIDEO_MOTION_ESTIMATOR_SEARCH_BLOCK_SIZE_8X8,
  D3D12_VIDEO_MOTION_ESTIMATOR_VECTOR_PRECISION_QUARTER_PEL,
  {1920, 1080, 1280, 720}                                     // D3D12_VIDEO_SIZE_RANGE
};

ComPtr<ID3D12VideoMotionEstimator> spVideoMotionEstimator;
HRESULT vid_est_result = vid_dev1->CreateVideoMotionEstimator(
  &motionEstimatorDesc,
  nullptr,
  IID_PPV_ARGS(&spVideoMotionEstimator));
```
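Continuing the snippet above, the actual search would then be recorded on a video-encode command list roughly as sketched below. This is a hedged, untested sketch: spCurrentFrame, spReferenceFrame, spResolvedMVs and spCommandList are hypothetical resources/objects, and the struct and method names follow the Microsoft D3D12 video motion estimation documentation linked above, where they should be double-checked.

```cpp
// Heap that receives the raw motion vectors produced by EstimateMotion.
D3D12_VIDEO_MOTION_VECTOR_HEAP_DESC mvHeapDesc = {
  0,                                                          // NodeIndex
  DXGI_FORMAT_NV12,
  D3D12_VIDEO_MOTION_ESTIMATOR_SEARCH_BLOCK_SIZE_8X8,
  D3D12_VIDEO_MOTION_ESTIMATOR_VECTOR_PRECISION_QUARTER_PEL,
  {1920, 1080, 1280, 720}                                     // should match the estimator's size range
};
ComPtr<ID3D12VideoMotionVectorHeap> spMVHeap;
vid_dev1->CreateVideoMotionVectorHeap(&mvHeapDesc, nullptr, IID_PPV_ARGS(&spMVHeap));

// Record the search on an ID3D12VideoEncodeCommandList (spCommandList - hypothetical).
D3D12_VIDEO_MOTION_ESTIMATOR_OUTPUT meOutput = { spMVHeap.Get() };
D3D12_VIDEO_MOTION_ESTIMATOR_INPUT  meInput  = {
  spCurrentFrame.Get(),   0,   // current NV12 texture + subresource index (hypothetical)
  spReferenceFrame.Get(), 0,   // reference NV12 texture + subresource index (hypothetical)
  nullptr                      // optional hint MV heap
};
spCommandList->EstimateMotion(spVideoMotionEstimator.Get(), &meOutput, &meInput);

// Resolve the opaque heap into a readable DXGI_FORMAT_R16G16_SINT texture
// (quarter-pel MV x/y per search block), which CPU code can then map and read.
D3D12_RESOLVE_VIDEO_MOTION_VECTOR_HEAP_OUTPUT resolveOut = { spResolvedMVs.Get(), {} };
D3D12_RESOLVE_VIDEO_MOTION_VECTOR_HEAP_INPUT  resolveIn  = { spMVHeap.Get(), 1920, 1080 };
spCommandList->ResolveMotionVectorHeap(&resolveOut, &resolveIn);
```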
Currently it only supports 8x8 and 16x16 block sizes and the NV12 colour format, but it provides quarter-pel precision, which is very slow to calculate on the CPU. It is easy to downscale qpel vectors to half- and full-pel.
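Since the estimator returns quarter-pel vectors, reducing them to the target pel grid is a cheap arithmetic step; a minimal sketch (the round-to-nearest choice here is an assumption, not necessarily what MAnalyse would do):

```cpp
// Convert one quarter-pel MV component to the target pel grid (nPel = 1, 2 or 4).
static inline int QPelToNPel(int qpel, int nPel)
{
  const int div = 4 / nPel;                            // 1 = keep qpel, 2 = half-pel, 4 = full-pel
  if (div == 1)
    return qpel;
  const int half = div / 2;
  return (qpel + (qpel >= 0 ? half : -half)) / div;    // round half away from zero
}
```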
For development (building), Windows SDK 10.0.20348.0 or newer is needed.
Some supplementary speed-up work:
Currently MDegrainN scatters the received MV data into a supplementary FakePlaneOfBlocks structure, possibly to make the MV data easier to access in the actual process_luma/chroma degraining work (via the GetX(), GetY(), GetSAD() and GetMV() functions used in use_block_()). Maybe it was designed to stay compatible with some old data source of the MDegrainN algorithm. But now this data scattering (from the compacted x, y, sad data in the MV clip into the VECTOR x, y, sad structures of the FakeBlock class array) takes significant time and could be omitted if the degraining function accessed the received MV data directly (without the intermediate Fake* objects).
There is a change log at the end of the internal html help; probably there are historical comments there on why it is used. Do you think it takes significant time? The vector count is so low compared to the actual memory access count that the extra time for this move should be negligible? You mean this part is not necessary?
void FakePlaneOfBlocks::Update(const int *array)
{
  array += 0;
  for ( int i = 0; i < nBlkCount; i++ )
  {
    blocks[i].Update(array);
    array += N_PER_BLOCK;
  }
}
Actually it scatters the 12-byte record (int mvx, mvy, sad) into 20 bytes (x, y, mvx, mvy, sad).
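For reference, a minimal sketch of the two layouts involved (assuming sad_t is a 32-bit integer, which the 12/20-byte sizes imply; FakeBlockLike is only an illustration, not the actual FakeBlockData declaration):

```cpp
typedef int sad_t;                                  // assumption: 32-bit SAD, matching the 12-byte record
struct VECTOR        { int x, y; sad_t sad; };      // 12 bytes: one packed record in the incoming MV array
struct FakeBlockLike { int x, y; VECTOR vector; };  // 20 bytes: block coordinates + the copied VECTOR
static_assert(sizeof(VECTOR) == 12 && sizeof(FakeBlockLike) == 20, "layout assumption");
```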
?Update@FakePlaneOfBlocks@@QEAAXPEBH@Z PROC ; FakePlaneOfBlocks::Update, COMDAT
; 65 : for ( int i = 0; i < nBlkCount; i++ )
xor r8d, r8d
mov r10, rcx
cmp DWORD PTR [rcx+24], r8d
jle SHORT $LN3@Update
lea rax, QWORD PTR [rdx+8]
mov r9d, r8d
npad 13
$LL4@Update:
; File C:\Github\mvtools\Sources\FakeBlockData.h
; 52 : vector.x = array[0];
mov ecx, DWORD PTR [rax-8]
; File C:\Github\mvtools\Sources\FakePlaneOfBlocks.cpp
; 68 : array += N_PER_BLOCK;
lea rax, QWORD PTR [rax+12]
mov rdx, QWORD PTR [r10+56]
lea r9, QWORD PTR [r9+20]
inc r8d
; File C:\Github\mvtools\Sources\FakeBlockData.h
; 52 : vector.x = array[0];
mov DWORD PTR [r9+rdx-12], ecx
; 53 : vector.y = array[1];
mov ecx, DWORD PTR [rax-16]
mov DWORD PTR [r9+rdx-8], ecx
; 54 : vector.sad = *(sad_t *)(&array[2]);
mov ecx, DWORD PTR [rax-12]
mov DWORD PTR [r9+rdx-4], ecx
; File C:\Github\mvtools\Sources\FakePlaneOfBlocks.cpp
; 65 : for ( int i = 0; i < nBlkCount; i++ )
cmp r8d, DWORD PTR [r10+24]
jl SHORT $LL4@Update
$LN3@Update:
; 69 : }
; 70 : }
ret 0
?Update@FakePlaneOfBlocks@@QEAAXPEBH@Z ENDP ; FakePlaneOfBlocks::Update
_TEXT ENDS
Profiling shows some mysterious stall at blocks[i].Update(array). It may be an issue of the 'old' Core2 E7500 CPU at my home. First I tried to read into an SSE register and write to padded x, y, sad + padding of the FakeBlocks class, to make one 3(4)-integer transfer instead of the 3 separate write operations in FakeBlockData.h: https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/FakeBlockData.h#L51
MV_FORCEINLINE void Update(const int *array) { vector.x = array[0]; vector.y = array[1]; vector.sad = *(sad_t *)(&array[2]); }
But even when writing a zeroed xmm register (without reading from 'array'), it still shows an awful stall. That may be some issue with memory placement, like cache overloading or something else, because write operations are typically cached very well and are almost invisible.
"Do you think it takes significant time? "
I did the profiling with AMD CodeAnalyst (an old version that runs well on Windows 7) with a special build of 'superfast MAnalyse' - with the SearchMVs() function disabled (simulating an infinitely fast hardware MV search). So most of the processing is inside MDegrainN, and Update() takes time close to Degrain_sse2().
" The vector count is so low compared to the actual memory access count"
Yes, and this is the most magical part of the issue. Maybe the whole Fake* structure has some weird placement in memory that causes a great CPU stall on the simple enough operation of scattering the incoming MV array data into the FakeBlocks structure (its vector members x, y, sad). But unfortunately it is a C++ class hierarchy and the programmer cannot control how its data is placed in memory (like a simple 'single-array allocation').
"You mean this part is not necessary?"
What is not necessary is at the beginning of MDegrainN GetFrame(): https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/MDegrainN.cpp#L886
It looks like an attempt to save programming time after moving to MV data supplied by an external filter: make a placeholder Fake* structure and reload it with the incoming MV data, so the 'old methods' of accessing vector.x, y, sad and the block coordinates could be kept. In https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/MDegrainN.cpp#L1926 MDegrainN::use_block_y():
const int blx = block.GetX() * nPel + block.GetMV().x;
const int bly = block.GetY() * nPel + block.GetMV().y;
const sad_t block_sad = block.GetSAD();
The GetX(), GetY(), GetMV(), GetSAD() are lazy-programmer helper functions to access the real MV data from the incoming array of MVs (currently scatter-copied into the Fake* structure) and need to be rewritten to access the data directly (computing pointers/offsets etc.). The new functions must compute all needed data (X, Y, pointers to the incoming MV data (dx, dy, sad)) at degrain time. That will save time, skip using the Fake* structure as a temporary copy of the incoming MV data, and skip that strangely slow copy-scatter processing.
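A rough sketch of what such direct access could look like (hypothetical helper, not existing mvtools code; it assumes each per-block record in the incoming array is N_PER_BLOCK = 3 ints {mvx, mvy, sad} stored row by row, as FakePlaneOfBlocks::Update() implies):

```cpp
// Hypothetical direct accessor over the raw packed MV array (header already skipped).
struct DirectMVAccess
{
  static const int N_PER_BLOCK = 3;   // ints per block record: mvx, mvy, sad
  const int* pArray;                  // points at the first per-block record
  int nBlkX, nBlkSizeX, nBlkSizeY, nOverlapX, nOverlapY;

  int MVx(int i) const { return pArray[i * N_PER_BLOCK + 0]; }
  int MVy(int i) const { return pArray[i * N_PER_BLOCK + 1]; }
  int SAD(int i) const { return pArray[i * N_PER_BLOCK + 2]; }

  // Block top-left coordinates recomputed from the block index,
  // the same formula FakePlaneOfBlocks uses when it fills FakeBlockData.
  int X(int i) const { return (i % nBlkX) * (nBlkSizeX - nOverlapX); }
  int Y(int i) const { return (i / nBlkX) * (nBlkSizeY - nOverlapY); }
};

// use_block_y-style usage, replacing block.GetX()/GetY()/GetMV()/GetSAD():
//   const int blx = acc.X(i) * nPel + acc.MVx(i);
//   const int bly = acc.Y(i) * nPel + acc.MVy(i);
//   const int block_sad = acc.SAD(i);
```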
Here is a typical profiler log with some awful time in FakeGroupOfPlanes::Update():
I see, but I cannot explain it other than as a glitch in the measurement method, or a lock or some other big overhead somewhere in between.
Yes, there is a critical section (mutex) there:
bool FakeGroupOfPlanes::Update(const int *array, int data_size)
{
  //::EnterCriticalSection (&cs);
  std::lock_guard<std::mutex> lock(cs);
Now the question is: why is it guarded? I suppose it was put there for a good reason.
Unfortunately, disabling the mutex guard does not help speed at all.
Here is the disassembly from AMD CodeAnalyst of the Visual Studio 2019 compiler build:
That part is quick: r8+14h is the target's 20-byte size, rax+0Ch is the 12-byte size of a source unit - simple 32-bit reads and writes. It cannot be made any faster; moreover, this is crazy quick compared to other activities.
As for the stall or extremely slow memory access: out-of-buffer reads (writes?) can cause such slowness. Argh, I wish I remembered where I met this issue in the last one or two months. Strangely, it did not throw any 0xC0000005 access violation error; just when I increased an ending index by one, the whole unit became 10x slower than before. As if some internal processor-level exception was silently handled.
Maybe you could run it in a debug build with all boundary checks switched on for that module?
Until then, I'll try to remember when I tortured myself with this problem; it took a day until I finally understood the reason.
I've got it. It was when I implemented atan2 into Expr.
https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/exprfilter/exprfilter.cpp#L387
There is a math helper constants table for sin, cos, log, etc.
Look at this array declaration; the atan2 helpers are the last 6 constants: constexpr ExprUnion logexpconst_avx alignas(32)[73][8]
For atan2 calculation I read logexpconst_avx[68] to logexpconst_avx[73]
Unfortunately, at that time the table was extended to a length of 73 only for SSE (there is an 8-byte counterpart of this constant table), so the AVX array length was only 68, but I read past it.
I don't know exactly how predefined constants are stored (data segment, memory areas marked as 'code'), but I guess the slowness occurred because I read outside the allowed data memory limits?
It looks like a small 'logical' optimization is possible: https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/FakeGroupOfPlanes.cpp#L113
performs the update for all nLvCount levels in the Fake* structure from the incoming array of MVs (it looks like the array is a copy of the output results from all search levels in MAnalyse). But MDegrainN uses only level 0, see https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/MDegrainN.cpp#L1925 and https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/MVClip.h#L79
So maybe updating all levels except 0 can be skipped (for MDegrainN operation) to get some speedup. Level 0 is the largest, but the sum of levels 1+2+3+4+5+... may not be very small either. Going to check this idea. For speed I frequently limit 'levels' to 2 in MAnalyse and it helps visibly.
Also, the hardware-accelerated MAnalyse will only provide the largest (level 0) MV data.
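A possible shape of that level-skipping change, only as a sketch (hypothetical function; the walk over the per-level sub-arrays is simplified and assumes each level's data is prefixed by its length in ints, as the existing FakeGroupOfPlanes::Update() loop implies):

```cpp
// Sketch: scatter only level 0 (the only level MDegrainN reads), but still
// advance the pointer past every level's data so the walk stays correct.
void UpdateLevel0Only(FakePlaneOfBlocks* const* planes, int nLvCount, const int* pA)
{
  for (int i = nLvCount - 1; i >= 0; --i)
  {
    if (i == 0)
      planes[i]->Update(pA + 1);  // pA[0] = size of this level's sub-array, data follows
    pA += pA[0];                  // skip over this level regardless
  }
}
```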
Trying to test an idea: do not copy the MV data into the FakeBlockData class members, but only save a pointer to the 'array' in the FakePlaneOfBlocks class and make new GetMV() and GetSAD() functions. Add const int* pArray; to the FakePlaneOfBlocks class.
The array is copied because as soon as the update is done, the frame content is lost: setting mvF[j] and mvB[j] to zero actually nullifies the pointers and lets the PVideoFrame objects be freed.
for (int j = level - 1; j >= 0; j--)
{
  mvF[j] = mvClipF[j]->GetFrame(n, env);
  mvClipF[j]->Update(mvF[j], env);
  isUsableF[j] = mvClipF[j]->IsUsable();
  mvF[j] = 0; // v2.0.9.2 - it seems, we do not need in vectors clip anymore when we finished copiing them to fakeblockdatas
}
Well - maybe simply make a copy of the MV array (per plane, if required) with memcpy() into a specially allocated buffer in each PlaneOfBlocks class? And possibly only one plane of level 0 is required (+ the UV planes of level 0). That buffer allocation can be better controlled, can be done via API, and does not rely on where the VECTOR structures of FakeBlockData happen to be placed in memory.
But to keep compatibility with the current MDegrainN use_block* functions, it also requires new versions of FakeBlockData->GetMV() and FakeBlockData->GetSAD() that access the common buffer of MV data, instead of the array of VECTOR structures of the individual FakeBlockData objects scattered somewhere in memory. If that is too complex in C++, maybe simply delete the FakeBlockData class and put the required GetMV() and GetSAD() (plus GetX() and GetY()) functions inside the PlaneOfBlocks class. Then FakePlaneOfBlocks::Update(const int *array) will do a 'standard' memcpy into the internal buffer to store the data for processing.
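A minimal sketch of that memcpy variant (hypothetical class and member names, not the actual commit; it assumes the same 3-int {mvx, mvy, sad} record per block):

```cpp
#include <cstring>
#include <vector>

// Sketch: the plane keeps a plain copy of the packed records and serves
// MV/SAD reads straight from it, so FakeBlockData is not needed for MDegrainN.
class BufferedPlaneOfBlocks
{
  static const int N_PER_BLOCK = 3;   // mvx, mvy, sad
  std::vector<int> mv_buffer;         // owned copy of the incoming per-block records
public:
  void Update(const int* array, int nBlkCount)
  {
    mv_buffer.resize(size_t(nBlkCount) * N_PER_BLOCK);
    std::memcpy(mv_buffer.data(), array, mv_buffer.size() * sizeof(int)); // plain copy, no scatter
  }
  int GetMVx(int i) const { return mv_buffer[size_t(i) * N_PER_BLOCK + 0]; }
  int GetMVy(int i) const { return mv_buffer[size_t(i) * N_PER_BLOCK + 1]; }
  int GetSAD(int i) const { return mv_buffer[size_t(i) * N_PER_BLOCK + 2]; }
};
```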
Will try a memcpy() test to see if at least the source can be accessed by the CPU fast enough.
Well - the incoming pointer looked alive but did not keep pointing to the required data and was updated somewhere. So the current version is based on copying into a local array in FakePlaneOfBlocks (currently allocated on the heap with the C++ 'new' operator, but it may be changed to something else if memory issues are found again). This commit: https://github.com/DTL2020/mvtools/commit/1a3cc1542c75654215ddd1fda6850b0d5a9a029c
But it looks like MDegrain1,2,3.. also need to be updated to use this array? Or pass some bool param to the FakePlaneOfBlocks->Update() function to do the scatter-copy of MVs into the FakeBlockData vector structures, so the old GetMV() and GetSAD() functions keep working.
The only FakeBlockData functions still used in MDegrainN, GetX() and GetY(), look like they can also be removed and replaced with an x,y calculation from the block index i (+ the overlap offset, as in https://github.com/pinterf/mvtools/blob/d8bdff7e02c15a28dcc6e9ef2ebeaa9d16cc1f56/Sources/FakePlaneOfBlocks.cpp#L49 ), which may save some more memory reads and speed things up. It would also make the FakeBlockData class completely unused for MDegrainN processing.
Continued in https://github.com/DTL2020/mvtools