
[OpenCL][Texture] Improved texture memory planning

Open srkreddy1238 opened this issue 1 year ago • 12 comments

Motivated by the fact that textures can be allocated over a clBuffer object and that the size of the backing clBuffer can be computed based on the hardware image pitch alignment.

This optimizes the overall memory allocation on the device and greatly helps models with large memory requirements.
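
As a rough illustration of the backing size computation (a sketch with illustrative names, using the standard CL_DEVICE_IMAGE_PITCH_ALIGNMENT query; the PR may compute this differently):

```cpp
// Sketch: size of a clBuffer that can back a 2D image, with the row pitch
// rounded up to the device's image pitch alignment. Names are illustrative,
// not the ones used in the PR.
#include <CL/cl.h>
#include <cstddef>

size_t BackingBufferSize(cl_device_id dev, size_t width_px, size_t height,
                         size_t pixel_size_bytes) {
  cl_uint pitch_align_px = 1;  // reported in pixels (OpenCL 2.0+)
  clGetDeviceInfo(dev, CL_DEVICE_IMAGE_PITCH_ALIGNMENT, sizeof(pitch_align_px),
                  &pitch_align_px, nullptr);
  if (pitch_align_px == 0) pitch_align_px = 1;  // some devices report 0
  // Round the row width up to a multiple of the pitch alignment.
  size_t aligned_width_px =
      ((width_px + pitch_align_px - 1) / pitch_align_px) * pitch_align_px;
  return aligned_width_px * pixel_size_bytes * height;
}
```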

Improved the graph memory planner so that it no longer differentiates between buffer and texture storage tokens and can reuse them across the two. The texture pool in the OpenCL runtime is rebranded as a memory pool that handles allocation for both buffer and image objects.

The NDArray-to-DeviceAPI interface is extended with AllocDataSpaceView and FreeDataSpaceView. These new APIs accommodate accessing the same physical memory as clBuffer / clImage objects.
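
A rough sketch of the shape of this extension (placeholder types for illustration; the actual signatures are in the PR diff):

```cpp
// Illustrative shape of the proposed extension (placeholder types, not the
// committed signatures): a view (clImage) is allocated over an existing data
// space (clBuffer) and can be freed without releasing the backing memory.
#include <cstddef>
#include <cstdint>
#include <string>

struct Device { int device_type; int device_id; };        // stand-in for DLDevice
struct DataType { uint8_t code, bits; uint16_t lanes; };   // stand-in for DLDataType

class DeviceAPISketch {
 public:
  // Existing: allocate/free a physical data space (a clBuffer on OpenCL).
  virtual void* AllocDataSpace(Device dev, size_t nbytes, size_t alignment,
                               DataType type_hint) = 0;
  virtual void FreeDataSpace(Device dev, void* ptr) = 0;

  // Proposed: create a view (e.g. a clImage) over an existing allocation so the
  // same physical memory can be accessed as both buffer and image.
  virtual void* AllocDataSpaceView(Device dev, void* backing, int ndim,
                                   const int64_t* shape, DataType dtype,
                                   const std::string& mem_scope) = 0;
  // Proposed: release only the view; the backing allocation stays alive.
  virtual void FreeDataSpaceView(Device dev, void* view) = 0;

  virtual ~DeviceAPISketch() = default;
};
```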

srkreddy1238 avatar Jun 07 '23 19:06 srkreddy1238

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • cc @echuraev, @elvin-n See #10317 for details

Generated by tvm-bot

tvm-bot avatar Jun 07 '23 19:06 tvm-bot

Quickly took a look at this PR. It contains many changes, but no tests. I'd like to see unit tests for these changes.

I agree. I am working on it.

Meanwhile, I would appreciate a review of the interface changes (DeviceAPI, NDArray) and the overall design aspects.

srkreddy1238 avatar Jun 08 '23 05:06 srkreddy1238

@echuraev can you take a look now?

srkreddy1238 avatar Jun 12 '23 02:06 srkreddy1238

Thanks @srkreddy1238. Reading through the set of changes, I feel that we need something that is lifted further out of the DeviceAPI, since this is getting closer to the two-stage concept:

  • The Allocator allocates a physical Storage object (in this case backed by a clBuffer).
  • The Storage object then allocates the NDArray (backed by that Storage object).

We already have a similar concept in the memory manager, e.g. in relax, although the memory manager as it stands does not yet reflect the need to support things like creating different memory scopes. This would help move some of the complexity of the texture handling into a specialized allocator that creates the Storage with the related behavior.

The main rationale is to keep the overall DeviceAPI minimal, without behaviors such as pooling, and move some of that logic into the Storage/NDArray separation, which can hopefully also enable direct support in VM-backed runtimes like unity. Maybe we can discuss a bit how to make that part more generalized so it can be reused.
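
A minimal sketch of that two-stage split (placeholder classes for illustration, not the actual runtime code):

```cpp
// Minimal sketch of the two-stage split: an allocator hands out physical
// Storage, and NDArrays are then carved out of that Storage.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Storage {                  // stage 1: physical allocation (e.g. a clBuffer)
  void* data = nullptr;
  size_t nbytes = 0;
};

struct NDArrayView {              // stage 2: tensor view over a Storage
  std::shared_ptr<Storage> storage;
  std::vector<int64_t> shape;
  std::string mem_scope;          // e.g. "global" or "global.texture"
};

class AllocatorSketch {
 public:
  // Allocates the backing Storage; a texture-aware allocator would size it
  // using the device's image pitch alignment.
  virtual std::shared_ptr<Storage> Alloc(size_t nbytes, size_t alignment,
                                         const std::string& mem_scope) = 0;
  virtual ~AllocatorSketch() = default;
};

// The Storage then backs one or more NDArray views (buffer- or image-scoped).
inline NDArrayView AllocNDArray(std::shared_ptr<Storage> storage,
                                std::vector<int64_t> shape, std::string mem_scope) {
  return NDArrayView{std::move(storage), std::move(shape), std::move(mem_scope)};
}
```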

tqchen avatar Jun 13 '23 14:06 tqchen

Thanks @echuraev and @tqchen for a quick review.

I see that MemoryManager can meet the requirements of the pool concept we have here. I will improve the PR by:

  • Initializing the MemoryManager (PooledAllocator) from the graph runtime (StorageSetup).
  • NDArray->Empty => PooledAllocator->Empty (assuming the PooledAllocator works for all runtimes, or I can limit this to texture-scoped graphs).
  • NDArray->CreateView => DeviceAPI->AllocDataSpaceView: this will take care of the various views over an NDArray (rough sketch below).
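
A rough sketch of that CreateView path (placeholder types and a hypothetical helper name, not the actual signatures in the PR):

```cpp
// Rough sketch: a texture-scoped view is carved out of an existing
// buffer-backed array through the proposed AllocDataSpaceView hook.
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct DeviceSketch { int device_type; int device_id; };

// Stand-in for the proposed DeviceAPI hook: would wrap `backing` (a clBuffer)
// in a clImage of the requested shape/scope and return the image handle.
void* AllocDataSpaceViewSketch(DeviceSketch /*dev*/, void* backing,
                               const std::vector<int64_t>& /*shape*/,
                               const std::string& /*mem_scope*/) {
  return backing;  // stub: the real implementation creates a clImage here
}

struct NDArraySketch {
  void* data;
  std::vector<int64_t> shape;
  std::string mem_scope;

  // NDArray::CreateView analogue: same physical memory, new shape and scope.
  NDArraySketch CreateView(DeviceSketch dev, std::vector<int64_t> new_shape,
                           std::string new_scope) const {
    void* view = AllocDataSpaceViewSketch(dev, data, new_shape, new_scope);
    return NDArraySketch{view, std::move(new_shape), std::move(new_scope)};
  }
};
```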

Another situation I came across multiple times concerns runtime::GetDataSize(container->dl_tensor).

Can we move this into the DeviceAPI? In the case of textures there are device-specific attributes (image row pitch) that define the underlying memory size, and accessing these attributes outside the DeviceAPI doesn't look like a great idea.
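
As a rough illustration of why the size is device dependent (hypothetical helpers, not the current runtime::GetDataSize):

```cpp
// For a texture-scoped tensor the real footprint depends on the device's image
// row pitch, so the computation arguably belongs behind the DeviceAPI rather
// than in generic runtime code.
#include <cstddef>
#include <cstdint>
#include <vector>

// Flat size, roughly what runtime::GetDataSize computes for buffer-backed tensors.
size_t FlatDataSize(const std::vector<int64_t>& shape, size_t elem_bytes) {
  size_t nbytes = elem_bytes;
  for (int64_t d : shape) nbytes *= static_cast<size_t>(d);
  return nbytes;
}

// Texture-backed size: each image row is padded to row_pitch_bytes, a
// device-specific value (e.g. derived from the image pitch alignment).
size_t TextureDataSize(size_t height, size_t row_pitch_bytes) {
  return height * row_pitch_bytes;
}
```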

srkreddy1238 avatar Jun 14 '23 09:06 srkreddy1238

@tqchen I later realized that the MemoryManager interface is part of relax (unity branch, not mainline). Should I bring it into relay and tailor it? Or are there other ideas?

srkreddy1238 avatar Jun 14 '23 15:06 srkreddy1238

One thing that I can probably do is a bit of refactoring to bring it into the runtime folder so we can leverage it in most places; I may need a few weeks to get to this. There is also already one in relay that can be used as a starting point. GetDataSize is also logic that can be specific to the pool.

tqchen avatar Jun 14 '23 15:06 tqchen

@tqchen

I enhanced the PR to reuse the existing VM runtime memory manager from the graph executor at https://github.com/apache/tvm/pull/15058/commits/31d7de0cb6ee6dc0d1efcc8d53f2e570af7316f0

Basically,

  • Promoted the memory manager from runtime/vm to the runtime level.
  • Redirected the graph executor to use the MemoryManager interface for StorageSetup (sketched below, after this list).
  • Made the corresponding changes in the OpenCL runtime to work with this MemoryManager.
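
A simplified sketch of that redirection (placeholder classes standing in for the promoted MemoryManager/Allocator; the real wiring is in the commit above):

```cpp
// StorageSetup obtains its storage entries from a device-wide allocator
// instead of allocating each one directly against the DeviceAPI.
#include <cstdlib>
#include <vector>

class AllocatorSketch {
 public:
  // Stand-in for the pooled allocator; the OpenCL implementation would route
  // this through the shared buffer/image memory pool.
  void* Alloc(size_t nbytes) { return std::malloc(nbytes); }
};

class MemoryManagerSketch {
 public:
  // One allocator per device, created lazily; shown here for a single device.
  static AllocatorSketch* GetOrCreateAllocator() {
    static AllocatorSketch inst;
    return &inst;
  }
};

// Graph executor StorageSetup, redirected to the memory manager instead of
// calling NDArray::Empty for every storage entry.
std::vector<void*> StorageSetupSketch(const std::vector<size_t>& storage_sizes) {
  AllocatorSketch* alloc = MemoryManagerSketch::GetOrCreateAllocator();
  std::vector<void*> storage_pool;
  for (size_t nbytes : storage_sizes) storage_pool.push_back(alloc->Alloc(nbytes));
  return storage_pool;
}
```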

Let me know your opinion on these.

Another recommendation I have is to redirect AllocWorkspace and FreeWorkspace to the MemoryManager and drop workspace_pool. workspace_pool is slightly different from pooled_allocator (the workspace pool looks for the best fit, whereas the pooled allocator looks for an exact fit), and we can amend this functionality.
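
To illustrate the difference between the two lookup policies (simplified free-list sketches, not the actual workspace_pool / pooled_allocator code):

```cpp
#include <cstddef>
#include <vector>

struct Block { void* ptr; size_t nbytes; };

// workspace_pool style, best fit: reuse the smallest free block that is
// large enough for the request.
Block* BestFit(std::vector<Block>& free_list, size_t nbytes) {
  Block* best = nullptr;
  for (Block& b : free_list) {
    if (b.nbytes >= nbytes && (best == nullptr || b.nbytes < best->nbytes)) best = &b;
  }
  return best;  // nullptr means: allocate a fresh block
}

// pooled_allocator style, exact fit: reuse only a free block whose size
// matches the request exactly.
Block* ExactFit(std::vector<Block>& free_list, size_t nbytes) {
  for (Block& b : free_list) {
    if (b.nbytes == nbytes) return &b;
  }
  return nullptr;  // nullptr means: allocate a fresh block
}
```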

srkreddy1238 avatar Jul 17 '23 06:07 srkreddy1238

@tqchen @echuraev can you take another look at this? I have rebased and improved it based on the new memory manager interface.

srkreddy1238 avatar Oct 09 '23 04:10 srkreddy1238

@tqchen & @yongwww can you take a look at this PR?

srkreddy1238 avatar Oct 20 '23 03:10 srkreddy1238

@tqchen can you take a look at this PR?

srkreddy1238 avatar Nov 22 '23 05:11 srkreddy1238