PyTorch memory performance gap compared to tc_malloc (Windows)
Hi Microsoft Mimalloc team,
Good news: I have worked with the Meta PyTorch team to integrate mimalloc (stable v1.8.2) into PyTorch, and we use mimalloc to boost PyTorch performance on Windows. The enabling PR is merged: https://github.com/pytorch/pytorch/pull/102595
However, mimalloc still has a performance gap compared to tc_malloc. For the summary, please check here: https://github.com/pytorch/pytorch/issues/102534. With Option 1, tc_malloc runs the test case in 2.9 s. With Option 2, mimalloc runs the same test case in 3.9 s.
Could you please help me continue optimizing mimalloc so that PyTorch gets even faster?
test case:
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
import time
model = models.resnet18()
device = torch.device("cpu")
model.to(device)
inputs = torch.randn(5, 3, 224, 224).to(device)
start = time.time()
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        for _ in range(100):
            model(inputs)
end = time.time()
print(prof.key_averages().table(sort_by="cpu_time_total"))
print("Execution time:", end - start)
Option 1: PyTorch with tc_malloc. Please check my prototype repo and branch: https://github.com/xuhancn/pytorch/tree/xu_cpu_alloc_via_tc_malloc
Option 2: PyTorch with mimalloc. Please check the official PyTorch repo, main branch: https://github.com/pytorch/pytorch
It is best to reproduce the issue on an Intel Xeon processor server. The debug history is here: https://github.com/pytorch/pytorch/issues/62387
I tried to enable large pages in mimalloc to improve PyTorch performance, and I have some updates:
- "AdjustTokenPrivileges" fails and returns the status "ERROR_NOT_ALL_ASSIGNED", error code 1300 (0x514), so "large_page_size" is never initialized (at line 94). According to the MSDN "Large-Page Support" article, this is a failure at step 1. It seems we cannot enable large pages in mimalloc this way.
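For anyone reproducing step 1 outside of mimalloc, here is a minimal ctypes sketch of the privilege check (the function name `try_enable_lock_memory_privilege` is mine; mimalloc's actual implementation is the C code at the line numbers above). On an account without "Lock pages in memory" rights it reports exactly the ERROR_NOT_ALL_ASSIGNED (1300) condition described:

```python
import ctypes
import sys

# Constants from the Windows SDK headers.
ERROR_NOT_ALL_ASSIGNED  = 1300    # 0x514, the error code reported above
SE_PRIVILEGE_ENABLED    = 0x0002
TOKEN_ADJUST_PRIVILEGES = 0x0020
TOKEN_QUERY             = 0x0008

class LUID(ctypes.Structure):
    _fields_ = [("LowPart", ctypes.c_uint32), ("HighPart", ctypes.c_int32)]

class LUID_AND_ATTRIBUTES(ctypes.Structure):
    _fields_ = [("Luid", LUID), ("Attributes", ctypes.c_uint32)]

class TOKEN_PRIVILEGES(ctypes.Structure):
    _fields_ = [("PrivilegeCount", ctypes.c_uint32),
                ("Privileges", LUID_AND_ATTRIBUTES * 1)]

def try_enable_lock_memory_privilege():
    """Enable SeLockMemoryPrivilege on the current process token
    (step 1 of MSDN "Large-Page Support"). Returns False on non-Windows,
    on API failure, or when the account does not hold the privilege
    (AdjustTokenPrivileges "succeeds" but sets ERROR_NOT_ALL_ASSIGNED)."""
    if sys.platform != "win32":
        return False
    advapi32 = ctypes.WinDLL("advapi32", use_last_error=True)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    token = ctypes.c_void_p()
    if not advapi32.OpenProcessToken(kernel32.GetCurrentProcess(),
                                     TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY,
                                     ctypes.byref(token)):
        return False
    try:
        tp = TOKEN_PRIVILEGES()
        tp.PrivilegeCount = 1
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED
        if not advapi32.LookupPrivilegeValueW(
                None, "SeLockMemoryPrivilege",
                ctypes.byref(tp.Privileges[0].Luid)):
            return False
        if not advapi32.AdjustTokenPrivileges(token, False, ctypes.byref(tp),
                                              0, None, None):
            return False
        # AdjustTokenPrivileges returns nonzero even when nothing was granted,
        # so the real result is in the last-error value.
        return ctypes.get_last_error() != ERROR_NOT_ALL_ASSIGNED
    finally:
        kernel32.CloseHandle(token)

print(try_enable_lock_memory_privilege())
```

Granting the account "Lock pages in memory" via secpol.msc (and restarting the process) is what makes step 1 succeed.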
- I tried forcing "large_page_size" to equal 2 MB (at line 105) and found that PyTorch performance really did improve to 2.9 s, as good as tc_malloc. My guess is that making "large_page_size" equal 2 MB triggers the same large-page logic, and that this matches PyTorch's tensor usage.
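Note that, instead of patching line 105, mimalloc also exposes its large-OS-pages support through the documented MIMALLOC_LARGE_OS_PAGES environment option, which must be set before the allocator is loaded. A minimal launcher sketch (the script name resnet18_bench.py is hypothetical; on Windows this path still requires the SeLockMemoryPrivilege discussed above, so it likely hits the same ERROR_NOT_ALL_ASSIGNED):

```python
import os
import subprocess
import sys

# Ask mimalloc to try large OS pages (2 MiB). mimalloc reads its options
# when it is loaded, so the variable must be set before the target process
# starts, not from inside it.
env = dict(os.environ, MIMALLOC_LARGE_OS_PAGES="1")

# "resnet18_bench.py" is a hypothetical name for the test case above.
if os.path.exists("resnet18_bench.py"):
    subprocess.run([sys.executable, "resnet18_bench.py"], env=env)
```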
In the PyTorch scenario, PyTorch only calls mimalloc to allocate tensors; it does not override the system malloc functions. Tensors are usually large allocations, and some PyTorch operations, such as "reorder" and "make contiguous", need temporary buffers, so it may allocate and free large blocks frequently.
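To put numbers on "large size memory": assuming dense float32 tensors (4 bytes per element), the buffers in the test case above are already bigger than one 2 MiB large page (the conv1 output shape is what resnet18 produces for a 5x3x224x224 input):

```python
def tensor_bytes(shape, elem_size=4):
    """Bytes for a dense tensor of the given shape (float32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * elem_size

MIB = 1024 * 1024
LARGE_PAGE = 2 * MIB                       # typical Windows large-page size

input_shape = (5, 3, 224, 224)             # the "inputs" tensor in the script
conv1_out   = (5, 64, 112, 112)            # resnet18's first conv output

for name, shape in [("inputs", input_shape), ("conv1 out", conv1_out)]:
    b = tensor_bytes(shape)
    print(f"{name}: {b / MIB:.2f} MiB, exceeds one 2 MiB page: {b > LARGE_PAGE}")
# inputs: 2.87 MiB, exceeds one 2 MiB page: True
# conv1 out: 15.31 MiB, exceeds one 2 MiB page: True
```

So even the smallest buffers in this workload span multiple 2 MiB pages, which would explain why the forced 2 MB "large_page_size" matches the tensor usage so well.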
Could mimalloc add some optimizations to support this scenario? I believe mimalloc has the potential to be optimized to match tc_malloc.