mixtral-offloading
Hard to benchmark the operations in this repo
Hi, thanks for your work! I recently wanted to benchmark the latency of each step in this repo, and I found that if I use torch.cuda.synchronize() and time.time(), I cannot get the actual data copy time.
For example, I believe the data copy happens in these two lines:
device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)
Here time.time() gives me about 1e-5 s, which I believe is far shorter than the real data transfer latency. I suspect the reason is that multiple processes or threads are involved, which would lead to a wrong latency measurement. Could you help me solve this problem?
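For reference, here is a minimal sketch of the kind of measurement I have in mind, using CUDA events instead of time.time(). The helper name time_copy_ms and the tensors are hypothetical stand-ins, not code from this repo, and I am assuming the copies run on the current CUDA stream. As far as I understand, a host-to-device copy with non_blocking=True only overlaps with the host when the source is in pinned memory.

```python
import time

try:
    import torch
except ImportError:  # allows the sketch to be imported without PyTorch
    torch = None


def time_copy_ms(dst, src, repeats=10):
    """Return the mean time in milliseconds of dst.copy_(src, non_blocking=True).

    Hypothetical helper: dst and src are plain tensors standing in for the
    repo's expert buffers.
    """
    use_cuda = (
        torch is not None
        and torch.cuda.is_available()
        and (dst.is_cuda or src.is_cuda)
    )
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()      # drain previously queued GPU work
        start.record()                # timestamp on the current stream
        for _ in range(repeats):
            dst.copy_(src, non_blocking=True)
        end.record()
        end.synchronize()             # wait until the copies have finished
        return start.elapsed_time(end) / repeats  # elapsed_time is in ms

    # CPU-only fallback: the copy is synchronous, so wall-clock time is fine.
    t0 = time.perf_counter()
    for _ in range(repeats):
        dst.copy_(src)
    return (time.perf_counter() - t0) * 1000.0 / repeats


if __name__ == "__main__":
    if torch is None:
        print("PyTorch is not installed")
    elif torch.cuda.is_available():
        host = torch.empty(64 * 1024 * 1024, dtype=torch.uint8).pin_memory()
        device = torch.empty_like(host, device="cuda")
        print(f"host -> device: {time_copy_ms(device, host):.3f} ms per copy")
    else:
        a, b = torch.zeros(1024), torch.ones(1024)
        print(f"cpu -> cpu: {time_copy_ms(a, b):.6f} ms per copy")
```

If something like this is the right approach, I could wrap the two copy_ calls above in the same way, but I am not sure whether they run on the current stream.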
Many thanks!