mixtral-offloading Hard to benchmark the operation in the repo

Open · mynotwo opened this issue 5 months ago · 1 comment

Hi, thanks for your work! I want to benchmark the latency of each step in this repo, but I found that using torch.cuda.synchronize() together with time.time() does not give me the actual data copy time.

For example, I believe the data copy happens in these two lines:

    device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
    offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)

Here time.time() gives me about 1e-5 s, which I believe is far shorter than the real data transfer latency. I suspect the cause is that multiple processes/threads are involved, which leads to a wrong measurement. Could you help me solve this problem?
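For reference, one common way to time a `copy_(..., non_blocking=True)` call is to bracket it with CUDA events instead of host timestamps, since a non-blocking copy can return to Python before the transfer has finished. The sketch below is only an illustration under stated assumptions: `time_copy_ms` is a hypothetical helper that is not part of this repo, the tensors are stand-ins for the expert storages, and the copy is assumed to be enqueued on the current CUDA stream. If the repo issues its copies on a different stream, the events would have to be recorded on that stream instead.

```python
import time

import torch


def time_copy_ms(dst: torch.Tensor, src: torch.Tensor, repeats: int = 10) -> float:
    """Average time in milliseconds of dst.copy_(src, non_blocking=True).

    Hypothetical helper for illustration; not part of mixtral-offloading.
    """
    if not (dst.is_cuda or src.is_cuda):
        # No GPU tensor involved: the copy is synchronous on the host,
        # so a wall-clock measurement is already valid.
        start = time.perf_counter()
        for _ in range(repeats):
            dst.copy_(src, non_blocking=True)
        return (time.perf_counter() - start) * 1e3 / repeats

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()  # drain earlier work so it is not counted
    start_evt.record()        # enqueued on the current stream
    for _ in range(repeats):
        dst.copy_(src, non_blocking=True)
    end_evt.record()
    end_evt.synchronize()     # block the host until the copies have completed
    return start_evt.elapsed_time(end_evt) / repeats  # elapsed_time is in ms


if __name__ == "__main__":
    n = 16 * 1024 * 1024  # 16M float32 values = 64 MiB
    if torch.cuda.is_available():
        # Pinned host memory is what allows the host-to-device copy to be async.
        src = torch.empty(n, dtype=torch.float32).pin_memory()
        dst = torch.empty(n, dtype=torch.float32, device="cuda")
    else:
        src = torch.ones(n)
        dst = torch.empty(n)
    print(f"avg copy time: {time_copy_ms(dst, src):.3f} ms")
```

The events are timestamped on the GPU when the stream reaches them, so the result reflects the transfer itself rather than the time it takes Python to enqueue it.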

Many thanks!

Aug 29 '24 02:08 mynotwo
