
[FEATURE] Plan to improve the overall cross mesh resharding speed


As mentioned in issue #416, one cross-mesh resharding task is composed of many "N-to-M data send tasks", and each such task can be handled by one NCCL broadcast operation. To speed up the overall communication in Alpa, we need to schedule these many cross-mesh resharding tasks, and also the many broadcast operations inside each task.

Taking programming difficulty into consideration, I have the following two improvement directions as a first step:

  • For one task, it is possible to run multiple broadcast operations concurrently as long as the broadcasts do not share involved hosts. By this I mean: if the senders of two broadcasts reside on the same host, then one broadcast must wait for the other to finish first. Alpa's current implementation runs the broadcasts one by one with no concurrency, so there is room for improvement by introducing concurrency here. A naive solution is to use multiple threads to submit many broadcast tasks at the same time and let them compete for communication bandwidth; I want to implement this first because it is easy (see the first two sketches after this list). We could go further and schedule the precise time at which each broadcast happens; this can be formalized into a well-defined problem and solved with fancier algorithms, but I prefer to implement those later because they have many details, are hard to program, and might add large overhead due to their complexity. More precisely, I want to implement this with NCCL aggregated operations in one group call.

  • For many resharding tasks, we want to run some of these tasks concurrently to improve the overall speed. The current implementation inside PipelineMeshWorkerExecutable executes instructions sequentially, where the instructions include FREE, RUN, BROADCAST, and so on. I want to model the dependencies among these instructions and then, instead of running them sequentially, run them according to the topology of their dependency graph. Some instructions in this graph do not depend on each other, and we can certainly run those concurrently (see the third sketch after this list). As for how to implement this concurrency, I have two choices:

    1. Use multithreading inside PipelineMeshWorkerExecutable, with one thread per instruction. When one thread/instruction finishes, it checks whether its successor instructions in the dependency graph are ready to execute; repeat until all instructions are finished. I want to refer to Ray's AsyncIO / Concurrency for Actors here.
    2. In the current implementation, one PipelineMeshWorkerExecutable, holding a sequence of instructions, resides in one Ray actor. I could change this so that each instruction resides in its own Ray actor, and then either design a communication strategy among the actors or use a centralized scheduler to make sure they indeed run according to the dependency graph.
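As a concrete starting point, here is a minimal sketch of the naive multi-threaded approach for the first direction, assuming each broadcast is already wrapped as a Python callable; the `(sender_host, run_broadcast)` pairing and the function names are made up for illustration, not Alpa APIs:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_broadcasts_concurrently(broadcast_tasks):
    """Run broadcasts from different sender hosts at the same time.

    `broadcast_tasks` is a list of (sender_host, run_broadcast) pairs,
    where `run_broadcast` is a callable issuing one NCCL broadcast.
    Broadcasts whose senders share a host stay sequential; the rest
    compete for communication bandwidth concurrently.
    """
    # Group broadcasts by sender host: same group -> run one by one,
    # different groups -> free to overlap.
    by_host = defaultdict(list)
    for host, task in broadcast_tasks:
        by_host[host].append(task)

    def run_group(tasks):
        for task in tasks:  # sequential within one sender host
            task()

    with ThreadPoolExecutor(max_workers=max(len(by_host), 1)) as pool:
        for fut in [pool.submit(run_group, g) for g in by_host.values()]:
            fut.result()  # wait and propagate any exception
```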
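And a sketch of the eventual goal, issuing the broadcasts as NCCL aggregated operations inside one group call via CuPy's NCCL bindings (the `pending` tuple layout is an assumption for illustration; a real implementation would carry each broadcast's own dtype instead of the hard-coded float32):

```python
from cupy.cuda import nccl

def launch_grouped_broadcasts(pending, stream):
    """Issue many NCCL broadcasts as one aggregated group call.

    `pending` is a list of (comm, dev_ptr, count, root) tuples, where
    `comm` is a cupy.cuda.nccl.NcclCommunicator for that broadcast's
    device group. Between groupStart() and groupEnd(), NCCL aggregates
    the calls, so broadcasts over disjoint device sets can proceed in
    parallel without the CPU serializing them.
    """
    nccl.groupStart()
    for comm, dev_ptr, count, root in pending:
        comm.bcast(dev_ptr, count, nccl.NCCL_FLOAT32, root, stream.ptr)
    nccl.groupEnd()
```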
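For the second direction, here is a sketch of choice 1: executing the instruction dependency graph with a thread pool. The `deps`/`execute` interfaces are invented for illustration, and error handling is omitted:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_dag(instructions, deps, execute):
    """Execute `instructions` concurrently, respecting `deps`.

    `deps` maps an instruction to the set of instructions it depends
    on; `execute` actually runs one instruction (FREE / RUN / ...).
    """
    if not instructions:
        return
    remaining = {inst: set(deps.get(inst, ())) for inst in instructions}
    successors = {inst: [] for inst in instructions}
    for inst, preds in remaining.items():
        for pred in preds:
            successors[pred].append(inst)

    lock = threading.Lock()
    all_done = threading.Event()
    pending = [len(instructions)]

    with ThreadPoolExecutor() as pool:
        def run(inst):
            execute(inst)
            ready = []
            with lock:
                pending[0] -= 1
                if pending[0] == 0:
                    all_done.set()
                for succ in successors[inst]:
                    remaining[succ].discard(inst)
                    if not remaining[succ]:  # all predecessors done
                        ready.append(succ)
            for succ in ready:
                pool.submit(run, succ)

        # Seed the pool with instructions that have no predecessors.
        for inst in [i for i, preds in remaining.items() if not preds]:
            pool.submit(run, inst)
        all_done.wait()
```

For example, `run_dag(['a', 'b', 'c'], {'c': {'a', 'b'}}, print)` runs `a` and `b` concurrently and `c` only after both finish.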

Because each device mesh involves only one host in Alpa's current examples, the first improvement will not give any speedup right now: there is no cross-host concurrency to exploit. The second improvement will definitely give some speedup, because it overlaps communication with computation.

I am going to implement this. If anyone has advice, I would really appreciate it.

TarzanZhao · May 25 '22 22:05

For the second part, you do not need to use any CPU concurrency/multithreading features. You can just create multiple CUDA streams and correctly sync the streams to resolve data dependencies. The CPU dispatches CUDA kernels to different streams (e.g., one stream for computation, one stream for send, one stream for receive), and the GPU can execute kernels in different streams in parallel.
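A minimal sketch of this pattern with CuPy's stream/event API (the matmul and the trivial consumer are stand-ins for Alpa's actual RUN and SEND kernels):

```python
import cupy as cp

compute_stream = cp.cuda.Stream(non_blocking=True)
send_stream = cp.cuda.Stream(non_blocking=True)

x = cp.ones((1024, 1024), dtype=cp.float32)

# RUN: enqueue the computation on the compute stream.
with compute_stream:
    y = x @ x  # stand-in for a RUN instruction

# Mark the point where the computation finishes on the GPU.
compute_done = compute_stream.record()

# SEND: the send stream waits for the event *on the GPU*, so the CPU
# never blocks and can keep dispatching subsequent instructions.
send_stream.wait_event(compute_done)
with send_stream:
    out = y * 1.0  # stand-in for a SEND/BROADCAST kernel consuming y

send_stream.synchronize()
```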

merrymercy · May 30 '22 11:05

Thanks! You are right. This seems much easier.

TarzanZhao · May 30 '22 11:05

Does memory management require some CPU concurrency? For example, suppose a buffer's last use is a SEND that has just been launched, and the next instruction is a FREE. Alpa does not actually manage the GPU memory; the FREE instruction only deletes the buffer's ref in the worker's dict. So the timeline might be:

CPU: launch SEND → launch FREE → delete CPU buffer ref → delete PyBuffer in XLA → XLA recycles the buffer's memory
GPU: sending the buffer on stream s_1 ........................ still not done?
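If that race is real, one fix (just a sketch, not Alpa's actual worker code; `buffers`, `launch_send`, and `nccl_send` are hypothetical names) is to record a CUDA event on the send stream when the SEND is launched and have FREE wait on that event before dropping the last reference:

```python
import cupy as cp

send_stream = cp.cuda.Stream(non_blocking=True)
buffers = {}  # uuid -> (device buffer, event recording its last use)

def launch_send(uuid, nccl_send):
    buf, _ = buffers[uuid]
    with send_stream:
        nccl_send(buf)  # enqueue the send kernel; returns immediately
    # Remember when the send actually finishes on the GPU.
    buffers[uuid] = (buf, send_stream.record())

def free(uuid):
    buf, last_use = buffers.pop(uuid)
    if last_use is not None:
        last_use.synchronize()  # wait until the GPU is done with buf
    del buf  # only now is it safe for XLA to recycle the memory
```

Blocking inside FREE keeps the sketch simple; a non-blocking variant would move `(buf, last_use)` onto a pending-free list that is drained once the event reports completion (`event.done` in CuPy).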

ZYHowell · Jun 01 '22 15:06

See this paper by the Alpa team: https://arxiv.org/pdf/2211.05322.pdf

Related PR: #773

zhisbug · Nov 15 '22 07:11