Kan Zhu
Hi, all the tests, including the online benchmarks, can be invoked via perf.sh.
We will release the code for reproducing the results in the coming week.
Thanks for your question. NanoFlow currently works on 8*A100 only. When fewer than 8 GPUs are present, NanoFlow assumes an empty result for the missing GPUs, causing incorrect...
4090s do not have NVLink to efficiently move data between GPUs. Therefore, the pipeline needs to be redesigned to accommodate the longer communication time. We will work on supporting NanoFlow with...
On A100, AllReduce (AR) is often implemented as ReduceScatter followed by AllGather (AG). In terms of total execution time, AR = 2*AG. However, using GEMV.AG and O.AG allows us...
2AG: AllGather the attention output [num_tokens, 1024] into [num_tokens, 8192], multiply it with O [1024, 8192] (partitioned along the N dim) to get [num_tokens, 1024], then AG to get [num_tokens, 8192]. AR: attention output...
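To make the 2AG data flow concrete, here is a minimal single-process sketch that simulates the two AllGathers with `np.concatenate` over per-GPU slices. The shapes (num_tokens, hidden = 8192, TP = 8, shard = 1024) follow the comment above; the random tensors and variable names are illustrative, not NanoFlow's actual code.

```python
import numpy as np

tp, tokens, hidden = 8, 4, 8192
shard = hidden // tp  # 1024 per GPU

rng = np.random.default_rng(0)
# Each GPU holds a [tokens, 1024] slice of the attention output.
attn_slices = [rng.standard_normal((tokens, shard)) for _ in range(tp)]
O = rng.standard_normal((hidden, hidden))  # full O projection weight

# AG 1: gather the slices so every GPU sees [tokens, 8192].
attn_full = np.concatenate(attn_slices, axis=1)

# Each GPU multiplies with its N-dim shard of O -> [tokens, 1024].
partials = [attn_full @ O[:, g * shard:(g + 1) * shard] for g in range(tp)]

# AG 2: gather the partial outputs -> [tokens, 8192].
out_2ag = np.concatenate(partials, axis=1)

# Reference: the unsharded computation gives the same result.
out_ref = attn_full @ O
assert np.allclose(out_2ag, out_ref)
```

Since each AG moves the same volume as one half of an AR, the two AGs cost about the same total time as one AR, but they can be scheduled separately around the O GEMM.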
When TP is small, almost all of the available GPU memory is occupied by model weights. Therefore, the request batch size is reduced, and thus the batching effect is less significant...
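A back-of-the-envelope calculation illustrates this. The numbers below (a 70B-parameter FP16 model on 80 GB GPUs) are my own illustrative assumptions, not measurements from NanoFlow:

```python
# Illustrative only: 70B params, FP16 (2 bytes/param), 80 GB per GPU.
params = 70e9
bytes_per_param = 2
gpu_mem_gb = 80.0

for tp in (2, 4, 8):
    weights_gb = params * bytes_per_param / tp / 1e9   # weights per GPU
    kv_budget_gb = max(gpu_mem_gb - weights_gb, 0.0)   # left for KV cache
    print(f"TP={tp}: weights {weights_gb:.1f} GB, KV budget {kv_budget_gb:.1f} GB")
```

At TP=2 only ~10 GB per GPU remains for the KV cache versus ~62 GB at TP=8, so the servable batch size, and with it the batching benefit, shrinks sharply at small TP.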
Hi, please use our serve_8B.py for experiments. We will rename it in a future version.
We tested our framework on multiple host machines and got similar results. The key is tuning the binding of threads to CPU cores: /src/computeBound.cu#L100
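For readers unfamiliar with CPU pinning, here is a minimal Linux-only sketch of the idea using Python's stdlib; NanoFlow's actual binding is done in C++/CUDA at the linked location, and this is only an illustration of the mechanism:

```python
import os

# Pinning the CPU threads that launch GPU work to fixed cores avoids
# thread migration (e.g. across NUMA nodes), which is one reason
# throughput can vary between host machines.
original = os.sched_getaffinity(0)   # remember the current core set

os.sched_setaffinity(0, {0})         # pin the calling thread to core 0
assert os.sched_getaffinity(0) == {0}

os.sched_setaffinity(0, original)    # restore the original binding
```

In C++, the equivalent is `pthread_setaffinity_np` with a `cpu_set_t`, which is the usual approach in a CUDA host thread.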