Kan Zhu
Hi, all the tests, including the online benchmarks, can be invoked via perf.sh.
We will release the code for reproducing the results in the coming week.
Thanks for your question. NanoFlow currently works on 8*A100 only. When fewer than 8 GPUs are present, NanoFlow assumes an empty result for the missing GPUs, causing incorrect...
4090s do not have NVLink to efficiently move data between GPUs. Therefore, the pipeline needs to be redesigned to accommodate the longer communication time. We will work on supporting NanoFlow with...
On A100, AllReduce (AR) is often implemented as ReduceScatter followed by AllGather (AG). In terms of total execution time, AR = 2*AG. However, using GEMV.AG and O.AG allows us...
2AG: AllGather the attention output [num_tokens, 1024] into [num_tokens, 8192], multiply it with O [1024, 8192] (partitioned along the N dim) to get [num_tokens, 1024], then AG to get [num_tokens, 8192]. AR: attention output...
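To make the 2AG data flow concrete, here is a minimal single-process sketch that simulates the two AllGathers with `np.concatenate` over per-GPU slices. The shapes (num_tokens, hidden = 8192, TP = 8, shard = 1024) follow the comment above; the random tensors and variable names are illustrative, not NanoFlow's actual code.

```python
import numpy as np

tp, tokens, hidden = 8, 4, 8192
shard = hidden // tp  # 1024 per GPU

rng = np.random.default_rng(0)
# Each GPU holds a [tokens, 1024] slice of the attention output.
attn_slices = [rng.standard_normal((tokens, shard)) for _ in range(tp)]
O = rng.standard_normal((hidden, hidden))  # full O projection weight

# AG 1: gather the slices so every GPU sees [tokens, 8192].
attn_full = np.concatenate(attn_slices, axis=1)

# Each GPU multiplies with its N-dim shard of O -> [tokens, 1024].
partials = [attn_full @ O[:, g * shard:(g + 1) * shard] for g in range(tp)]

# AG 2: gather the partial outputs -> [tokens, 8192].
out_2ag = np.concatenate(partials, axis=1)

# Reference: the unsharded computation gives the same result.
out_ref = attn_full @ O
assert np.allclose(out_2ag, out_ref)
```

Since each AG moves the same volume as one half of an AR, the two AGs cost about the same total time as one AR, but they can be scheduled separately around the O GEMM.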
When TP is small, almost all of the available GPU memory is occupied by model weights. Therefore, the request batch size is reduced, and thus the batching effect is less significant...
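A back-of-the-envelope calculation illustrates this. The numbers below (a 70B-parameter FP16 model on 80 GB GPUs) are my own illustrative assumptions, not measurements from NanoFlow:

```python
# Illustrative only: 70B params, FP16 (2 bytes/param), 80 GB per GPU.
params = 70e9
bytes_per_param = 2
gpu_mem_gb = 80.0

for tp in (2, 4, 8):
    weights_gb = params * bytes_per_param / tp / 1e9   # weights per GPU
    kv_budget_gb = max(gpu_mem_gb - weights_gb, 0.0)   # left for KV cache
    print(f"TP={tp}: weights {weights_gb:.1f} GB, KV budget {kv_budget_gb:.1f} GB")
```

At TP=2 only ~10 GB per GPU remains for the KV cache versus ~62 GB at TP=8, so the servable batch size, and with it the batching benefit, shrinks sharply at small TP.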
Hi, please use our serve_8B.py for experiments. We will rename it in a future version.
We tested our framework on multiple host machines and got similar results. The key is tuning the binding of threads to CPU cores: /src/computeBound.cu#L100
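For readers unfamiliar with CPU pinning, here is a minimal Linux-only sketch of the idea using Python's stdlib; NanoFlow's actual binding is done in C++/CUDA at the linked location, and this is only an illustration of the mechanism:

```python
import os

# Pinning the CPU threads that launch GPU work to fixed cores avoids
# thread migration (e.g. across NUMA nodes), which is one reason
# throughput can vary between host machines.
original = os.sched_getaffinity(0)   # remember the current core set

os.sched_setaffinity(0, {0})         # pin the calling thread to core 0
assert os.sched_getaffinity(0) == {0}

os.sched_setaffinity(0, original)    # restore the original binding
```

In C++, the equivalent is `pthread_setaffinity_np` with a `cpu_set_t`, which is the usual approach in a CUDA host thread.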