A question while reading the paper
What does UGD refer to in Figure 4 of the paper?
Up, Gate, Down: the MLP projection layers in Llama.
@serendipity-zk @happierpig this term should be spelled out in the paper.
That makes sense, thank you!
Another question that came up while reading the paper:
Moving data between the NUMA-affinitive (directly attached) CPU and GPU can lead to 1.27× bandwidth gain compared to non-affinitive ones. NanoFlow ensures the KV-cache is copied to and from the affinitive NUMA node via thread binding.
Could you share the NUMA settings of the host machine? I suspect they are important for reproducing the benchmarks.
We tested our framework on multiple host machines and got similar results. The key is tuning how threads are bound to CPUs; see /src/computeBound.cu#L100
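For readers trying to reproduce this, here is a minimal sketch of the general technique (binding the copy thread to cores on the GPU-affinitive NUMA node before issuing the host-to-device copy). This is not the code in computeBound.cu; the core range, node mapping, and buffer size below are hypothetical and must be adapted to the actual host topology, which you can inspect with `nvidia-smi topo -m` and `numactl --hardware`.

```cpp
// bind_copy_thread.cu -- minimal sketch, assuming GPU 0 is attached to a NUMA
// node whose cores are 0..31 (hypothetical; check your own topology).
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical core range of the NUMA node directly attached to GPU 0.
static const int kAffCoreBegin = 0;
static const int kAffCoreEnd   = 31;

// Pin the calling thread to the affinitive node's cores so that the pinned
// staging buffer and the PCIe transfer both stay NUMA-local.
void bindToAffinitiveNode() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = kAffCoreBegin; c <= kAffCoreEnd; ++c) CPU_SET(c, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

int main() {
    bindToAffinitiveNode();
    cudaSetDevice(0);

    const size_t bytes = size_t(64) << 20;  // 64 MiB stand-in for a KV-cache chunk
    void *hostBuf = nullptr, *devBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);        // pinned host memory, allocated after binding
    cudaMalloc(&devBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    printf("copy issued from a NUMA-affinitive thread\n");
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```

Binding before allocating the pinned buffer matters because the physical pages typically land on the node of the allocating thread; binding a thread that copies into memory already placed on the remote node would not recover the bandwidth gain.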