A question while reading the paper
What does UGD refer to in Figure 4 of the paper?
Up, Gate, Down: the MLP projection layers in Llama.
@serendipity-zk @happierpig this term should be spelled out in the paper.
That makes sense, thank you!
Another question that came up while reading the paper:
Moving data between the NUMA-affinitive (directly attached) CPU and GPU can lead to 1.27× bandwidth gain compared to non-affinitive ones. NanoFlow ensures the KV-cache is copied to and from the affinitive NUMA node via thread binding.
Could you share the NUMA settings of the host machine? I suspect they are important for reproducing the benchmarks.
We tested our framework on multiple host machines and got similar results. The key is tuning how threads are bound to CPUs; see /src/computeBound.cu#L100
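For readers trying to reproduce this, here is a minimal sketch of the general technique (binding the copy thread to cores on the GPU-affinitive NUMA node before issuing the host-to-device copy). This is not the code in computeBound.cu; the core range, node mapping, and buffer size below are hypothetical and must be adapted to the actual host topology, which you can inspect with `nvidia-smi topo -m` and `numactl --hardware`.

```cpp
// bind_copy_thread.cu -- minimal sketch, assuming GPU 0 is attached to a NUMA
// node whose cores are 0..31 (hypothetical; check your own topology).
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical core range of the NUMA node directly attached to GPU 0.
static const int kAffCoreBegin = 0;
static const int kAffCoreEnd   = 31;

// Pin the calling thread to the affinitive node's cores so that the pinned
// staging buffer and the PCIe transfer both stay NUMA-local.
void bindToAffinitiveNode() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = kAffCoreBegin; c <= kAffCoreEnd; ++c) CPU_SET(c, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}

int main() {
    bindToAffinitiveNode();
    cudaSetDevice(0);

    const size_t bytes = size_t(64) << 20;  // 64 MiB stand-in for a KV-cache chunk
    void *hostBuf = nullptr, *devBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);        // pinned host memory, allocated after binding
    cudaMalloc(&devBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    printf("copy issued from a NUMA-affinitive thread\n");
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```

Binding before allocating the pinned buffer matters because the physical pages typically land on the node of the allocating thread; binding a thread that copies into memory already placed on the remote node would not recover the bandwidth gain.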