Xiulong Yuan

Results 27 issues of Xiulong Yuan

I've been doing research on tensorpipe and the rpc framework based on tensorpipe provided in pytorch for several days and found this is really a great project. I think tensorpipe...

Could tensorpipe use DPDK to bypass the kernel and avoid memory copies when the user doesn't have RDMA or EFA?

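Because the NIC itself has multiple processing units (PUs), and in RC mode each QP is bound to one PU, our implementation uses a single thread with multiple QPs so that throughput is not capped by a single PU and CPU capacity is fully used. However, we currently find that when a single client performs feature aggregation, network bandwidth only reaches about 10.5 GB, still about 2 GB short of 12 GB; at that point the bottleneck is mainly the client's CPU (for details, [see the test script](https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py)). We therefore need to implement a multi-thread, multi-QP mode to avoid the single-CPU bottleneck. The number of threads should be exposed to the user as a setting; the default is 1, and a maximum of 2 should generally be enough to saturate the network.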
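The multi-thread, multi-QP layout described in this issue can be sketched as follows. This is only an illustration under stated assumptions, not the quiver-feature implementation: `assign_qps`, `split_requests`, `post_reads`, and `gather` are hypothetical names, `post_reads` merely records which QP would serve each request instead of touching a NIC, and a real implementation would presumably use native threads rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

QP_NUM = 8  # illustrative default; any positive value works


def assign_qps(qp_num, num_threads):
    """Round-robin QP ids over threads: thread t owns QPs t, t+T, t+2T, ..."""
    if num_threads < 1 or num_threads > qp_num:
        raise ValueError("num_threads must be between 1 and qp_num")
    return [list(range(t, qp_num, num_threads)) for t in range(num_threads)]


def split_requests(node_ids, num_threads):
    """Split a batch into contiguous, near-equal chunks, one per thread."""
    base, extra = divmod(len(node_ids), num_threads)
    chunks, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(node_ids[start:start + size])
        start += size
    return chunks


def post_reads(qp_ids, node_ids):
    """Hypothetical stand-in for posting RDMA reads.

    Returns (node_id, qp_id) pairs, cycling through the QPs this thread owns.
    """
    return [(nid, qp_ids[i % len(qp_ids)]) for i, nid in enumerate(node_ids)]


def gather(node_ids, num_threads=1, qp_num=QP_NUM):
    """Fan a batch out over num_threads workers, each driving its own QPs."""
    qps = assign_qps(qp_num, num_threads)
    chunks = split_requests(node_ids, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(post_reads, qps, chunks))
    return [pair for part in parts for pair in part]
```

With `num_threads=1` this reduces to the single-thread, multi-QP mode; raising it to 2 gives each thread a disjoint half of the QPs, so no QP is shared across threads.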

enhancement
Doing

We should provide GDR mode for more testing

RDMA Scatter/Gather is a nice way to consolidate data transfers. For example, verbs API allows data at multiple locations to be written in a remote buffer with a SINGLE RDMA...
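To make the gather idea concrete, here is a small pure-Python simulation. It does not call any RDMA library: `Sge` and `rdma_write_sgl` are hypothetical names, and plain `bytearray` objects stand in for registered local memory and the remote buffer. In the verbs API the corresponding pieces are the `sg_list` and `num_sge` fields of a send work request.

```python
from dataclasses import dataclass


@dataclass
class Sge:
    """One scatter/gather element: a slice of local memory (hypothetical)."""
    offset: int
    length: int


def rdma_write_sgl(local_mem, sg_list, remote_buf, remote_offset):
    """Simulate ONE write request that gathers several local segments.

    The segments are copied back to back into remote_buf starting at
    remote_offset. Returns the number of bytes written.
    """
    total = sum(sge.length for sge in sg_list)
    if remote_offset < 0 or remote_offset + total > len(remote_buf):
        raise ValueError("remote buffer too small for gathered data")
    cursor = remote_offset
    for sge in sg_list:
        end = sge.offset + sge.length
        if sge.offset < 0 or sge.length < 0 or end > len(local_mem):
            raise ValueError("segment outside local memory")
        remote_buf[cursor:cursor + sge.length] = local_mem[sge.offset:end]
        cursor += sge.length
    return total


# Three non-contiguous local segments land contiguously in the remote buffer.
local = bytearray(b"AAAA....BBBB....CC")
remote = bytearray(12)
written = rdma_write_sgl(local, [Sge(0, 4), Sge(8, 4), Sge(16, 2)], remote, 1)
```

The point of the sketch is the shape of the request: one call carries a list of segments, rather than one call per segment.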

documentation
Doing

# RDMA TLB Results

call for help: @Aiemu

https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py

## IB Params:

```python
POST_LIST_SIZE = 128
CQ_MOD = 1
QP_NUM = 8
TX_DEPTH = 2048
```

## FeatureDim = 128,...

experiment
Doing

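- [x] End-to-end training accuracy alignment test @eedalong
- [ ] End-to-end training on the Reddit dataset
- [ ] End-to-end training on the Paper100M dataset
- [ ] End-to-end training on the MAG240M dataset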

experiment