Xiulong Yuan
I've spent several days studying TensorPipe and the TensorPipe-based RPC framework provided in PyTorch, and I think this is really a great project. I think tensorpipe...
Could TensorPipe use DPDK to bypass the kernel and avoid memory copies when the user doesn't have RDMA or EFA?
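The NIC itself has multiple processing units (PUs), and in RC mode each QP is bound to one PU. To avoid being limited by the throughput of a single PU, our implementation uses a single thread with multiple QPs to make full use of the CPU. However, we have found that when a single client performs feature aggregation, network bandwidth only reaches about 10.5 GB/s, still roughly 1.5 GB/s short of 12 GB/s. At this point the bottleneck is mainly the client's CPU (for details, [see the test script](https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py)). We therefore need to implement a MultiThread, MultiQP mode to avoid the single-CPU bottleneck. The number of threads should be exposed as a user setting. The default is 1, and a value of 2 should generally be enough to saturate the network.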
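The MultiThread, MultiQP idea can be sketched as a partitioning step: each worker thread owns a disjoint subset of the QPs, and the thread count is a user setting that defaults to 1. The sketch below is only an illustration under assumptions. `partition_qps`, `run_workers`, and `post_fn` are hypothetical names, not part of quiver-feature, and no RDMA call is made. It also only shows the assignment logic: plain Python threads would not by themselves relieve a CPU bottleneck, so a real implementation would likely live in native code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def partition_qps(qp_num: int, num_threads: int = 1) -> List[List[int]]:
    """Split QP indices 0..qp_num-1 across threads in round-robin order.

    Hypothetical helper: each inner list is the set of QPs one thread drives.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    if qp_num < num_threads:
        raise ValueError("need at least one QP per thread")
    return [list(range(t, qp_num, num_threads)) for t in range(num_threads)]


def run_workers(qp_num: int, num_threads: int,
                post_fn: Callable[[int], int]) -> List[List[int]]:
    """Run one worker per thread; each worker touches only its own QPs.

    post_fn stands in for "post work requests on this QP" (an assumption).
    """
    groups = partition_qps(qp_num, num_threads)

    def drive(group: List[int]) -> List[int]:
        return [post_fn(qp) for qp in group]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves the order of the input groups.
        return list(pool.map(drive, groups))


# Default of one thread: a single worker owns every QP.
single = partition_qps(8)
# Two threads: QPs are interleaved between the two workers.
double = partition_qps(8, 2)
results = run_workers(8, 2, lambda qp: qp * 10)
```

Because no QP is shared between threads, the workers need no locking around per-QP state, which is the main reason to partition rather than let every thread post to every QP.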
We should provide a GDR (GPUDirect RDMA) mode for more testing.
RDMA Scatter/Gather is a nice way to consolidate data transfers. For example, verbs API allows data at multiple locations to be written in a remote buffer with a SINGLE RDMA...
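To make the gather side of this concrete, here is a small simulation in plain Python. It does not call any verbs library. `gather_write` and its arguments are hypothetical, and the `(offset, length)` pairs only mimic the role of the scatter/gather entries (`ibv_sge`) attached to a single work request. The point it illustrates is that several non-contiguous local regions land contiguously in the remote buffer as one operation.

```python
from typing import List, Tuple


def gather_write(local_mem: bytes, sg_list: List[Tuple[int, int]],
                 remote_mem: bytearray, remote_offset: int) -> int:
    """Simulate one gather-style write: many local regions, one remote range.

    sg_list holds (offset, length) pairs into local_mem. This is a model of
    the behavior only, not an RDMA implementation.
    """
    for off, length in sg_list:
        if off < 0 or length < 0 or off + length > len(local_mem):
            raise ValueError("scatter/gather entry outside local memory")
    # Concatenate the local regions in list order.
    payload = b"".join(local_mem[off:off + length] for off, length in sg_list)
    end = remote_offset + len(payload)
    if remote_offset < 0 or end > len(remote_mem):
        raise ValueError("remote buffer too small")
    remote_mem[remote_offset:end] = payload
    return len(payload)


local = b"AAAA----BBBB----CCCC"
remote = bytearray(16)
# Three separate 4-byte regions are written as one contiguous 12-byte range.
written = gather_write(local, [(0, 4), (8, 4), (16, 4)], remote, 2)
```

After the call, `remote` holds `AAAABBBBCCCC` starting at offset 2, and `written` is 12.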
# RDMA TLB Results

call for help: @Aiemu

https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py

## IB Params:

```python
POST_LIST_SIZE = 128
CQ_MOD = 1
QP_NUM = 8
TX_DEPTH = 2048
```

## FeatureDim = 128,...
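- [x] End-to-end training accuracy alignment test @eedalong
- [ ] End-to-end training on the Reddit dataset
- [ ] End-to-end training on the Paper100M dataset
- [ ] End-to-end training on the MAG240M dataset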