Xiulong Yuan issues

Results: 27 issues of Xiulong Yuan

Support async collective op execution

A GREAT project, more people should be aware of it!

I've been doing research on tensorpipe and the rpc framework based on tensorpipe provided in pytorch for several days and found this is really a great project. I think tensorpipe...

Is there any plan to integrate DPDK?

Could tensorpipe use DPDK to bypass the kernel and avoid memory copies when the user doesn't have RDMA or EFA?

MultiThread + MultiQP in DistTensorClient

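The NIC itself has multiple processing units (PUs), and when communicating in RC mode each QP is bound to a single PU. To avoid being limited by the throughput of one PU, our implementation uses a single thread with multiple QPs to make full use of the CPU. However, we currently find that when a single client performs feature aggregation, the network bandwidth only reaches about 10.5 GB, still about 2 GB short of 12 GB, and the bottleneck is now mainly the client's CPU (for details, [see the test script](https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py)). We therefore need to implement a MultiThread, MultiQP mode to avoid the single-CPU bottleneck. The number of threads should be exposed to the user as a setting. The default is 1, and a maximum of 2 should generally be enough to saturate the network.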
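The MultiThread, MultiQP idea can be sketched as a partitioning problem: each worker thread owns a disjoint subset of QPs and posts its share of requests on them. The following is only a minimal illustration under stated assumptions. The names `partition_qps`, `post_on_qps`, and `dispatch` are hypothetical and are not part of quiver-feature, and `post_on_qps` merely records which QP would carry each request instead of touching real hardware. A real implementation would most likely live in native code, since Python threads alone would not remove a CPU bottleneck.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_qps(qp_num, num_threads=1):
    """Assign QP indices to threads round-robin (hypothetical helper).

    num_threads defaults to 1, mirroring the default proposed in the issue.
    """
    if qp_num < 1 or num_threads < 1:
        raise ValueError("qp_num and num_threads must be >= 1")
    # Never create more threads than there are QPs to own.
    num_threads = min(num_threads, qp_num)
    return [list(range(t, qp_num, num_threads)) for t in range(num_threads)]


def post_on_qps(qp_ids, requests):
    """Stand-in for posting work requests: pair each request with a QP.

    A real client would post RDMA work requests here; this sketch only
    records the (qp_id, request) assignment.
    """
    return [(qp_ids[i % len(qp_ids)], req) for i, req in enumerate(requests)]


def dispatch(requests, qp_num=8, num_threads=1):
    """Split requests across threads; each thread uses only its own QPs."""
    groups = partition_qps(qp_num, num_threads)
    shards = [requests[t::len(groups)] for t in range(len(groups))]
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        return list(pool.map(post_on_qps, groups, shards))
```

With `qp_num=8` and `num_threads=2`, one thread would own QPs 0, 2, 4, 6 and the other QPs 1, 3, 5, 7, so no QP is shared between threads.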

enhancement

Doing

Auto Placement

GDR Mode Support

We should provide GDR mode for more testing

About RDMA Scatter/Gather & RC QP's max_rd_atomic

RDMA Scatter/Gather is a nice way to consolidate data transfers. For example, the verbs API allows data at multiple locations to be written in a remote buffer with a SINGLE RDMA...
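The gather half of this idea can be modeled in a few lines: several non-contiguous local segments are collected, in order, into one contiguous region of a remote buffer as a single logical write. The sketch below is a pure-Python model of those semantics only. It does not call the verbs API, and the function name `rdma_write_gather` and its `(offset, length)` segment representation are assumptions made for illustration.

```python
def rdma_write_gather(local_mem, sg_list, remote_buf, remote_offset):
    """Model one gathered write (hypothetical, no real RDMA involved).

    local_mem:     bytes-like local memory
    sg_list:       list of (offset, length) pairs into local_mem
    remote_buf:    bytearray standing in for the remote buffer
    remote_offset: where the contiguous payload lands in remote_buf
    Returns the number of bytes written.
    """
    # Gather: concatenate the local segments in list order.
    payload = b"".join(bytes(local_mem[off:off + length]) for off, length in sg_list)
    end = remote_offset + len(payload)
    if remote_offset < 0 or end > len(remote_buf):
        raise ValueError("write would fall outside the remote buffer")
    # One contiguous write on the remote side.
    remote_buf[remote_offset:end] = payload
    return len(payload)


local = b"AAAAxxBBBByyCC"
remote = bytearray(12)
written = rdma_write_gather(local, [(0, 4), (6, 4), (12, 2)], remote, 1)
```

Here the three segments `AAAA`, `BBBB`, and `CC` are skipped over the filler bytes and land back to back starting at offset 1 of `remote`.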

documentation

Doing

RDMA TLB tests under different feature dimensions

# RDMA TLB Results

call for help: @Aiemu

https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py

## IB Params:

```python
POST_LIST_SIZE = 128
CQ_MOD = 1
QP_NUM = 8
TX_DEPTH = 2048
```

## FeatureDim = 128,...

experiment

Doing

End-to-end training performance test data

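- [x] End-to-end training accuracy alignment test @eedalong
- [ ] End-to-end training on the Reddit dataset
- [ ] End-to-end training on the Paper100M dataset
- [ ] End-to-end training on the MAG240M dataset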

experiment