XIE Xuan comments

Results: 8 comments of XIE Xuan

Tianshu large-scale distributed training evaluation report

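# Using Ansible for OneFlow distributed training

Previously, DLPerf used shell scripts over ssh to run OneFlow distributed training. DLPerf focuses on performance and, depending on the conditions, needs to run dozens, hundreds, or even thousands of test cases, so automated, traceable, and reproducible testing is a basic requirement. Ansible is an IT automation platform (tool) for building and operating infrastructure at scale; using it simplifies and automates these OneFlow distributed training tests.

### inventory file

Ansible can operate on multiple hosts that belong to one group at the same time. The relationship between groups and hosts is configured in the `inventory` file. The default file path is `/etc/ansible/hosts`; besides the default file, you can also use or specify other `inventory` files. For DLPerf, our `inventory` file is grouped by node count, and each group name carries the node count, for example:

```ini
[hosts_1]
10.244.111.4

[hosts_2]
10.244.111.4
10.244.1.14

[hosts_4]
10.244.111.4
10.244.1.14
10.244.1.15
10.244.1.16
```

The `*` in `hosts_*` stands for the node count, which makes a group easy to select. ...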
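The grouping by node count described above can be generated rather than written by hand. The following is a minimal sketch, not part of DLPerf: the helper name `build_inventory` and the `group_sizes` parameter are hypothetical, and it assumes each `hosts_<n>` group simply takes the first `n` addresses from one ordered host list, as in the example inventory.

```python
def build_inventory(hosts, group_sizes=(1, 2, 4)):
    """Return INI-style inventory text with one hosts_<n> group per size.

    Hypothetical helper: each group holds the first n hosts of the list.
    """
    sections = []
    for n in group_sizes:
        if n > len(hosts):
            raise ValueError(f"need {n} hosts, only {len(hosts)} given")
        # Group header followed by one host address per line.
        sections.append("\n".join([f"[hosts_{n}]", *hosts[:n]]))
    return "\n\n".join(sections) + "\n"


# Addresses taken from the example inventory above.
hosts = ["10.244.111.4", "10.244.1.14", "10.244.1.15", "10.244.1.16"]
inventory_text = build_inventory(hosts)
print(inventory_text)
```

The resulting text can be saved to a file and passed to Ansible with `-i`, so that a group name such as `hosts_4` selects the four-node case.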

The evaluation report has no accuracy or similar metrics

Thank you very much for your interest! DLPerf focuses on performance metrics. Accuracy metrics are available in the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark) repository:

- For the accuracy of resnet50 on ImageNet, see [here](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/Classification/cnns#%E9%A2%84%E8%AE%AD%E7%BB%83%E6%A8%A1%E5%9E%8B)
- For BERT scores on several downstream tasks, see [here](https://github.com/Oneflow-Inc/OneFlow-Benchmark/tree/master/LanguageModeling/BERT)

nn.pairwisedistance() and Variable are not yet supported

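You can refer to [this short snippet](https://github.com/Oneflow-Inc/models/blob/ctr_benchmark_test/RecommenderSystems/dlrm/dlrm_train_eval.py#L273-L279)

```
import numpy as np
import oneflow as flow

# table_size_array is defined earlier in the linked script.
# One uniform-initializer range per embedding table: [-sqrt(1/size), sqrt(1/size)]
scales = np.sqrt(1 / np.array(table_size_array))
tables = [
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
    )
    for scale in scales
]
```

@wjy3326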
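To see what ranges this snippet produces, the scale computation can be reproduced in plain Python without OneFlow. This is only an illustrative sketch: the table sizes and the `sample_row` helper are hypothetical, and `random.Random.uniform` stands in for the uniform initializer.

```python
import math
import random

# Hypothetical table sizes (number of rows per embedding table).
table_size_array = [1000, 40000, 250]

# Same formula as the snippet: scale = sqrt(1 / table_size).
scales = [math.sqrt(1 / size) for size in table_size_array]

rng = random.Random(0)


def sample_row(scale, dim=4):
    """Draw one embedding row uniformly from [-scale, scale]."""
    return [rng.uniform(-scale, scale) for _ in range(dim)]


rows = [sample_row(scale) for scale in scales]
for size, scale in zip(table_size_array, scales):
    print(f"table size {size}: init range [-{scale:.5f}, {scale:.5f}]")
```

Larger tables get a narrower initialization range, since the scale shrinks with the square root of the table size.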

Dlrm benchmark test

Regarding the following options:

```
export CUDA_DEVICE_MAX_CONNECTIONS=32
export ONEFLOW_EP_CUDA_STREAM_FLAGS=1
export ONEFLOW_RAW_READER_PREFETCHING_QUEUE_DEPTH=512
export ONEFLOW_RAW_READER_NUM_WORKERS=1
export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
numactl --interleave=all \
```

I ran a set of experiments and recorded the average latency (ms) over 74000 iterations. The results are as follows:

ON | OFF
-- | --
1.41855692 | 1.44409019
1.42942288 | 1.43027312
1.42626776...
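A comparison like this can be summarized as a relative change in mean latency. The sketch below uses only the two complete rows visible in this excerpt, so its output is illustrative and not a conclusion about the full experiment; the function name `relative_change` is hypothetical.

```python
from statistics import mean

# Only the two complete rows visible in the excerpt above (latency in ms).
on_ms = [1.41855692, 1.42942288]
off_ms = [1.44409019, 1.43027312]


def relative_change(on, off):
    """Relative change of mean latency with the options ON versus OFF.

    A negative value means the ON runs were faster on average.
    """
    return (mean(on) - mean(off)) / mean(off)


change = relative_change(on_ms, off_ms)
print(f"mean ON:  {mean(on_ms):.6f} ms")
print(f"mean OFF: {mean(off_ms):.6f} ms")
print(f"relative change: {change:+.2%}")
```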

XIE Xuan

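Tianshu large-scale distributed training evaluation report

The evaluation report has no accuracy or similar metrics

nn.pairwisedistance() and Variable are not yet supported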

Dlrm benchmark test

CNN benchmark cannot run

CNN benchmark cannot run

Algorithms to be merged in the future

Dev deepfm for check