DLRM benchmark test
DLRM benchmark test scripts
Regarding the following options:
export CUDA_DEVICE_MAX_CONNECTIONS=32
export ONEFLOW_EP_CUDA_STREAM_FLAGS=1
export ONEFLOW_RAW_READER_PREFETCHING_QUEUE_DEPTH=512
export ONEFLOW_RAW_READER_NUM_WORKERS=1
export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
numactl --interleave=all \
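I ran a set of experiments with all of these options enabled (ON) and all disabled (OFF). Each value is the average latency (ms) over 74,000 iterations. The results are as follows: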
ON | OFF |
---|---|
1.41855692 | 1.44409019 |
1.42942288 | 1.43027312 |
1.42626776 | 1.43327031 |
1.43100398 | 1.43726633 |
1.43247646 | 1.43108837 |
1.43085669 | 1.4360571 |
1.4250376 | 1.43052549 |
1.4246417 | 1.44208097 |
1.42638928 | 1.43673026 |
1.43390266 | 1.43774178 |
1.42238418 | 1.43597748 |
1.43701162 | 1.43563187 |
1.42529816 | 1.43994857 |
1.42365005 | 1.43631018 |
1.43174504 | 1.43489774 |
1.42973357 | 1.43393828 |
1.4347752 | |
1.43040477 | |
Summary statistics:
Statistic | ON (ms) | OFF (ms) |
---|---|---|
mean | 1.4285 | 1.4360 |
max | 1.4370 | 1.4441 |
min | 1.4186 | 1.4303 |
std | 0.0048 | 0.0039 |
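With all options enabled, mean latency improves by about 8 us (1.4285 ms vs. 1.4360 ms, roughly 0.5%). That is a very small gain, so we will not keep these options for now.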
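
The summary statistics can be recomputed from the per-run values with a short script. This is a minimal sketch, not the script originally used to produce the table. It assumes the `std` row is the sample standard deviation (n - 1), which is consistent with the reported values. The names `on_ms`, `off_ms`, and `summarize` are illustrative only.

```python
import statistics

# Per-run average latencies (ms), copied from the results table.
on_ms = [
    1.41855692, 1.42942288, 1.42626776, 1.43100398, 1.43247646, 1.43085669,
    1.4250376, 1.4246417, 1.42638928, 1.43390266, 1.42238418, 1.43701162,
    1.42529816, 1.42365005, 1.43174504, 1.42973357, 1.4347752, 1.43040477,
]
off_ms = [
    1.44409019, 1.43027312, 1.43327031, 1.43726633, 1.43108837, 1.4360571,
    1.43052549, 1.44208097, 1.43673026, 1.43774178, 1.43597748, 1.43563187,
    1.43994857, 1.43631018, 1.43489774, 1.43393828,
]


def summarize(samples):
    """Return mean, max, min, and sample standard deviation (n - 1)."""
    return {
        "mean": statistics.mean(samples),
        "max": max(samples),
        "min": min(samples),
        "std": statistics.stdev(samples),  # assumption: sample std, not population
    }


on_stats = summarize(on_ms)
off_stats = summarize(off_ms)

# Difference of the means, converted from milliseconds to microseconds.
diff_us = (off_stats["mean"] - on_stats["mean"]) * 1000.0

for name in ("mean", "max", "min", "std"):
    print(f"{name:>4} | {on_stats[name]:.4f} | {off_stats[name]:.4f} |")
print(f"mean improvement with options ON: {diff_us:.1f} us")
```

The gap between the two means is only about two standard deviations of either group, which is consistent with treating the improvement as marginal.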