Which script is used for the BERT training benchmark?
Which script is used for the BERT training benchmark? I see there are two kinds of scripts: one for pre-training, e.g. train.py, and the other for fine-tuning, e.g. run_classify.py. Which one is used for the benchmark?
@luotao1
We use run_classify.py.
@luotao1 For ParallelExecutor, how does your QA team calculate the benchmark result? Is it "speed * CPU_NUM" or just speed?
@luotao1 Any feedback on this question?
- 1 CPU_NUM: speed xxx
- 16 CPU_NUM: speed xxx
We don't use speed * CPU_NUM, which is for throughput.
Then how do we measure whether the speed is comparable with V100? e.g. V100: BS=1, speed 3.4 steps/s; Xeon: BS=1, CPU_NUM=8, speed 0.43 steps/s.
Are they identical?
It is not identical. Does BS=1, CPU_NUM=8, speed 0.43 steps/s mean BS=1, CPU_NUM=1, speed 0.43/8 steps/s? The speed may not scale linearly as CPU_NUM increases. You can give the result for BS=1, CPU_NUM=ALL.
Yes, the speed is not linear with CPU_NUM, but I checked the code and found that this speed reflects the iteration execution time, not the number of processed samples. It means that for each iteration, the number of processed samples is actually batch_size * CPU_NUM. I can confirm this.
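Under this interpretation, a minimal sketch (my own, not code from the benchmark repo) of converting the logged speed into an effective throughput would look like this, assuming the logged speed is iterations per second:

```python
# Minimal sketch, not from the benchmark repo: convert the logged "speed"
# (iterations per second) into effective samples/s, given that each
# iteration processes batch_size * CPU_NUM samples under data parallelism.
def samples_per_second(steps_per_sec, batch_size, cpu_num):
    return steps_per_sec * batch_size * cpu_num
```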
So my question is: for CPU vs. GPU, we may not be able to compare the speed output from the logs directly, given that CPU_NUM is a virtual concept used to exploit multiple CPU cores for data parallelism, while a GPU needs additional discrete cards to scale out to multi-node. This speed is more like latency.
We can report different speeds for different CPU_NUM values, but how to compare them fairly with GPU is what I want to ask.
How about computing samples/s to compare CPU and GPU?
I see this calculation logic in the benchmark run.sh: it uses samples/s and counts both CPU_NUM and BS. I think that makes more sense.
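As a rough illustration (my own arithmetic, not an official result), applying that samples/s conversion to the numbers quoted earlier in this thread, and assuming a single V100 card where one step processes one batch:

```python
# Rough illustration using the figures quoted above; the single-card
# assumption for the V100 number is mine.
gpu_samples_per_sec = 3.4 * 1        # V100: 3.4 steps/s * BS=1
cpu_samples_per_sec = 0.43 * 1 * 8   # Xeon: 0.43 steps/s * BS=1 * CPU_NUM=8
print(gpu_samples_per_sec, cpu_samples_per_sec)  # 3.4 vs 3.44 samples/s
```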