
Which script is used for the BERT training benchmark?

Open LeoZhao-Intel opened this issue 6 years ago • 10 comments

Which script is used for the BERT training benchmark? I see there are two kinds of scripts: one for pre-training, e.g. train.py, and one for fine-tuning, e.g. run_classify.py. Which one is used for the benchmark?

LeoZhao-Intel avatar May 22 '19 07:05 LeoZhao-Intel

@luotao1

LeoZhao-Intel avatar May 22 '19 07:05 LeoZhao-Intel

We use run_classify.py.

luotao1 avatar May 22 '19 10:05 luotao1

@luotao1 For ParallelExecutor, how does your QA team calculate the benchmark result? Is it "speed * CPU_NUM" or just speed?

LeoZhao-Intel avatar May 23 '19 08:05 LeoZhao-Intel

@luotao1 For ParallelExecutor, how does your QA team calculate the benchmark result? Is it "speed * CPU_NUM" or just speed?

@luotao1 Any feedback on this question?

LeoZhao-Intel avatar May 28 '19 06:05 LeoZhao-Intel

  • 1 CPU_NUM: speed xxx
  • 16 CPU_NUM: speed xxx

We don't use speed * CPU_NUM, which is for throughput.

luotao1 avatar May 28 '19 06:05 luotao1

Then how do we judge whether the speed is comparable with a V100? E.g. V100: BS=1, speed 3.4 steps/s; Xeon: BS=1, CPU_NUM=8, speed 0.43 steps/s.

Are they identical?

LeoZhao-Intel avatar May 28 '19 06:05 LeoZhao-Intel

It is not identical. Does BS=1, CPU_NUM=8 at 0.43 steps/s mean that BS=1, CPU_NUM=1 runs at 0.43/8 steps/s? Also, the speed may not scale linearly as CPU_NUM increases. You can report the result for BS=1, CPU_NUM=ALL.

luotao1 avatar May 28 '19 07:05 luotao1

Yes, speed is not linear with CPU_NUM. But I checked the code and found that this speed reflects the execution time of one iteration, not the number of samples actually processed. That is, each iteration actually processes batch_size * CPU_NUM samples. I can confirm this.

So my question is about CPU vs. GPU: we may not be able to compare the speed values from the log directly. CPU_NUM is a virtual concept that spreads work across CPU cores for data parallelism, while a GPU needs additional discrete cards to scale out to multiple nodes. This speed is more like a latency figure.

We can report different speeds for different CPU_NUM values, but how to compare them fairly with a GPU is what I want to ask.

LeoZhao-Intel avatar May 28 '19 08:05 LeoZhao-Intel

but how to compare them with GPU fairly, that is what I want to ask.

How about computing samples/s to compare CPU and GPU?

luotao1 avatar May 29 '19 04:05 luotao1

I see this calculation logic in the benchmark run.sh, which uses samples/s and accounts for both CPU_NUM and BS. I think that makes more sense.

LeoZhao-Intel avatar May 29 '19 05:05 LeoZhao-Intel
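
For readers following the thread, the samples/s conversion discussed above can be sketched roughly as follows. This is only an illustration, not the actual logic in the benchmark run.sh: the function name `samples_per_second` is hypothetical, and it assumes that each logged step processes batch_size samples on every data-parallel worker (CPU_NUM on CPU) and that the V100 figure quoted earlier comes from a single card.

```python
def samples_per_second(steps_per_sec: float, batch_size: int, num_workers: int) -> float:
    """Convert a logged steps/s value into samples/s.

    Assumes every step processes `batch_size` samples on each of
    `num_workers` data-parallel workers (CPU_NUM on CPU, cards on GPU).
    """
    return steps_per_sec * batch_size * num_workers


# Figures quoted in the thread; the single-card V100 setup is an assumption.
v100 = samples_per_second(3.4, batch_size=1, num_workers=1)
xeon = samples_per_second(0.43, batch_size=1, num_workers=8)

print(f"V100: {v100:.2f} samples/s")
print(f"Xeon: {xeon:.2f} samples/s")
```

Under these assumptions the two log values, which look very different as steps/s, end up on the same scale once the per-step sample count is taken into account.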