benchmark --verify error

Open Luo-Liang opened this issue 6 years ago • 2 comments

Hi!

I'm testing the benchmark program. When I use the --verify flag, it fails with:

what(): [enforce fail at /home/ubuntu/gloo/gloo/benchmark/main.cc:91] T(offset + expected) == input[i]. 2.4e+07 vs 2.4e+07. Mismatch at index: 375000
terminate called after throwing an instance of 'gloo::EnforceNotMet'

The command I used is:

benchmark -s ${totalClients} -r ${idx} -h xxx.xxx.xxx.xxx -p 6379 -t tcp --sync true --inputs 1 --elements 1000000 --iteration-count 1 --verify allreduce_ring_chunked

I ran this across 8 machines, so ${totalClients}=8 and ${idx} ranges from 0 to 7.

Did I do something obviously wrong?

This is running on Ubuntu with the latest master checked out.

Thanks!

Luo-Liang avatar Apr 05 '18 06:04 Luo-Liang

Hi there! Thanks for reporting the issue. I think this is expected given the number of elements you're using in your test in combination with the number of machines. The verification code fills all input buffers with sequential values, offset by the collective rank, and strided by the collective size. I think that at index 375000 we run into numerical mismatches for floating point numbers. To fix this we should only use integer math in verification mode.
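As a rough illustration of that explanation (not the benchmark's actual code), here is a small Python sketch. It assumes the buffers hold 32-bit floats and that, with one input per machine, rank r stores index * size + r at each index; both are assumptions read off the description above. The helper names f32 and f32_sum are made up for this example. The sketch adds the eight values at index 375000 in float32, starting from a different rank each time, and compares each total with the exact integer sum. Past 2**24 = 16777216 a float32 can no longer hold every integer, so the total can depend on the order of the additions.

```python
import struct

def f32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

def f32_sum(values):
    """Accumulate left to right, rounding to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc

SIZE = 8          # number of machines in the report
INDEX = 375000    # index named in the error message

# Assumed fill pattern: rank r holds INDEX * SIZE + r at this index.
values = [INDEX * SIZE + r for r in range(SIZE)]
exact = sum(values)  # plain integer math, no rounding

# Sum the same values starting from each rank in turn, as a stand-in
# for the different reduction orders a ring algorithm may use.
results = {
    start: f32_sum(values[start:] + values[:start])
    for start in range(SIZE)
}

print("exact integer sum:", exact)
for start, total in results.items():
    status = "ok" if total == exact else "MISMATCH"
    print(f"start rank {start}: {total:.1f} {status}")
```

Under these assumptions the exact sum is 24000028. Starting at rank 0 happens to reproduce it, while starting at rank 3 yields 24000026. Both values print as 2.4e+07 at six significant digits, which would explain why the error message shows two numbers that look identical.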

pietern avatar Apr 05 '18 15:04 pietern

Thank you! :)

Luo-Liang avatar Apr 07 '18 03:04 Luo-Liang