MichaelHsu170
Both nodes have 200Gb NICs. I measured an average speed of around 450Mb/s with iftop (`sudo iftop -n`).
By the way, are there any recommended configurations to run data parallel training with VGG16 on 2 nodes? For example, how many workers should we start, how many servers should...
Hi @ymjiang , Happy Chinese New Year!!! Sorry, for scenario 2 it is 3.4 img/sec. I'll try the benchmark tool for networking performance measurement. Thank you.
Hi @ymjiang , We tried the basic benchmark mentioned in https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark, but it failed. Could you suggest how we can get it working? Thank you. We ran 2 scenarios: 1....
It shows unlimited. `$ ulimit -l unlimited`
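For reference, the same locked-memory limit can also be read from inside a Python process, which helps confirm that the worker or server process actually inherits the value the shell reports. This is only a minimal sketch, assuming Linux and the standard-library `resource` module; the helper names `memlock_limit` and `describe` are hypothetical and not part of BytePS or ps-lite. Note that `ulimit -l` prints KiB, while `getrlimit` returns bytes.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values for this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Format a limit the way `ulimit -l` would label it (hypothetical helper)."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("soft memlock limit:", describe(soft))
    print("hard memlock limit:", describe(hard))
```

If the soft limit printed here is not `unlimited` even though the shell shows it, the process was probably launched from an environment (for example a container or a service manager) with its own limits.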
I tried this scenario on 2 machines: machine A (scheduler, server, worker) and machine B (server, worker). But the processes on machine B still crashed with the error message ` what(): [08:40:36] src/./rdma_transport.h:144:...
If the scheduler, 1 server, and 1 worker run on the same machine, the scheduler crashes with `terminate called after throwing an instance of 'dmlc::Error' what(): [09:45:09] src/./rdma_van.h:747: Check failed: 0 OnEvent:...
Hi @ymjiang , any recommendations would be greatly appreciated. Thank you.