
MPI_ERR_TRUNCATE when running the ImageNet dataset on ResNet-18

GeKeShi opened this issue 5 years ago · 2 comments

Hello, I'm trying to test this code on ImageNet, but when the program reaches `self.comm.Bcast([self.model_recv_buf.recv_buf[layer_idx], MPI.DOUBLE], root=0)` in the `async_fetch_weights_bcast` function of `distributed_worker.py` at the first step, it throws the error `MPI_ERR_TRUNCATE: message truncated`. I checked the memory size passed to `Bcast`, and the same code works when run on CIFAR-10/100. Have you encountered this problem?
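
For context, here is a minimal mpi4py sketch (standalone, not from this repo) that reproduces the same failure mode: the root broadcasts more doubles than a receiving rank's buffer can hold, so the receivers raise `MPI_ERR_TRUNCATE`.

```python
# repro_truncate.py -- hypothetical standalone repro, not part of ps_pytorch.
# Run with: mpirun -np 2 python repro_truncate.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Root broadcasts 1000 doubles, but every other rank only allocates 500,
# so the non-root ranks fail with MPI_ERR_TRUNCATE: message truncated.
count = 1000 if rank == 0 else 500
buf = np.zeros(count, dtype=np.float64)

comm.Bcast([buf, MPI.DOUBLE], root=0)
```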

And another issue: when I replaced PyTorch 0.3.0 with PyTorch 0.4/1.1, the decode time of QSGD is significantly higher than on 0.3.0, almost 10 times higher. Have you seen this?

GeKeShi · May 24, 2019

@GeKeShi Sorry for the late response.

i) For the first issue you reported: the error usually occurs when the local receive buffer (on the PS or on a worker node) is smaller than the message sent from the source node. That should only depend on the model you're using, which doesn't seem to match what you're seeing (it works for CIFAR-10/100 but not for ImageNet). Can you share more details on this issue, e.g. point me to your fork? Also, the following change might help: switch the `async_fetch_weights_bcast` function to this version, https://github.com/hwang595/ps_pytorch/blob/master/src/distributed_worker.py#L221-L231, which compresses the model with a lossless compression tool so that each node can maintain a smaller local receive buffer. Please note that you also need the corresponding change on the PS end: https://github.com/hwang595/ps_pytorch/blob/master/src/sync_replicas_master_nn.py#L218-L225.
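
To make the idea concrete, here is a hedged sketch of the compress-then-broadcast pattern (the flow below is illustrative, not the repo's actual code; see the linked lines for the real change): the PS serializes and losslessly compresses the weights, broadcasts the exact payload size first, and each worker allocates a receive buffer of exactly that size before the second `Bcast`.

```python
# Hedged sketch of compress-before-Bcast; not the repo's actual API.
import pickle
import zlib

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # PS end: serialize the weight arrays and compress them losslessly.
    layers = [np.random.randn(64, 64) for _ in range(4)]
    payload = bytearray(zlib.compress(pickle.dumps(layers)))
else:
    payload = None

# Broadcast the exact compressed size first, so every worker can allocate a
# receive buffer that matches the incoming message (no MPI_ERR_TRUNCATE).
size = comm.bcast(len(payload) if rank == 0 else None, root=0)
if rank != 0:
    payload = bytearray(size)

comm.Bcast([payload, MPI.BYTE], root=0)

if rank != 0:
    # Worker end: decompress and deserialize the received weights.
    layers = pickle.loads(zlib.decompress(bytes(payload)))
```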

ii) The decoding function is written in NumPy, so I currently have no idea why the PyTorch version would affect its speed. Can you share the NumPy and Python versions on your end? Also, did you try to locate which part is the performance bottleneck?
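
If it helps, a generic way to locate the bottleneck is to wrap the decode call in `cProfile`; `qsgd_decode` below is only a toy stand-in for the repo's actual decode routine.

```python
# Generic profiling sketch; qsgd_decode is a stand-in, not the repo's function.
import cProfile
import pstats

import numpy as np

def qsgd_decode(norm, signs, levels):
    # Toy dequantization: reconstruct the gradient as norm * sign * level.
    return norm * signs * levels

n = 10_000_000
signs = np.sign(np.random.randn(n))
levels = np.random.randint(0, 128, size=n).astype(np.float64)

profiler = cProfile.Profile()
profiler.enable()
qsgd_decode(0.5, signs, levels)   # replace with the real decode call
profiler.disable()

# Print the top 10 calls by cumulative time to see where the time goes.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```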

Hope this helps.

hwang595 · Jun 2, 2019

Thanks for your reply, here are some details: i) I changed the model implementation to the one from torchvision and the size mismatch was solved. ii) NumPy is 1.12.1 and PyTorch is 1.1.0; the program output when training ResNet-18 on CIFAR-10 is as follows:

Worker: 2, Step: 98, Epoch: 0 [3104/50000 (6%)], Loss: 1.8777, Time Cost: 5.1246, Comp: 0.0225, Encode:  4.9213, Comm:  0.0906, Msg(MB):  25.5414, Prec@1:  25.0000, Prec@5:  87.5000
Worker: 1, Step: 98, Epoch: 0 [3104/50000 (6%)], Loss: 2.0034, Time Cost: 5.2109, Comp: 0.0224, Encode:  4.9971, Comm:  0.1011, Msg(MB):  25.5444, Prec@1:  18.7500, Prec@5:  81.2500
Master: Step: 98, Decode Cost: 130.71253109, Cur lr 0.0095, Gather: 5.15524792671

Meanwhile, the output from PyTorch 0.3.0 is:

Worker: 1, Step: 432, Epoch: 1 [5120/50000 (10%)], Loss: 1.0311, Time Cost: 6.9317, Comp: 0.6725, Encode:  5.8544, Comm:  0.1337, Msg(MB):  30.4401, Prec@1:  64.0625, Prec@5:  96.0938
Worker: 2, Step: 432, Epoch: 1 [5120/50000 (10%)], Loss: 1.2026, Time Cost: 7.0075, Comp: 0.8195, Encode:  5.7536, Comm:  0.1939, Msg(MB):  30.4277, Prec@1:  53.9062, Prec@5:  93.7500
Master: Step: 432, Decode Cost: 14.2414638996, Cur lr 0.00663420431289, Gather: 6.93086600304

GeKeShi · Jun 10, 2019