ps_pytorch
Why use float64 for the gradient?
Hi wang,
I'm just wondering why you convert the gradient tensor to float64; I'd have thought float32 would do, since it's already more precise than SGD requires.
https://github.com/hwang595/ps_pytorch/blob/89a1cfa136b957073576fae827d39ef0fb09d2fc/src/distributed_worker.py#L258
Hey Kaiyu, @Stonesjtu. Thanks for pointing this out. While developing this prototype I hit an issue where np.float32 couldn't be fully converted to MPI.FLOAT and vice versa. It could be a potential issue in mpi4py, but I could be wrong. I haven't tested the current version combination of PyTorch and mpi4py under the gradient compression setting, but I will do that ASAP and report the results in this thread.
If you want, you can raise a PR. Any contribution is highly appreciated.
Thanks!
I have tested np.float32 without a problem. And I don't quite understand what "fully converted" means.
Sorry for being confusing, @Stonesjtu. The issue I mentioned was related to this line, which I wrote for an old version that had no gradient compression strategy; each worker just sent the raw gradient matrices as numpy arrays.
To send numpy arrays directly, mpi4py provides a series of APIs with capitalized names, e.g. Isend and Irecv (http://mpi4py.scipy.org/docs/usrman/tutorial.html#point-to-point-communication), where the user needs to specify an MPI datatype, e.g. MPI.FLOAT or MPI.DOUBLE (as I did in this line). The issue is, if .astype(np.float32) and MPI.FLOAT are specified, wrong data is received on the parameter server side. In my test, only np.float64 with MPI.DOUBLE worked. Please feel free to try it if you're curious.
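For intuition, the uppercase mpi4py APIs transfer raw buffers, so the numpy dtype and the MPI datatype must describe the same byte layout; if they disagree, the receiver reinterprets the bytes and gets garbage. Here's a minimal numpy-only sketch of that reinterpretation (this is my illustration, not the repo's code; `view()` mimics what a mismatched receive would do):

```python
import numpy as np

# A worker's gradient buffer in float32.
grads = np.arange(4, dtype=np.float32)

# Matching dtype on both ends: values survive the round trip.
roundtrip = grads.view(np.float32)
print(np.allclose(roundtrip, grads))  # True

# Mismatched dtype: the same 16 bytes reinterpreted as float64
# yield only 2 elements, and the values are garbage.
as_double = grads.view(np.float64)
print(as_double.size)  # 2
```

The same byte-layout reasoning applies to an `Isend` posted with one MPI datatype and an `Irecv` posted with another.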
However, all of the foregoing applies to an old version. And you're right, the new version with gradient compression works with np.float32 without any problem. I've already made the change on the master branch.
According to my test (on a cluster of 17 m4.2xlarge AWS EC2 instances, 1 parameter server + 16 workers), changing from np.float64 to np.float32 gives approximately a 35% speedup in communication and an 11% speedup in per-iteration runtime.
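The communication speedup is consistent with simply halving the bytes on the wire. A quick sketch (the matrix shape here is arbitrary, just for illustration):

```python
import numpy as np

# A gradient matrix as a worker might push to the parameter server.
g64 = np.zeros((1024, 1024), dtype=np.float64)
g32 = g64.astype(np.float32)

print(g64.nbytes)  # 8388608
print(g32.nbytes)  # 4194304: the payload per push is halved
```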
Thanks a lot for your contribution!
So, will you try float16 to see the speedup gain? I think half precision is enough in most cases.
Actually, I think what's interesting is to add a --half-precision argument. To be more specific, when half-precision is enabled, all computation on the PyTorch side would use HalfTensor (https://pytorch.org/docs/stable/tensors.html#torch.Tensor.half) and all gradient matrices in numpy would then be converted to np.float16. That way, both computation and communication scale better.
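A hypothetical sketch of the numpy side of such a mode (the flag and variable names are my assumptions, not code from the repo; on the PyTorch side one would call tensor.half() before extracting the gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
g32 = rng.standard_normal((256, 256)).astype(np.float32)

# Cast the gradient matrix to half precision before sending.
g16 = g32.astype(np.float16)

print(g16.nbytes * 2 == g32.nbytes)  # True: half the bytes on the wire

# float16 keeps roughly 3 decimal digits, usually tolerable for SGD gradients.
print(np.max(np.abs(g16.astype(np.float32) - g32)) < 1e-2)  # True
```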
If that's what you're suggesting, then yes, I'm planning on it. Please feel free to do it if you want, any PR is appreciated.
I do think simply transferring float16 helps a lot to reduce the communication overhead.