
Why use float64 for gradients

Stonesjtu opened this issue 5 years ago · 6 comments

Hi wang, I'm just wondering why you convert the gradient tensor to float64. I thought it would just be float32, which should already be more accurate than SGD requires.

https://github.com/hwang595/ps_pytorch/blob/89a1cfa136b957073576fae827d39ef0fb09d2fc/src/distributed_worker.py#L258

Stonesjtu avatar Jul 18 '18 07:07 Stonesjtu

Hey Kaiyu (@Stonesjtu), thanks for pointing this out. There was an issue I ran into while developing this prototype: np.float32 couldn't be mapped cleanly to MPI.FLOAT and vice versa. It could be a potential issue in mpi4py, but I could be wrong. I haven't tested the current combination of PyTorch and mpi4py versions under the gradient compression setting, but I will do that ASAP and report the result in this thread. If you want, you can raise a PR; any contribution is highly appreciated.

Thanks!

hwang595 avatar Jul 19 '18 09:07 hwang595

I have tested np.float32 without any problem, and I don't quite understand what "fully converted" means.

Stonesjtu avatar Jul 19 '18 11:07 Stonesjtu

Sorry for the confusion @Stonesjtu. The issue I mentioned was related to this line, which I wrote for an old version where there wasn't any gradient compression strategy and each worker just sent the raw gradient matrices as numpy arrays.

To send numpy arrays directly, mpi4py provides a family of uppercase APIs, e.g. Isend, Irecv, etc. (http://mpi4py.scipy.org/docs/usrman/tutorial.html#point-to-point-communication), where the user has to specify the MPI datatype, e.g. MPI.FLOAT or MPI.DOUBLE (as I did in this line). The issue is that when .astype(np.float32) and MPI.FLOAT were specified together, wrong data was received on the parameter server side. In that case, based on my test, only np.float64 with MPI.DOUBLE worked. Please feel free to try it if you're curious.
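For reference, here is a minimal sketch of the buffer-based (uppercase) mpi4py calls being described; the array shape, ranks, and tag are illustrative only, not the actual ps_pytorch code:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 1:  # a worker pushing its gradient
    grad = np.random.randn(4, 4).astype(np.float64)
    # The MPI datatype must match the numpy dtype: MPI.DOUBLE pairs
    # with np.float64 (MPI.FLOAT would pair with np.float32).
    req = comm.Isend([grad, MPI.DOUBLE], dest=0, tag=88)
    req.Wait()
elif rank == 0:  # the parameter server receiving it
    recv_buf = np.empty((4, 4), dtype=np.float64)
    req = comm.Irecv([recv_buf, MPI.DOUBLE], source=1, tag=88)
    req.Wait()
```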

However, all of the foregoing applies to an old version. And you're right, the new version with gradient compression works with np.float32 without any problem. I have already made the change on the master branch.

According to my test (on a cluster of 17 m4.2xlarge AWS EC2 instances, 1 parameter server + 16 workers), changing from np.float64 to np.float32 gives approximately a 35% speedup in communication and an 11% speedup in per-iteration runtime.
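As a rough sanity check on those numbers, halving the bytes per element caps the communication speedup at about 50%, so an observed ~35% end-to-end gain is plausible once fixed per-message overhead is accounted for. A back-of-the-envelope payload comparison (the parameter count below is an assumed example, not the model used in the test):

```python
import numpy as np

n_params = 25_000_000  # illustrative model size
mb_fp64 = n_params * np.dtype(np.float64).itemsize / 1e6
mb_fp32 = n_params * np.dtype(np.float32).itemsize / 1e6
print(f"fp64: {mb_fp64:.0f} MB, fp32: {mb_fp32:.0f} MB per gradient push")
```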

Thanks a lot for your contribution!

hwang595 avatar Jul 19 '18 18:07 hwang595

So, will you try float16 to see the speedup gain? I think half precision is enough in most cases.

Stonesjtu avatar Jul 20 '18 15:07 Stonesjtu

Actually, I think what would be interesting is to add a --half-precision argument. To be more specific, when half precision is enabled, all computation on the PyTorch side would be done with HalfTensor (https://pytorch.org/docs/stable/tensors.html#torch.Tensor.half) and all gradient matrices would then be converted to np.float16 on the numpy side. In that case, both computation and communication would scale better.

If that's what you're suggesting, then yes, I'm planning on it. Please feel free to do it if you want; any PR is appreciated.
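For concreteness, a hypothetical sketch of what such a flag could look like; the --half-precision name and the helper below are illustrative, not part of the repo:

```python
import argparse
import numpy as np
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--half-precision', action='store_true',
                    help='compute in fp16 on the PyTorch side and ship fp16 gradients')
args = parser.parse_args()

model = torch.nn.Linear(128, 10)
if args.half_precision:
    model = model.half()  # parameters become torch.HalfTensor

def grad_to_numpy(param, half=False):
    """Pull a parameter's gradient out as a numpy array for the MPI send."""
    grad = param.grad.detach().cpu().numpy()
    # fp16 halves the bytes on the wire relative to fp32
    return grad.astype(np.float16) if half else grad.astype(np.float32)
```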

hwang595 avatar Jul 20 '18 19:07 hwang595

I do think simply transferring float16 would help a lot in reducing the communication overhead.

Stonesjtu avatar Jul 23 '18 13:07 Stonesjtu