Ray-trained AlexNet model performance is worse than locally trained
I'm training an AlexNet encoder for an image classification problem. Model performance is noticeably worse with the ray backend and num_workers=4 than when the model is trained locally or with ray and num_workers=1.
I've created a test case based on the MNIST example. The test data consists of 300 MNIST images (100 images each of digits 7, 8, and 9). I've attached the test program and its outputs for the following:
- (a) local backend: the model correctly predicts 28 of 30.
- (b) ray with num_workers=1: the model correctly predicts 26 of 30.
- (c) ray with num_workers=4: the model correctly predicts 10 of 30, and the predictions appear to be constant.
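For reference, a minimal sketch of how the three backends are specified in the test program (the encoder and trainer settings below are placeholders standing in for the ones in the attached program):

```python
from ludwig.api import LudwigModel

config = {
    "input_features": [
        {"name": "image_path", "type": "image", "encoder": {"type": "alexnet"}}
    ],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {"epochs": 5},  # placeholder epoch count
}

# (a) local backend
model_local = LudwigModel(config, backend="local")

# (b) ray backend, single worker
model_ray1 = LudwigModel(config, backend={"type": "ray", "trainer": {"num_workers": 1}})

# (c) ray backend, four workers -- the case with degraded accuracy
model_ray4 = LudwigModel(config, backend={"type": "ray", "trainer": {"num_workers": 4}})

# training and prediction are identical in all three cases, e.g.:
# train_stats, _, _ = model_ray4.train(dataset="mnist_subset.csv")
# predictions, _ = model_ray4.predict(dataset="mnist_test.csv")
```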
In case (c), increasing the number of epochs sometimes gives better results and sometimes the same constant predictions as before. The "horovod" strategy is slightly better than "ddp", and num_workers=2 gives better results than 4, but still not as good as (a) or (b). The strategy and worker count are varied as shown below.
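The variations above are passed through the ray backend's trainer section, roughly like this (field names assumed from the Ludwig distributed-training docs; the config dict is the same as in the earlier snippet):

```python
# e.g. "horovod" instead of the default "ddp", with 2 workers
backend_variant = {
    "type": "ray",
    "trainer": {"strategy": "horovod", "num_workers": 2},
}
# model_variant = LudwigModel(config, backend=backend_variant)
```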
Running on Kubernetes containers. Versions: python 3.8.16, ludwig 0.8.2, ray 2.3.1, torch 2.0.1, horovod 0.28.1.
Hi @vijayi1 – taking a look!