
Ray trained AlexNet model performance is worse than locally trained

Open vijayi1 opened this issue 1 year ago • 1 comment

I'm training an AlexNet encoder for an image classification problem. The model performs worse with the ray backend and num_workers=4 than when it is trained locally or with ray and num_workers=1.

I've created a test case based on the MNIST example. The test data consists of 300 MNIST images (100 images each of the digits 7, 8, and 9). I've attached the test program and its output for three setups:

- (a) local backend: the model correctly predicts 28 of 30.
- (b) ray with num_workers=1: the model correctly predicts 26 of 30.
- (c) ray with num_workers=4: the model correctly predicts 10 of 30, and the predictions appear to be constant.
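For anyone comparing the three runs, a small helper like the one below can make the "constant predictions" symptom explicit. It is only a sketch and not part of the attached test program: the function name `summarize_predictions` and the example data are hypothetical, and the example assumes a balanced 30-row test set (10 rows per digit), which the outputs above suggest but do not confirm.

```python
from collections import Counter


def summarize_predictions(predicted, actual):
    """Report accuracy and whether the model collapsed to a single class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no predictions to summarize")

    correct = sum(p == a for p, a in zip(predicted, actual))
    counts = Counter(predicted)
    return {
        "correct": correct,
        "total": len(actual),
        "accuracy": correct / len(actual),
        "prediction_counts": dict(counts),
        # True when every row received the same predicted class.
        "is_constant": len(counts) == 1,
    }


# Hypothetical data: a balanced test set and a model that always predicts 7.
actual = [7] * 10 + [8] * 10 + [9] * 10
predicted = [7] * 30

summary = summarize_predictions(predicted, actual)
print(summary)
```

On a balanced three-class test set, a model that always predicts one class scores exactly one third, so an accuracy near 10 of 30 together with `is_constant` being true points to a collapsed model rather than a merely under-trained one.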

In case (c), increasing the number of epochs sometimes gives better results and at other times makes no difference. "horovod" is slightly better than "ddp". num_workers=2 gives better results than num_workers=4, but not as good as (a) or (b).
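The setups compared above should differ only in the `backend` section of the Ludwig config. The sketch below builds those variants as plain dictionaries so they can be diffed side by side. It does not reproduce the attached `mnist_alexnet.py.txt`: the column names `image_path` and `label`, the epoch count, and the helper `with_backend` are hypothetical, and the key layout (`backend.trainer.num_workers`, `backend.trainer.strategy`) reflects my understanding of the Ludwig config schema and should be checked against the documentation for ludwig 0.8.2.

```python
import copy

# Shared model definition; column names and epoch count are placeholders.
BASE_CONFIG = {
    "input_features": [
        {"name": "image_path", "type": "image", "encoder": {"type": "alexnet"}}
    ],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {"epochs": 5},
}


def with_backend(base, backend_type, num_workers=None, strategy=None):
    """Return a copy of `base` with only the backend section changed."""
    config = copy.deepcopy(base)
    backend = {"type": backend_type}
    if backend_type == "ray":
        trainer = {}
        if num_workers is not None:
            trainer["num_workers"] = num_workers
        if strategy is not None:
            trainer["strategy"] = strategy
        backend["trainer"] = trainer
    config["backend"] = backend
    return config


variants = {
    "a_local": with_backend(BASE_CONFIG, "local"),
    "b_ray_1_worker": with_backend(BASE_CONFIG, "ray", num_workers=1),
    "c_ray_4_workers_ddp": with_backend(BASE_CONFIG, "ray", num_workers=4, strategy="ddp"),
    "c_ray_4_workers_horovod": with_backend(BASE_CONFIG, "ray", num_workers=4, strategy="horovod"),
}

for name, config in variants.items():
    print(name, config["backend"])
```

Each variant could then be passed to `ludwig.api.LudwigModel(config=...)` and trained on the same CSV, which keeps everything except the backend fixed between runs.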

Running on Kubernetes containers. Versions: python 3.8.16, ludwig 0.8.2, ray 2.3.1, torch 2.0.1, horovod 0.28.1

mnist_alexnet.py.txt mnist2.csv

local_backend.txt ray_1_worker.txt ray_4_workers.txt

vijayi1 • Nov 27 '23 02:11

Hi @vijayi1, taking a look!

geoffreyangus • Dec 12 '23 20:12