(AWS) Rahul Huilgol

Results: 20 comments

Could you help us by collecting some debug logs?

This is probably the last step in an epoch. We save additional metrics at that point because Keras only provides them at the end of an epoch.

step 0: ['RMSprop/decay:0', 'RMSprop/iter:0', 'RMSprop/learning_rate:0', 'RMSprop/momentum:0', 'RMSprop/rho:0', 'accuracy', 'batch', 'dense/weights/dense/bias:0', 'dense/weights/dense/kernel:0', 'dense_1/weights/dense_1/bias:0', 'dense_1/weights/dense_1/kernel:0', 'loss']
step 10: ['accuracy', 'batch', 'dense/weights/dense/bias:0', 'dense/weights/dense/kernel:0', 'dense_1/weights/dense_1/bias:0', 'dense_1/weights/dense_1/kernel:0', 'loss']
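To illustrate the timing point above, here is a minimal sketch of a Keras callback (the `MetricLogger` name is hypothetical, not the actual hook implementation). Keras hands epoch-level metrics only to `on_epoch_end`, which is why a hook that wants to save them must attach them to the last step of the epoch:

```python
import tensorflow as tf

class MetricLogger(tf.keras.callbacks.Callback):
    """Hypothetical callback showing when Keras exposes which metrics."""

    def on_train_batch_end(self, batch, logs=None):
        # Per-batch logs carry only batch-level metrics (e.g. loss, accuracy).
        print("step", batch, sorted((logs or {}).keys()))

    def on_epoch_end(self, epoch, logs=None):
        # Epoch-level metrics appear only here, so they can only be saved
        # alongside the final step of the epoch.
        print("epoch", epoch, sorted((logs or {}).keys()))
```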

Did this get merged? I see two copies of the SagemakerSimulator class in the codebase.

The low_cpu_mem_usage flag is a Hugging Face feature that creates the model on the meta device (so no CPU memory is used for the initial weights) and then loads the state_dict directly, so that peak memory usage is not 2x...
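As a minimal sketch of how the flag is used (the GPT-J checkpoint id is taken from the discussion below; any model id would do):

```python
from transformers import AutoModelForCausalLM

# With low_cpu_mem_usage=True, transformers skips the full-size random
# initialization (weights start on the meta device) and loads the
# checkpoint's state_dict in directly, keeping peak host memory near
# 1x the model size instead of 2x.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    low_cpu_mem_usage=True,
)
```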

I am using the CUDA 11.3 stack, though, compared to yours.

I just tried to reproduce the issue in the DeepSpeed docker image with the snippet @RezaYazdaniAminabadi posted. The single-process run passed, but the multi-process run didn't. This is the script I used: https://github.com/microsoft/DeepSpeed/issues/2230#issuecomment-1219738438 DS...
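The linked comment has the actual script; the following is only a rough, hypothetical sketch of the shape of such a repro (the model id and the mp_size handling are assumptions, not the posted snippet):

```python
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load GPT-J and wrap it with DeepSpeed inference. mp_size=1 corresponds
# to the single-process run (passes); mp_size=2 shards the model across
# two ranks (the multi-process run that misbehaved).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```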

Created a separate issue for the low_cpu_mem_usage flag: https://github.com/microsoft/DeepSpeed/issues/2275. Let's use this one for tracking the multi-GPU correctness issue for GPT-J seen above.

Is there also any proposed standardization of APIs around partial checkpoints?