(AWS) Rahul Huilgol

Results: 20 comments

Could you help us by collecting some debug logs?

This is probably the last step in an epoch. We save additional metrics at that point because Keras only provides them at the end of an epoch.

step 0: ['RMSprop/decay:0', 'RMSprop/iter:0', 'RMSprop/learning_rate:0', 'RMSprop/momentum:0', 'RMSprop/rho:0', 'accuracy', 'batch', 'dense/weights/dense/bias:0', 'dense/weights/dense/kernel:0', 'dense_1/weights/dense_1/bias:0', 'dense_1/weights/dense_1/kernel:0', 'loss']
step 10: ['accuracy', 'batch', 'dense/weights/dense/bias:0', 'dense/weights/dense/kernel:0', 'dense_1/weights/dense_1/bias:0', 'dense_1/weights/dense_1/kernel:0', 'loss']
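To illustrate the timing point above, here is a minimal sketch of a Keras callback (the `MetricLogger` name is hypothetical, not the actual hook implementation). Keras hands epoch-level metrics only to `on_epoch_end`, which is why a hook that wants to save them must attach them to the last step of the epoch:

```python
import tensorflow as tf

class MetricLogger(tf.keras.callbacks.Callback):
    """Hypothetical callback showing when Keras exposes which metrics."""

    def on_train_batch_end(self, batch, logs=None):
        # Per-batch logs carry only batch-level metrics (e.g. loss, accuracy).
        print("step", batch, sorted((logs or {}).keys()))

    def on_epoch_end(self, epoch, logs=None):
        # Epoch-level metrics appear only here, so they can only be saved
        # alongside the final step of the epoch.
        print("epoch", epoch, sorted((logs or {}).keys()))
```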

Did this get merged? I see two copies of the SagemakerSimulator class in the codebase.

The low_cpu_mem_usage flag is a Hugging Face feature that creates the model on the meta device (so no CPU memory is used for the initial weights) and then loads the state_dict directly, so that peak memory usage is not 2x...
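As a minimal sketch of how the flag is used (the GPT-J checkpoint id is taken from the discussion below; any model id would do):

```python
from transformers import AutoModelForCausalLM

# With low_cpu_mem_usage=True, transformers skips the full-size random
# initialization (weights start on the meta device) and loads the
# checkpoint's state_dict in directly, keeping peak host memory near
# 1x the model size instead of 2x.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    low_cpu_mem_usage=True,
)
```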

I am using the CUDA 11.3 stack, though, compared to yours.

I just tried to reproduce the issue in the DeepSpeed docker image with the snippet @RezaYazdaniAminabadi posted. The single-process run passed, but the multi-process run didn't. This is the script I used: https://github.com/microsoft/DeepSpeed/issues/2230#issuecomment-1219738438 DS...
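The linked comment has the actual script; the following is only a rough, hypothetical sketch of the shape of such a repro (the model id and the mp_size handling are assumptions, not the posted snippet):

```python
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load GPT-J and wrap it with DeepSpeed inference. mp_size=1 corresponds
# to the single-process run (passes); mp_size=2 shards the model across
# two ranks (the multi-process run that misbehaved).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```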

Created a separate issue for the low_cpu_mem_usage flag: https://github.com/microsoft/DeepSpeed/issues/2275. Let's use this one for tracking the multi-GPU correctness issue for GPT-J seen above.

Is there also any proposed standardization of APIs around partial checkpoints?