Surgan Jandial
@bionicles I think it is self-explanatory, except that this line https://github.com/pytorch/examples/blob/e0929a4253f9ae6ccdde24e787788a9955fdfe1c/dcgan/main.py#L232 might cause trouble.
How about putting the value of best_acc in shared memory during multiprocessing?
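A rough sketch of that suggestion, assuming a spawn-style worker (`train_worker` and `world_size` are illustrative names, not from the examples repo): `best_acc` is a multiprocessing `Value` living in shared memory, so every process reads and updates the same number instead of its own copy.

```python
import torch.multiprocessing as mp

def train_worker(rank, best_acc):
    # ... training / validation for this worker would go here ...
    acc = 0.0  # placeholder: accuracy this worker measured
    with best_acc.get_lock():        # guard the read-modify-write
        if acc > best_acc.value:
            best_acc.value = acc     # visible to every process

if __name__ == "__main__":
    world_size = 4
    ctx = mp.get_context("spawn")    # match the start method used by mp.spawn
    best_acc = ctx.Value("d", 0.0)   # 'd' = double, stored in shared memory
    mp.spawn(train_worker, args=(best_acc,), nprocs=world_size)
```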
@FedericOldani Check this line in the annotations.py of your PyTorch version: https://github.com/pytorch/pytorch/blob/44a607b90c9bba0cf268f833bae4715221346709/torch/jit/annotations.py#L33 It is probably missing.
Did you try increasing num_workers? Maybe something like 16?
What batch size are you using?
I had more or less the same problem, but increasing the batch size and num_workers did the trick for me.
I set the batch size to around 500 and num_workers to 16.
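For context, a minimal sketch of those settings on a plain DataLoader (the dummy TensorDataset just stands in for the real data; tune the actual numbers to your GPU memory and CPU count):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors stand in for the real dataset; the loader settings are the point.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32),
                        torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=500,   # larger batches cut per-batch Python overhead
    num_workers=16,   # parallel workers keep the GPU fed
    shuffle=True,
    pin_memory=True,  # faster host-to-GPU copies when training on CUDA
)

if __name__ == "__main__":
    for images, labels in loader:
        pass  # training step would go here
```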
Is this being worked on?
convert_checkpoint.py for MPT is not synced with the latest llm_foundry MPT model.