llama-recipes
Llama 2 distributed training on AWS SageMaker
Hi, I am planning to fine-tune Llama as a managed SageMaker training job distributed across multiple devices/nodes. SageMaker provides its own data-parallel and model-parallel distributed training libraries. Since SageMaker already takes care of distributed training, do I need to keep the current FSDP implementation in the llama fine-tuning script, or should I remove it? A sketch of one possible setup follows below.
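For reference, here is a minimal sketch of one of the options being asked about: keeping the recipes' FSDP code unchanged and using SageMaker only as the multi-node launcher via its `torch_distributed` (torchrun) distribution mode, rather than switching to the SageMaker model-parallel library. The entry point name, source directory, role ARN, instance type, framework versions, hyperparameters, and S3 URI below are illustrative assumptions, not a confirmed configuration.

```python
# Sketch only: launch the existing FSDP fine-tuning script as a SageMaker
# training job. SageMaker's torch_distributed launcher starts the process
# group across nodes; the script's own FSDP wrapping does the sharding.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="finetuning.py",           # assumed name of the recipes' fine-tuning script
    source_dir="path/to/llama-recipes",    # hypothetical local path to the recipes code
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role ARN
    instance_count=2,                      # multiple nodes; FSDP shards across all ranks
    instance_type="ml.p4d.24xlarge",       # assumed GPU instance type
    framework_version="2.1",               # assumed PyTorch version
    py_version="py310",
    # torchrun-style launcher provided by SageMaker; no smdistributed library needed
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={
        "enable_fsdp": True,               # keep the script's FSDP code path (assumed flag)
        "model_name": "meta-llama/Llama-2-7b-hf",
        "batch_size_training": 4,
    },
)

# Placeholder S3 URI for the training data channel
estimator.fit({"training": "s3://my-bucket/llama-finetune-data/"})
```

The design choice this sketch illustrates: SageMaker's data-parallel/model-parallel libraries and the script's FSDP implementation are alternative sharding strategies, so you would typically pick one; if FSDP is kept, SageMaker's role reduces to provisioning the cluster and launching the ranks.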