Matthew Nightingale
### Ask your question

Hi, I am hoping to understand the difference between the version reported by `dcgmi -v` and the version of `dcgm exporter` that should be used. I want to...
Running the script [3.test_cases/10.FSDP/1.distributed-training.sbatch](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/1.distributed-training.sbatch) on 2 p5 nodes, the job fails at the validation step after 500 batches. [slurm-47.log](https://github.com/aws-samples/awsome-distributed-training/files/15371088/slurm-47.log)

```
0: OSError: [Errno 12] Cannot allocate memory
```

**Configuration:**...
```
7: [rank80]: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
```

Running the FSDP example on 16 p5 nodes. The example worked with 8 nodes.
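A read timeout that appears only at the larger node count may mean that many ranks hit huggingface.co at the same moment. That is a guess, not a confirmed diagnosis of this issue. One generic mitigation is to retry the failing call with exponential backoff and jitter, sketched below. The helper `retry_with_backoff` and its parameters are hypothetical names for illustration and are not part of the FSDP example or of any Hugging Face library.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper: fn is any zero-argument callable, for example a
    closure around whichever call downloads from the remote host.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter so that ranks do not all retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

In a training script this would wrap only the network-bound step, for example `retry_with_backoff(lambda: load_fn(), retry_on=(OSError,))`, where `load_fn` stands in for the actual download call. Narrowing `retry_on` to the exception types actually observed avoids masking unrelated failures.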