awsome-distributed-training
FSDP Example ReadTimeoutError
7: [rank80]: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
I hit this error when running the FSDP example on 16 p5 nodes. The same example worked with 8 nodes.
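A possible workaround, in case it helps anyone hitting the same thing: since the timeout only shows up at the larger node count, it may be that many ranks are contacting huggingface.co at once. Below is a minimal sketch of retrying the download with exponential backoff and jitter. `retry_with_backoff` and all of its parameters are hypothetical names made up for illustration; they are not part of the FSDP example or of any Hugging Face library, and I have not confirmed which call in the example actually triggers the request.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper: intended to wrap whatever call in the training
    script fetches files from the Hub.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            # Exponential backoff, capped at max_delay, with jitter so
            # that ranks do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Usage would be something like `retry_with_backoff(lambda: load_from_hub())`, where `load_from_hub` stands in for the actual loading call. Separately, the `read timeout=10` in the log appears to match the 10 second default that `huggingface_hub` uses for its `HF_HUB_ETAG_TIMEOUT` and `HF_HUB_DOWNLOAD_TIMEOUT` environment variables, so raising those on every node may also be worth trying.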