sergii-ivakhno-kidsloop

Results: 4 issues by sergii-ivakhno-kidsloop

I am getting a timeout after launching a Fargate cluster: `dask_cloudprovider.utils.timeout.TimeoutException: Failed to find scheduler ip address after 120 seconds.` I know this error has been reported before, but in my case...

bug

*Concise Description:* Torch does not find CUDA on a GPU instance in the official SageMaker training container *DLC image/dockerfile:* sudo docker pull 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker *Current behavior:* ``` sudo docker pull 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker sudo docker...

I have set up a Ray GPU cluster (g4dn.12xlarge workers) with `ray up config.yaml` and have installed all package dependencies within the Ray config. However, when I attempt distributed training I get `MisconfigurationException:...

**Describe the bug** Torch does not find CUDA on a GPU instance in the official SageMaker training container **To reproduce** ``` sudo docker pull 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker sudo docker run -it --entrypoint /bin/bash 709fa9395949...