Cannot connect to dalle when run in docker
Hello, thanks for sharing this wonderful project.
I ran into a problem: I tried to run it in Docker and access it locally. The Docker build and run process was smooth, but when I started the client and tried to access it locally, this error occurred: ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment at address 0.0.0.0:49336. Head or worker may be down. I checked the port and it appears to be the port of the dalle deployment: gateway/rep-0@60 adding connection for deployment dalle/heads/0 to grpc://0.0.0.0:49336
Any idea on how I could fix this? Thank you so much.
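For reference, a rough way to check whether the executor inside the container is still up (a sketch; the container id below is a placeholder, and it assumes the container was started with --gpus all):
docker ps                                     # confirm the dalle-flow container is still running
docker logs <container-id> 2>&1 | tail -n 50  # look for CUDA or out-of-memory errors from the dalle executor
docker exec -it <container-id> nvidia-smi     # confirm the GPU is visible inside the container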
Having the same issue on AWS Deep Learning AMI GPU PyTorch 1.11.0 (Amazon Linux 2) 20220328 (CUDA 11.6).
We recently fixed the Dockerfile in #20, you could give it a try.
This should solve the problem, as we have successfully run it on a p2.8xlarge. @jina-ai/engineering will share more details next week.
Hey @nthomsencph, do nvcc --version, nvidia-smi, and torch.cuda.device_count() print correct results both on the EC2 instance and inside the Docker image? (To get inside the Docker image and run the commands you can do docker run -it --entrypoint /bin/bash jinaai/dalle-flow.)
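Concretely, the checks could look something like this (a sketch; the image tag matches the build command further down, adjust if yours differs):
docker run -it --rm --gpus all --entrypoint /bin/bash jinaai/dalle-flow
# then, inside the container:
nvcc --version
nvidia-smi
python -c "import torch; print(torch.cuda.device_count())"   # use python3 if python is not on PATH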
Now it works 🔥
Rebooted the EC2 instance, ran docker system prune -a, pulled the repo, and followed the instructions again. Thanks!
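Roughly, the recovery steps as a sketch (assuming the prune was docker system prune and a fresh checkout of the repo):
sudo reboot                                          # reboot the EC2 instance
docker system prune -a                               # clear out stale images and build cache
git clone https://github.com/jina-ai/dalle-flow.git  # fresh checkout
# then follow the build/run instructions from the README (also quoted further down)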
Hi @nthomsencph, I'm having the same issue on a g5.xlarge. Which EC2 instance are you using? Which instructions did you follow to install the NVIDIA container toolkit for Docker? How did you install cuDNN 8 inside Docker?
Thanks!
Hi @spuliz. We sprang for an AWS Deep Learning AMI (one that comes with CUDA 11.6 and more; see above) to skip the hassle of configuring this.
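Whichever AMI you end up on, a quick way to confirm that Docker can see the GPU at all (a hedged sketch; the CUDA image tag is just an example, pick one matching your driver version):
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
# if this prints the GPU table, the NVIDIA container toolkit is set up correctly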
Thanks @nthomsencph, which EC2 instance did you use? I am having an issue with the Tesla K-series GPUs, as your AMI does not have the NVIDIA drivers for them already installed. The issue I am having is that I am not able to find an AMI with CUDA 11.6 installed.
We used a p1.large with a 16 GB GPU. Nothing more is necessary since we don't expect too many requests. The Deep Learning AMI we use for this has CUDA 11.6 preinstalled.
On honeymoon so that's all the help I can offer ☀️
Did you try building the Docker image and running it as a container? I just rebuilt and ran it without any issue.
https://github.com/jina-ai/dalle-flow#run-in-docker
# clone the repo and build the image with your local user and group IDs
git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow
docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .
# run with GPU access, expose the gateway port, and share the model cache with the host
docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
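If the container starts cleanly, the gateway should be reachable from the host on the published port; a simple connectivity check (host and port here just mirror the docker run command above):
nc -zv 127.0.0.1 51005   # should report the port as open once the Flow has finished loading the models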
I believe this issue has been resolved. Feel free to reopen if the problem occurs again.