
Cannot connect to dalle when run in docker

Open AstrocyteTaki opened this issue 2 years ago • 9 comments

Hello, thanks for sharing this wonderful project.

I ran into a problem when I tried to run it in Docker and access it locally. The Docker build and run went smoothly, but when I started the client and tried to connect, I got this error: ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment at address 0.0.0.0:49336. Head or worker may be down. I checked the port, and it appears to be the port of the dalle deployment, according to this log line: gateway/rep-0@60 adding connection for deployment dalle/heads/0 to grpc://0.0.0.0:49336

Any idea on how I could fix this? Thank you so much.
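The manual step described above (matching the address in the error against the "adding connection" log line) can be sketched in a few lines of Python. This is only an illustration: the two regular expressions are derived from the two messages quoted in this post, and `failing_deployments` is a hypothetical helper, not part of Jina or dalle-flow. Other versions may word their logs differently.

```python
import re

# Patterns based only on the two messages quoted in this thread (an assumption;
# other Jina versions may format these lines differently).
CONNECTION_RE = re.compile(
    r"adding connection for deployment (\S+) to grpc://(\S+?):(\d+)"
)
ERROR_RE = re.compile(r"deployment at address (\S+?):(\d+)")


def failing_deployments(log_text):
    """Return the names of deployments whose address appears in an error line."""
    # Map each (host, port) announced by the gateway to its deployment name.
    by_address = {}
    for name, host, port in CONNECTION_RE.findall(log_text):
        by_address[(host, int(port))] = name

    # Look up every address reported in a communication error.
    failed = []
    for host, port in ERROR_RE.findall(log_text):
        name = by_address.get((host, int(port)), "unknown")
        if name not in failed:
            failed.append(name)
    return failed


log = """
gateway/rep-0@60 adding connection for deployment dalle/heads/0 to grpc://0.0.0.0:49336
ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment at address 0.0.0.0:49336. Head or worker may be down.
"""
print(failing_deployments(log))  # ['dalle/heads/0']
```

Knowing which deployment is unreachable narrows the search: here it points at the dalle executor itself rather than at the gateway or the client.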

AstrocyteTaki avatar May 18 '22 03:05 AstrocyteTaki

Having the same issue on AWS Deep Learning AMI GPU PyTorch 1.11.0 (Amazon Linux 2) 20220328 (CUDA 116).

nthomsencph avatar May 18 '22 08:05 nthomsencph

We recently fixed this in the Dockerfile in #20; you could give it a try.

This should solve the problem, as we have successfully run it on a p2.8xlarge. @jina-ai/engineering will share more details next week.

hanxiao avatar May 21 '22 07:05 hanxiao

Hey @nthomsencph, do nvcc --version, nvidia-smi and torch.cuda.device_count() print correct results both on the EC2 instance and inside the Docker image? (To get a shell inside the Docker image and run the commands, you can use docker run -it --entrypoint /bin/bash jina-ai/dalle-flow)
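For anyone who wants to run the three checks in one go, here is a minimal sketch using only the Python standard library plus an optional PyTorch import. The helper names (`run_check`, `torch_gpu_count`, `gpu_report`) are made up for this example. It assumes `nvidia-smi -L` (list GPUs) and `nvcc --version` are the commands of interest, and it reports a tool as unavailable rather than failing when it is not on the PATH.

```python
import shutil
import subprocess


def run_check(command):
    """Run a command and return its output, or None if it is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def torch_gpu_count():
    """Return torch.cuda.device_count(), or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def gpu_report():
    """Collect the results of the three GPU checks in one dictionary."""
    return {
        "nvidia-smi": run_check(["nvidia-smi", "-L"]),
        "nvcc": run_check(["nvcc", "--version"]),
        "torch_gpu_count": torch_gpu_count(),
    }


if __name__ == "__main__":
    for name, value in gpu_report().items():
        print(f"{name}: {value if value is not None else 'not available'}")
```

Running the same script on the host and inside the container makes differences easy to spot. For example, a GPU count of 0 only inside the container suggests the GPUs were not passed through to it.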

alaeddine-13 avatar May 21 '22 09:05 alaeddine-13

Now it works 🔥

Rebooted the EC2 instance, ran docker prune -a, pulled the repo and followed the instructions. Thanks!

nthomsencph avatar May 22 '22 11:05 nthomsencph

Hi @nthomsencph, I'm having the same issue on a g5.xlarge. Which EC2 instance are you using? Which instructions did you follow to install the NVIDIA toolkit for Docker? How did you install cuDNN 8 inside Docker?

Thanks!

spuliz avatar May 23 '22 22:05 spuliz

Hi @spuliz. We sprung for an AWS Deep Learning AMI (one that comes with CUDA 11.6 and more; see above) to skip the hassle of configuring this.

nthomsencph avatar May 24 '22 02:05 nthomsencph

Thanks @nthomsencph, which EC2 instance did you use? I am having an issue with the Tesla K-series GPUs, as your AMI does not have the NVIDIA drivers already installed. The problem is that I am not able to find an AMI with CUDA 11.6 installed.

spuliz avatar May 26 '22 13:05 spuliz

We used a p1.large with a 16GB GPU. No more is necessary since we don't expect too many requests. The Deep Learning AMI we use for this has CUDA 11.6 preinstalled.

On honeymoon so that's all the help I can offer ☀️

nthomsencph avatar May 30 '22 08:05 nthomsencph

Did you try building the Docker image and running it as a container? I just rebuilt and ran it without any issue.

https://github.com/jina-ai/dalle-flow#run-in-docker

git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow

docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .

docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
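After starting the container, it can help to wait until the published port accepts connections before launching the client. The following is a small sketch using only the standard library. `wait_for_port` is a hypothetical helper, and the 300-second timeout is an arbitrary assumption, not a documented startup time.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing accepts the connection.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (51005 is the port published by the docker run command above):
#   if wait_for_port("localhost", 51005):
#       print("port is accepting connections")
#   else:
#       print("port did not open in time")
```

A successful connection only shows that something accepts TCP traffic on that port. Docker's port forwarding may accept connections before the service inside the container is ready, so a failure from the client after this check still points at the flow itself.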

hanxiao avatar Jun 11 '22 19:06 hanxiao

I believe this issue has been resolved. Feel free to reopen if the problem occurs again.

delgermurun avatar Oct 07 '22 13:10 delgermurun