Docker error
I have this problem running Docker with the command:
docker run -p 51005:51005 -v $HOME/.cache:/root/.cache --gpus all jinaai/dalle-flow
on an AWS g5.xlarge with the Deep Learning AMI (Amazon Linux 2) Version 61.2.
I have done a lot of experimenting and fixed many small problems, but now I'm not able to figure out how to move forward. In any case, thanks a lot for this wonderful project.
SERVER:
Done. 0:0:0 device count: 1
DEBUG dalle/rep-0@12 start listening on 0.0.0.0:60127 [05/23/22 08:25:29]
DEBUG dalle/rep-0@ 1 ready and listening [05/23/22 08:25:29]
╭───── 🎉 Flow is ready to serve! ──────╮
│ 🔗 Protocol    GRPC                   │
│ 🏠 Local       0.0.0.0:51005          │
│ 🔒 Private     172.17.0.2:51005       │
│ 🌍 Public      54.161.222.71:51005    │
╰───────────────────────────────────────╯
DEBUG gateway/rep-0@18 GRPC call failed with code [05/23/22 08:26:08]
StatusCode.UNAVAILABLE, retry attempt 1/3. Trying
next replica, if available.
DEBUG gateway/rep-0@18 GRPC call failed with code
StatusCode.UNAVAILABLE, retry attempt 1/3. Trying
next replica, if available.
DEBUG gateway/rep-0@18 GRPC call failed with code
StatusCode.UNAVAILABLE, retry attempt 2/3. Trying
next replica, if available.
DEBUG dalle/rep-0@12 recv DataRequest at / with id: [05/23/22 08:26:08]
3dfaebf6ef3e49a3977dd7dfe9eb6b27
DEBUG gateway/rep-0@18 GRPC call failed with code
StatusCode.UNAVAILABLE, retry attempt 2/3. Trying
next replica, if available.
DEBUG gateway/rep-0@18 GRPC call failed, retries exhausted
DEBUG gateway/rep-0@18 GRPC call failed, retries exhausted
ERROR gateway/rep-0@18 Error while getting responses from [05/23/22 08:26:08]
deployments: failed to connect to all addresses
|Gateway: Communication error with deployment at
address(es) 0.0.0.0:50029. Head or worker(s) may be
down.
CLIENT:
ERROR GRPCClient@6813 gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses |Gateway: [05/23/2022 08:26:08 AM]
Communication error with deployment at address(es) 0.0.0.0:50029. Head or worker(s) may be down.
The ongoing request is terminated as the server is not available or closed already.
I have the same issue 😞 I've spent 3 days trying to fix it, but with no luck.
Hey @giux78, are you sure the Docker container is able to load all the Executors? Make sure to give it enough resources; as a first step, you can try lowering the number of replicas for each Executor (see the sketch below).
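For illustration only, here is a rough sketch of how the number of replicas per Executor is controlled with Jina's Python Flow API; the Executor name and uses target below are hypothetical placeholders, not dalle-flow's actual configuration:

# Illustrative sketch only: lowering replicas per Executor in a Jina Flow.
# The name/uses values are placeholders, not dalle-flow's real setup.
from jina import Flow

f = Flow(port=51005).add(
    name='dalle',
    uses='jinahub+docker://SomeExecutor',  # placeholder Executor reference
    replicas=1,  # keep this low on a single-GPU machine with limited RAM
)

with f:
    f.block()  # serve until interrupted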
Seeing the same kind of issue on an AWS g5.xlarge instance running Deep Learning AMI (Ubuntu 18.04) Version 61.0. GPUs appear to be accessible from within Docker, CUDA versions all match (11.6), and rebuilding with various configuration changes has no effect.
@JoanFM Care to explain what you mean in more detail? The g5.xlarge should have enough resources, so if there's something else that needs to be configured, it's not clear from the README.
Edit: I'll also add that I've tried this with Amazon Linux 2 Deep Learning AMIs without luck as well. I wasn't able to take them as far as the Ubuntu image before running into issues, though, so I figured the Ubuntu image is more suitable.
Edit: Although it may not be useful for troubleshooting, I want to also add that installing DALLE-Flow directly on the server has also been a failure. This is true for both Ubuntu and Amazon Linux 2 AMIs, as well as p2 instances.
I'm no deep learning guru, but I've never had this much trouble with deep learning on AWS/Docker before, so I imagine there are some specifics that could be pointed out to make this process easier. Perhaps someone who has successfully set this up in AWS can update the docs with more specific instructions?
Did you try building the Docker image and running it as a container? I just rebuilt and ran it without any issue.
https://github.com/jina-ai/dalle-flow#run-in-docker
git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow
docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .
docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
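If the container builds and starts but requests still fail, one quick sanity check (an illustrative sketch; it assumes PyTorch is installed in the image, and the container ID is a placeholder) is to confirm that CUDA is visible from inside the running container:

# Run inside the running container, e.g. `docker exec -it <container-id> python3`.
# Checks whether CUDA is visible to PyTorch, not just to nvidia-smi.
import torch

print('CUDA available:', torch.cuda.is_available())
print('Device count  :', torch.cuda.device_count())
if torch.cuda.is_available():
    print('Device 0      :', torch.cuda.get_device_name(0))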
@hanxiao
Yup, tried it a number of times, double-checking that I'm doing everything correctly. I can get it to build and run fine, and the GPU appears to be accessible from within the running container (i.e., by using nvidia-smi in the container). But every time I try to communicate with it from a client (i.e., following the Google Colab example), I get some variation of:
CLIENT ERROR GRPCClient@6813 gRPC error: StatusCode.UNAVAILABLE failed to connect to all addresses |Gateway: [05/23/2022 08:26:08 AM] Communication error with deployment at address(es) 0.0.0.0:50029. Head or worker(s) may be down. The ongoing request is terminated as the server is not available or closed already.
(like the original issue creator). I noticed that the address it reports differs between runs, but if I recall correctly, the other addresses eventually appear before the whole thing quits. And just to be clear: it gets to the point where it appears to be running correctly, i.e., presenting 🎉 Flow is ready to serve! and showing the URLs. It's only once I make a request that it shows these errors. I wouldn't be surprised if there's some network error I'm missing, but I don't think there's much to do on my end other than exposing the 51005 port.
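For reference, a client request along the lines of the Colab notebook looks roughly like this (a sketch only; the prompt, num_images value, and server address are placeholders):

# Rough sketch of a dalle-flow client request, in the style of the Colab example.
# The prompt, num_images, and server address below are placeholders.
from docarray import Document

server_url = 'grpc://<ec2-public-ip>:51005'  # replace with your instance's address
prompt = 'an oil painting of a humanoid robot playing chess'

doc = Document(text=prompt).post(server_url, parameters={'num_images': 2})
print(len(doc.matches), 'candidate images returned')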
I'm sure I'm doing something incorrectly, but I've gone through the process of spinning up a new EC2 instance with a new EBS volume, building the Docker image, running it and getting an error, troubleshooting, and retrying with some small variation at least a dozen times (with different instance types, AMIs, EBS options, etc.), so I've exhausted my capacity to troubleshoot this thing further with regard to the infrastructure. I'd be willing to troubleshoot what's happening in the container to cause these errors instead, but without some specific direction I wouldn't have the time to open that can of worms.
Are there any more details you can provide on the setups you've had success with? I saw in the other issue that you had success with a p2.8xlarge instance (which I believe I tried as well), so knowing the details of such a setup would probably point out what I'm doing wrong. I think I've tried every Deep Learning/GPU AMI available for both Amazon Linux 2 and Ubuntu (with varying degrees of success), but if you're using a different one, please let me know.
Closing for now, as we are trying to provide an auto-built Docker image in the next few hours. Feel free to reopen the issue if the new image still doesn't work.

@AntonyLeons From the log, the start of the service was successful, and it looks like receiving, handling, and completing requests all went well. Did you get the returned result on the client side?
I run Docker on WSL2. There is an issue from the client side with the message that the server is not available or already closed.
[image] https://user-images.githubusercontent.com/40126750/175666510-088169e2-7f8c-4e4b-8b19-6490b33e8978.png
I don't have any issue connecting to 'grpc://dalle-flow.jina.ai:51005'.
[image] https://user-images.githubusercontent.com/40126750/175666140-be353e2a-3ef6-4e2d-94d7-169634cd6f54.png
And there is no indication of an error message on the server side.
[image] https://user-images.githubusercontent.com/40126750/175666010-12897ce2-662c-4fef-b409-981bce1eefb0.png
Yep, this is the error I got. It is caused by an out-of-memory issue, but it is system memory, not VRAM. This happened to me with 16 GB allocated; 24 GB seems to work for me, though it still complains about being out of memory.
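As a rough way to confirm how much system memory the container actually sees (an illustrative sketch; on Docker Desktop with WSL2 the limit is usually raised via the memory setting in %UserProfile%\.wslconfig, followed by wsl --shutdown):

# Illustrative check: how much RAM is visible inside the container / WSL2 VM.
# If this reports roughly 16 GiB, the WSL2/Docker memory limit likely needs raising.
with open('/proc/meminfo') as f:
    mem_total_kb = int(next(line for line in f if line.startswith('MemTotal')).split()[1])
print(f'Total memory visible: {mem_total_kb / 1024 ** 2:.1f} GiB')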
I am running it on an M40 with 24 GB of VRAM as a secondary graphics card, plus 128 GB of system memory. That should be sufficient, as no other tasks are taking place.
I'm running it in WDDM mode, and it appears correctly in the subsystem.

I think it was a mistake on my part. In the Python client on Windows, I had set the server address to 0.0.0.0, which is only a bind address and is not reachable as a local address, so the transmit failed. After changing it to 127.0.0.1, it works.

It took 6 minutes with M40

Using grpc://127.0.0.1:51005 instead of grpcs:// can also help in resolving network-protocol-related errors.
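To make this concrete, here is a minimal client-side sketch using the plain grpc:// scheme and the loopback address (the prompt is a placeholder; num_images is assumed to be an accepted parameter):

# Sketch only: connect over plain gRPC to the loopback address,
# rather than 0.0.0.0 or a grpcs:// URL.
from jina import Client, Document

c = Client(host='127.0.0.1', port=51005, protocol='grpc')
resp = c.post('/', inputs=Document(text='a placeholder prompt'), parameters={'num_images': 2})
print(len(resp), 'documents returned')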
I believe this issue has been resolved. Feel free to reopen if the problem occurs again.