clip-as-service
clip_server loses connection after running for a while
The server starts normally, but after running for a while it loses the connection.
DEBUG clip_t/rep-0@6022 start listening on 0.0.0.0:54630
DEBUG clip_t/rep-0@6019 ready and listening [10/28/22 23:32:24]
────────────────────────────────────── 🎉 Flow is ready to serve! ───────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│ ⛓ Protocol GRPC │
│ 🏠 Local 0.0.0.0:51000 │
│ 🔒 Private 192.168.31.58:51000 │
╰──────────────────────────────────────────╯
DEBUG Flow@6019 2 Deployments (i.e. 2 Pods) are running in this Flow [10/28/22 23:32:24]
DEBUG clip_t/rep-0@6022 got an endpoint discovery request [10/28/22 23:37:35]
DEBUG clip_t/rep-0@6022 recv DataRequest at /rank with id: 644c5f98f0034283bf9334718ec4295c
UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:178.) (raised from /opt/homebrew/lib/python3.9/site-packages/torchvision/transforms/functional.py:150)
DEBUG gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if [10/28/22 23:37:35]
available.
DEBUG gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if
available.
DEBUG gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 3/3. Trying next replica, if
available.
DEBUG gateway/rep-0/GatewayRuntime@6023 GRPC call failed, retries exhausted
DEBUG gateway/rep-0/GatewayRuntime@6023 resetting connection to 0.0.0.0:54630
ERROR gateway/rep-0/GatewayRuntime@6023 Error while getting responses from deployments: failed to connect to all addresses; last error:
UNKNOWN: Failed to connect to remote host: Connection refused |Gateway: Communication error with deployment clip_t at address(es)
{'0.0.0.0:54630'}. Head or worker(s) may be down.
pip3 show clip_server
Name: clip-server
Version: 0.8.0
Summary: Embed images and sentences into fixed-length vectors via CLIP
Home-page: https://github.com/jina-ai/clip-as-service
Author: Jina AI
Author-email: [email protected]
License: Apache 2.0
Location: /opt/homebrew/lib/python3.9/site-packages
Requires: ftfy, jina, open-clip-torch, prometheus-client, regex, torch, torchvision
Required-by:
Host: MacBook Pro M1
Because my Docker cluster has no Internet access, I downloaded ViT-B-32.pt locally and then uploaded it to the Docker cluster. However, the container in the cluster cannot find the model and still tries to download ViT, even though the file is present at the expected relative location in the cluster.
My problem is solved. The root cause was that the program could not find the model under the expected root path because the "root" environment variable had changed.
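For anyone hitting the same offline-download problem, here is a minimal sketch for pre-placing a downloaded checkpoint where the server looks for it. It assumes the default cache location ~/.cache/clip used by clip_server and a hypothetical source path; adjust both to your setup.
import os
import shutil

# Assumed default cache directory for clip_server model downloads.
cache_dir = os.path.expanduser('~/.cache/clip')
os.makedirs(cache_dir, exist_ok=True)

# Copy the pre-downloaded checkpoint so the server finds it and skips the download step.
shutil.copy('/path/to/ViT-B-32.pt', os.path.join(cache_dir, 'ViT-B-32.pt'))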
@learningpro How do you start the server? Via the local CLI (python -m clip_server) or via k8s?
@learningpro Could you provide more details on this problem? Like the YAML file you use, steps to reproduce, etc. Thanks!
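For reference, a sketch of one way to start the same Flow from Python, roughly equivalent to the python -m clip_server CLI but not necessarily identical to what it sets up. It assumes the default PyTorch executor that ships with clip_server (clip_server.executors.clip_torch.CLIPEncoder, as seen in the logs above) and port 51000.
from jina import Flow
from clip_server.executors.clip_torch import CLIPEncoder

# One executor named clip_t, served over gRPC on port 51000.
f = Flow(port=51000).add(name='clip_t', uses=CLIPEncoder)

with f:
    f.block()  # serve until interrupted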
@numb3r3 @ZiniuYu I am facing the same issue.
Machine: Mac M1 Pro
Command
python -m clip_server
❯ python3 -m clip_server
────────────────────────────────────────────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│ ⛓ Protocol GRPC │
│ 🏠 Local 0.0.0.0:51000 │
│ 🔒 Private 192.168.1.47:51000 │
│ 🌍 Public None:51000 │
╰──────────────────────────────────────────╯
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:31:35]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:31:39]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:32:15]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:32:21]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:32:23]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway: [11/12/22 12:35:01]
Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.
Client code
from clip_client import Client
client = Client('grpc://0.0.0.0:51000')
r = client.encode(['she smiled, with pain', 'https://clip-as-service.jina.ai/_static/favicon.png'])
print(r)
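As a quick connectivity check before digging into logs, the client's built-in profiler separates the network round trip from encoding time. This is a hedged suggestion: profile() should be available in recent clip_client releases, so verify it against your installed version.
from clip_client import Client

client = Client('grpc://0.0.0.0:51000')
# Sends a tiny request and prints a latency breakdown (client-server roundtrip vs. encoding).
client.profile()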
@kaushikb11, have you tried sending only text first and only images later?
Yes, it worked the first time with
r = client.encode(['she smiled, with pain'])
but not with two text strings
r = client.encode(['she smiled, with pain', 'what is pain?'])
It failed with a single image as well
r = client.encode(['https://clip-as-service.jina.ai/_static/favicon.png'])
What's the output for r = client.encode(['she smiled, with pain']) and for r = client.encode(['she smiled, with pain', 'what is pain?'])? I am wondering why they behaved differently. @kaushikb11
Hi @kaushikb11, what's your output of jina -vf?
Can you also try export JINA_LOG_LEVEL=DEBUG and rerun your code?
Also, which PyTorch version are you using? And are you running clip_server under Rosetta (x86)?
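One quick way to check whether the Python process is running natively or under Rosetta is to inspect the reported machine architecture (a small sketch using only the standard library):
import platform

# Prints 'arm64' for a native Apple Silicon interpreter,
# 'x86_64' when the interpreter is translated by Rosetta.
print(platform.machine())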
@ZiniuYu Here you go
jina -vf
- jina 3.11.0
- docarray 0.18.1
- jcloud 0.0.36
- jina-hubble-sdk 0.22.2
- jina-proto 0.1.13
- protobuf 3.20.3
- proto-backend python
- grpcio 1.47.2
- pyyaml 6.0
- python 3.8.15
- platform Darwin
- platform-release 22.1.0
- platform-version Darwin Kernel Version 22.1.0: Sun Oct 9 20:15:09 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T6000
- architecture arm64
- processor arm
- uid 55969664184872
- session-id 94442b94-63e7-11ed-ad67-32e773f3b228
- uptime 2022-11-14T12:12:58.912197
- ci-vendor (unset)
- internal False
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_LOCKS_ROOT (unset)
* JINA_K8S_ACCESS_MODES (unset)
* JINA_K8S_STORAGE_CLASS_NAME (unset)
* JINA_K8S_STORAGE_CAPACITY (unset)
@numb3r3 PyTorch versions
pip3 freeze | grep torch
open-clip-torch==2.4.1
torch==1.13.0
torchmetrics==0.10.2
torchvision==0.14.0
What's the output for r = client.encode(['she smiled, with pain']) and r = client.encode(['she smiled, with pain', 'what is pain?'])? I am wondering why they behaved differently.
@jemmyshin I have no idea. The first returned an embedding.
Let me know if I can help with anything else. FYI: the system is a Mac M1 Pro.
@kaushikb11 The environment looks legit.
Can you also please rerun everything with export JINA_LOG_LEVEL=DEBUG and paste the output here?
Traceback when I run the client
DEBUG GRPCClient@9945 connected to 0.0.0.0:51000 [11/14/22 12:50:55]
Traceback (most recent call last):
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 47, in _arg_wrapper
return func(*args, **kwargs)
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/clip_client/client.py", line 153, in _gather_result
results[r[:, 'id']][:, attribute] = r[:, attribute]
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/docarray/array/mixins/getitem.py", line 102, in __getitem__
elif isinstance(index[0], bool):
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "check.py", line 9, in <module>
r = client.encode(["She is in pain", "what's pain"])
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/clip_client/client.py", line 295, in encode
self._client.post(
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/mixin.py", line 271, in post
return run_async(
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/helper.py", line 1334, in run_async
return asyncio.run(func(*args, **kwargs))
File "/opt/homebrew/Cellar/[email protected]/3.8.15/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/homebrew/Cellar/[email protected]/3.8.15/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/mixin.py", line 262, in _get_results
async for resp in c._get_results(*args, **kwargs):
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/base/grpc.py", line 131, in _get_results
callback_exec(
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 83, in callback_exec
_safe_callback(on_done, continue_on_error, logger)(response)
File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 49, in _arg_wrapper
err_msg = f'uncaught exception in callback {func.__name__}(): {ex!r}'
AttributeError: 'functools.partial' object has no attribute '__name__'
Server side
python3 -m clip_server
⠋ Waiting ... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/0 -:--:--DEBUG gateway/rep-0/GatewayRuntime@9744 adding connection for deployment clip_t/heads/0 to grpc://0.0.0.0:65282 [11/14/22 12:49:53]
DEBUG gateway/rep-0/GatewayRuntime@9744 start server bound to 0.0.0.0:51000
DEBUG gateway/rep-0@9729 ready and listening [11/14/22 12:49:53]
⠼ Waiting clip_t... ━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━ 2/3 0:00:03DEBUG clip_t/rep-0@9743 <clip_server.executors.clip_torch.CLIPEncoder object at 0x13f4d1dc0> is successfully loaded! [11/14/22 12:49:57]
DEBUG clip_t/rep-0@9743 start listening on 0.0.0.0:65282
DEBUG clip_t/rep-0@9729 ready and listening [11/14/22 12:49:57]
────────────────────────────────────────────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│ ⛓ Protocol GRPC │
│ 🏠 Local 0.0.0.0:51000 │
│ 🔒 Private 10.191.62.138:51000 │
│ 🌍 Public None:51000 │
╰──────────────────────────────────────────╯
DEBUG Flow@9729 2 Deployments (i.e. 2 Pods) are running in this Flow [11/14/22 12:49:57]
DEBUG clip_t/rep-0@9743 got an endpoint discovery request [11/14/22 12:50:22]
DEBUG clip_t/rep-0@9743 recv DataRequest at /encode with id: 4a0fa5aa31ca493e9f316474cb5909a7
DEBUG clip_t/rep-0@9743 recv DataRequest at /encode with id: 1842d6550a384061bea12795770d5cf9 [11/14/22 12:50:25]
DEBUG clip_t/rep-0@9743 recv DataRequest at /encode with id: 766e528658bf4b6f81b0f5ce96631d7c [11/14/22 12:50:55]
DEBUG gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if [11/14/22 12:50:55]
available.
DEBUG gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if
available.
DEBUG gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 3/3. Trying next replica, if
available.
DEBUG gateway/rep-0/GatewayRuntime@9744 GRPC call failed, retries exhausted
DEBUG gateway/rep-0/GatewayRuntime@9744 resetting connection to 0.0.0.0:65282
ERROR gateway/rep-0/GatewayRuntime@9744 Error while getting responses from deployments: failed to connect to all addresses |Gateway:
Communication error with deployment clip_t at address(es) {'0.0.0.0:65282'}. Head or worker(s) may be down.
@ZiniuYu
@kaushikb11 So far we cannot reproduce your error on our side (with exactly the same environment, including an M1 Pro and the same jina, docarray, and PyTorch versions). We suspect this is an upstream issue related to the PyTorch installation. We need more time to verify it. Of course, any further feedback is welcome; I believe others in our community may also face this problem.
@numb3r3 Noted! Thanks. Do keep me updated if you have any progress.
Hello,
I am running into similar issues on different setups.
I am also running clip-as-service in gRPC mode, with clip-server running in a Docker container.
I have seen this issue in my development environment, where communication stops at some point. This time it stopped on a restart, which I have not seen as often.
In my other environments running on Kubernetes, every time this has happened I had to redeploy the containers to make them functional again. Do you have any clues as to the source of this problem? Could it be related to the management of sockets/communication channels? Is it possible that peppering the service with too many queued requests exhausts its connections? Let me know how I can help with the troubleshooting.
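In case it helps others while this is being investigated, a minimal client-side fallback could look like the sketch below. It only retries transient failures and does not address the underlying server state; the in-cluster address and retry parameters are hypothetical.
import time
from clip_client import Client

client = Client('grpc://clip-server:51000')  # hypothetical in-cluster address

def encode_with_retry(inputs, retries=3, delay=2.0):
    # Retry a few times so a transient gateway hiccup does not crash the caller.
    for attempt in range(1, retries + 1):
        try:
            return client.encode(inputs)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)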
Here is an error log upon restarting the container on Docker Desktop running on Windows:
Task exception was never retrieved
future: <Task finished name='Task-13' coro=<GatewayRequestHandler.handle_request.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1068, in task_wrapper
return await connection.send_discover_endpoint(
File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 377, in send_discover_endpoint
await self._init_stubs()
File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 353, in _init_stubs
available_services = await GrpcConnectionPool.get_available_services(
File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1390, in get_available_services
async for res in response:
File "/usr/local/lib/python3.9/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses
await self._raise_for_status()
File "/usr/local/lib/python3.9/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1677095795.838017200","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1677095795.838016400","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/jina/serve/runtimes/gateway/request_handling.py", line 68, in gather_endpoints
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:35]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:36]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
raise err
File "/usr/local/lib/python3.9/site-packages/jina/serve/runtimes/gateway/request_handling.py", line 60, in gather_endpoints
endpoints = await asyncio.gather(*tasks_to_get_endpoints)
File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1082, in task_wrapper
raise error
jina.excepts.InternalNetworkError: failed to connect to all addresses |Gateway: Communication error with deployment at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:38]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:41]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:43]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:45]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:50]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:56:53]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│ ⛓ Protocol GRPC │
│ 🏠 Local 0.0.0.0:9100 │
│ 🔒 Private 172.24.0.5:9100 │
│ 🌍 Public 23.233.181.148:9100 │
╰──────────────────────────────────────────╯
╭──────── 💎 Prometheus extension ─────────╮
│ 🔦 clip_t ...:9091 │
│ 🔦 gateway ...:9090 │
╰──────────────────────────────────────────╯
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:57:00]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
ERROR gateway/rep-0/GatewayRuntime@22 Error while getting [02/22/23 19:57:01]
responses from deployments: failed to connect to all
addresses |Gateway: Communication error with
deployment clip_t at address(es) {'0.0.0.0:64294'}.
Head or worker(s) may be down.
Hi @vincetrep,
What's your jina version (jina -vf)? Can you set the env JINA_LOG_LEVEL=debug and see if it prints any more info?
One possible reason is that you are running out of computing resources. We recently fixed an issue in Jina Core that affected health-check latency when a Flow is under heavy load in k8s; could you please upgrade to the latest jina and try again?
Could you also provide more details about how the communication stops, if possible? For example, does it stop while you are sending large requests, or while the Flow is idle? Anything that may help us debug or reproduce the issue is welcome.
Hi @ZiniuYu,
Apologies for the delayed answer. The error occurs after the service has been running for a certain period of time, while I am sending batches of mixed text and images to clip-as-service.
I am running the latest version of jina: 3.14.2.
Let's take the scenario where we're running out of computing resources. Should there be a recovery mechanism inside the container to recover from that state, via retries or otherwise?
At the moment, when this occurs, the container ends up in a state where it has lost connectivity and does not recover even once resources are freed.
If you want to replicate it, try a local setup on your machine and send a big batch of records (e.g., images) to encode, to put your container into a "resource exhausted" state.
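In case it is useful for anyone reproducing this, one way to keep individual requests small on the client side is to chunk the batch before encoding. This is only a sketch: the address assumes a local server and the chunk size is arbitrary.
import numpy as np
from clip_client import Client

client = Client('grpc://0.0.0.0:51000')

def encode_in_chunks(items, chunk_size=32):
    # Encode in small batches so one oversized request cannot exhaust the server.
    chunks = [
        client.encode(items[i:i + chunk_size])
        for i in range(0, len(items), chunk_size)
    ]
    return np.vstack(chunks)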