clip-as-service
Client hangs when the server is started from Python
Prerequisites
Please fill in by replacing [ ] with [x].
- [x] Are you running the latest bert-as-service?
- [x] Did you follow the installation and the usage instructions in README.md?
- [x] Did you perform a cursory search on existing issues?
System information

Running on Google Colab.
Description
I am using Python to start the server, as instructed in the README:
```python
from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer

args = get_args_parser().parse_args(['-model_dir', '/content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/',
                                     '-port', '5555',
                                     '-port_out', '5556',
                                     '-max_seq_len', 'NONE',
                                     '-mask_cls_sep',
                                     '-num_worker', '1'])  # the worker count was cut off in the original post; '1' is a placeholder
server = BertServer(args)
server.start()
```
and this is the response I get:
```
I:VENTILATOR:[__i:__i: 66]:freeze, optimize and export graph, could take a while...
I:GRAPHOPT:[gra:opt: 52]:model config: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_config.json
I:GRAPHOPT:[gra:opt: 55]:checkpoint: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_model.ckpt
I:GRAPHOPT:[gra:opt: 59]:build graph...
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: [Errno 10] No child processes. joblib will operate in serial mode
  warnings.warn('%s. joblib will operate in serial mode' % (e,))
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
I:GRAPHOPT:[gra:opt:128]:load parameters from checkpoint...
I:GRAPHOPT:[gra:opt:132]:optimize...
I:GRAPHOPT:[gra:opt:140]:freeze...
I:GRAPHOPT:[gra:opt:145]:write graph to a tmp file: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:__i: 74]:optimized graph is stored at: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:_ru:128]:bind all sockets
I:VENTILATOR:[__i:_ru:132]:open 8 ventilator-worker sockets
I:VENTILATOR:[__i:_ru:135]:start the sink
I:SINK:[__i:_ru:303]:ready
```
and calling the server via:
```python
from bert_serving.client import BertClient

bc = BertClient(port=18888, port_out=18889, timeout=10000)
bc.encode(['First do it', 'then do it right', 'then do it better'])
```
Nothing happens; it hangs there forever!
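Worth noting: the server snippet above binds `-port 5555` and `-port_out 5556`, while this client targets 18888/18889. Below is a minimal sketch with matching ports, assuming the client runs in the same Colab runtime as the server and no port forwarding is in place:

```python
from bert_serving.client import BertClient

# Use the same ports the server was started with (-port / -port_out above).
bc = BertClient(ip='localhost', port=5555, port_out=5556, timeout=10000)
print(bc.encode(['First do it', 'then do it right', 'then do it better']).shape)
```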
Observing a similar issue on an AWS p3 instance. The service freezes after receiving a request; apparently it hangs on some threading issue:
```
I:VENTILATOR:[__i:_ru:216]:terminated!
Traceback (most recent call last):
  File "/home/hadoop/venv/bin/bert-serving-start", line 10, in <module>
    sys.exit(main())
  File "/home/hadoop/venv/local/lib/python3.6/dist-packages/bert_serving/server/cli/__init__.py", line 5, in main
    server.join()
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
```
Still haven't figured out why, but apparently this problem is transient and only happens at start-up time. If the first encoding task goes through, then everything else goes through.
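If the hang really is limited to the first request after start-up, setting a client timeout and retrying the first encode is one way to avoid blocking forever. A minimal sketch (the 10 s timeout and 5 retries are arbitrary choices, not values from this thread):

```python
import time
from bert_serving.client import BertClient

bc = BertClient(timeout=10000)  # encode() raises TimeoutError instead of hanging forever

for attempt in range(5):
    try:
        bc.encode(['warm-up sentence'])  # once the first request succeeds, later ones reportedly go through
        break
    except TimeoutError:
        time.sleep(2)  # give the server a moment, then retry
```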
I'm having the same problem, but I can't make the first encoding task go through. I'm looking to use Colaboratory as a one-time thing for this project, just to generate a large amount of sentence embeddings, since I have no access to a GPU.
I have tried setting the `ignore_all_checks=True` parameter when starting the client, but that doesn't work either (it still gets stuck when I try to `.encode`).
Any help would be really appreciated! Thanks.
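For reference, a minimal sketch of the `ignore_all_checks` attempt mentioned above; note it only skips the on-connect sanity checks, so `encode` can still block if the server never replies:

```python
from bert_serving.client import BertClient

# Skip the on-connect version/length checks; a finite timeout at least
# turns an indefinite hang into a TimeoutError.
bc = BertClient(ignore_all_checks=True, timeout=10000)
bc.encode(['some sentence'])
```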
I've noticed the same issue; it was quite easy to recreate on a Google Cloud instance. Here are the details I was using, which may help reproduce:
- [x] Are you running the latest bert-as-service? (v1.9.1)
- [x] Did you follow the installation and the usage instructions in README.md?
- [x] Did you check the FAQ list in README.md?
- [x] Did you perform a cursory search on existing issues?
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux; kernel version #1 SMP Debian 4.9.168-1+deb9u2 (2019-05-13); release 4.9.0-9-amd64; platform Linux-4.9.0-9-amd64-x86_64-with-debian-9.9; distribution ('debian', '9.9', '')
- TensorFlow installed from (source or binary): binary (via pip), with numpy (1.16.4), protobuf (3.8.0), tensorflow (1.13.1), tensorflow-estimator (1.13.0)
- TensorFlow version: 1.13.1
- Python version: 3.5
- bert-as-service version: 1.9.1
- GPU model and memory: (see the nvidia-smi output below)
- CPU model and memory:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
```
After I start the instance and send it a few queries, all further client requests hang. I suspect a race condition in the server.
I tried v1.8.1 and found the same issue: after a few queries and a minute or so, the server becomes unresponsive, and neither the old client nor new clients ever get responses from the encode function.
The last logs from the server look like so:
```
I:SINK:[__i:_ru:312]:job register size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:gen:492]:new job socket: 0 size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:_ru:468]:job done size: (1, 768) client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:VENTILATOR:[__i:_ru:164]:new encode request req id: 11 size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218'
I:SINK:[__i:_ru:312]:job register size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:WORKER-0:[__i:gen:492]:new job socket: 0 size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
```
Occasionally, on the client side, instead of just hanging, I see errors like this (the client is running locally using Python 3.7, while the server is using Python 3.5):
```
Traceback (most recent call last):
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2328, in __call__
    return self.wsgi_app(environ, start_response)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2314, in wsgi_app
    response = self.handle_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1760, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "Users/iislucas/index-server/index_server.py", line 42, in embedding
    embedding = bc.encode([obj['text']])
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 202, in arg_wrapper
    return func(self, *args, **kwargs)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 283, in encode
    r = self._recv_ndarray(req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 166, in _recv_ndarray
    request_id, response = self._recv(wait_for_req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 160, in _recv
    raise e
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 150, in _recv
    request_id = int(response[-1])
ValueError: invalid literal for int() with base 10: b'{"shape":[1,768],"dtype":"float32","tokens":""}
```
And the corresponding log on the server is:
```
I:SINK:[__i:_ru:312]:job register size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:gen:492]:new job socket: 0 size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:_ru:468]:job done size: (1, 768) client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
```
Are requests somehow getting out of sync?
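One pattern consistent with the interleaved job ids above (the worker finishes #11 while #12 is being registered, and the client then misparses the reply) would be a single BertClient shared across Flask worker threads. Purely as an illustration, not a confirmed fix, here is a sketch that gives each thread its own client; the helper names are made up:

```python
import threading
from bert_serving.client import BertClient

_local = threading.local()

def get_client():
    # One BertClient per thread, so concurrent Flask requests cannot
    # interleave replies on a single shared client socket.
    if not hasattr(_local, 'bc'):
        _local.bc = BertClient(port=5555, port_out=5556, timeout=10000)
    return _local.bc

def embed(text):
    return get_client().encode([text])
```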
I have a similar issue. If I terminate the server, I also get the same threading exception. As for the sample code that triggers this hang: if you cancel encode with Ctrl+C and rerun it, you can see that the SINK sent back the result of the first request, but the new encode hangs again.
Could you folks check which version of libzmq your machine is using while encountering this issue?
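For anyone reporting back, one quick way to print both the libzmq and pyzmq versions from the environment the client/server run in:

```python
import zmq

print('libzmq:', zmq.zmq_version())    # version of the underlying libzmq library
print('pyzmq: ', zmq.pyzmq_version())  # version of the Python bindings
```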
I'm having the same issue. Was anybody able to find a solution to this?
My workaround is to wait for the server to fully boot up and add a delay of ~10 s. The issue never appears again.
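A minimal sketch of that workaround as I read it, with the delay placed between starting the server and creating the first client; the model path is a placeholder and the 10 s figure is just the value mentioned above:

```python
import time
from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer
from bert_serving.client import BertClient

args = get_args_parser().parse_args(['-model_dir', '/path/to/bert/model',  # placeholder path
                                     '-num_worker', '1'])
server = BertServer(args)
server.start()

time.sleep(10)  # let the workers finish loading before sending the first request

bc = BertClient(timeout=10000)
print(bc.encode(['hello world']).shape)
```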
@wliu-sift Thanks for the response. Where do you add the delay? On the client side?
Hi team, I am also getting the same issue. Do you have any update on this?
Same issue here. I'm wondering if it's keeping sockets open; the socket list seems to grow pretty quickly and doesn't come back down. Here's the health check from right before it went down:
{"ckpt_name":"bert_model.ckpt","client":"1b8cb6cd-3b2f-4ade-80f8-e8eb01298c14","config_name":"bert_config.json","cors":"*","cpu":false,"device_map":[],"fixed_embed_length":false,"fp16":false,"gpu_memory_fraction":0.5,"graph_tmp_dir":null,"http_max_connect":10,"http_port":8080,"mask_cls_sep":false,"max_batch_size":256,"max_seq_len":25,"model_dir":"models/uncased_L-12_H-768_A-12","num_concurrent_socket":30,"num_process":17,"num_worker":15,"pooling_layer":[-2],"pooling_strategy":2,"port":5555,"port_out":5556,"prefetch_size":10,"priority_batch_size":16,"python_version":"3.6.3 (default, Jul 9 2019, 08:50:08) \n[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]","pyzmq_version":"19.0.1","server_current_time":"2020-05-18 13:32:43.888630","server_start_time":"2020-05-18 13:15:37.862617","server_version":"1.8.9","show_tokens_to_client":false,"statistic":{"avg_last_two_interval":167.3810899715,"avg_request_per_client":12.0,"avg_request_per_second":0.034329692637323224,"avg_size_per_request":2.0,"max_last_two_interval":517.0026954719999,"max_request_per_client":12,"max_request_per_second":0.10548447649600348,"max_size_per_request":3,"min_last_two_interval":9.480067903999952,"min_request_per_client":12,"min_request_per_second":0.0019342258923564133,"min_size_per_request":1,"num_active_client":0,"num_data_request":4,"num_max_last_two_interval":1,"num_max_request_per_client":1,"num_max_request_per_second":1,"num_max_size_per_request":1,"num_min_last_two_interval":1,"num_min_request_per_client":1,"num_min_request_per_second":1,"num_min_size_per_request":1,"num_sys_request":8,"num_total_client":1,"num_total_request":12,"num_total_seq":8},"status":200,"tensorflow_version":["1","11","0"],"tuned_model_dir":null,"ventilator -> worker":["ipc://tmp3fYAAK/socket","ipc://tmpG06X52/socket","ipc://tmpAZqmBl/socket","ipc://tmpYIBL6D/socket","ipc://tmp8EubCW/socket","ipc://tmpzm5B7e/socket","ipc://tmpZmr3Cx/socket","ipc://tmpAWuv8P/socket","ipc://tmpJWeYD8/socket","ipc://tmpPGVr9q/socket","ipc://tmpLelWEJ/socket","ipc://tmpBTtra2/socket","ipc://tmpfmwXFk/socket","ipc://tmpFQ0ubD/socket","ipc://tmpeKZ3GV/socket","ipc://tmp0nSDce/socket","ipc://tmpuJCeIw/socket","ipc://tmpYDhQdP/socket","ipc://tmppWMsJ7/socket","ipc://tmpQO75eq/socket","ipc://tmpLgnKKI/socket","ipc://tmpAKzpg1/socket","ipc://tmpDHC5Lj/socket","ipc://tmphWYMhC/socket","ipc://tmpPjGvNU/socket","ipc://tmpTikfjd/socket","ipc://tmpYdRZOv/socket","ipc://tmpZIfLkO/socket","ipc://tmpUovxQ6/socket","ipc://tmp8TDkmp/socket"],"ventilator <-> sink":"ipc://tmpcsCe5r/socket","verbose":false,"worker -> sink":"ipc://tmpKaNt2G/socket","xla":false,"zmq_version":"4.3.2"}
I had the same issue. It is perplexing: if I open two client programs, only one of them gets stuck. If I send query1 from client1, it gets stuck; then when I send query2 from client2, client1 receives the result and client2 gets stuck. Very similar to what 4everlove described in https://github.com/hanxiao/bert-as-service/issues/387.