
Client hangs .. python server

Open nstfk opened this issue 5 years ago • 12 comments


System information: running on Google Colab

Description

I'm starting the server from Python, as instructed in the README:

from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer

args = get_args_parser().parse_args(['-model_dir', '/content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/',
                                     '-port', '5555',
                                     '-port_out', '5556',
                                     '-max_seq_len', 'NONE',
                                     '-mask_cls_sep',
                                     '-num_worker', '1'])  # the num_worker value was cut off in the original paste; '1' assumed here

server = BertServer(args)
server.start()

and this is the response I get:

I:VENTILATOR:[__i:__i: 66]:freeze, optimize and export graph, could take a while...
I:GRAPHOPT:[gra:opt: 52]:model config: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_config.json
I:GRAPHOPT:[gra:opt: 55]:checkpoint: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_model.ckpt
I:GRAPHOPT:[gra:opt: 59]:build graph...
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: [Errno 10] No child processes.  joblib will operate in serial mode
  warnings.warn('%s.  joblib will operate in serial mode' % (e,))

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

I:GRAPHOPT:[gra:opt:128]:load parameters from checkpoint...
I:GRAPHOPT:[gra:opt:132]:optimize...
I:GRAPHOPT:[gra:opt:140]:freeze...
I:GRAPHOPT:[gra:opt:145]:write graph to a tmp file: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:__i: 74]:optimized graph is stored at: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:_ru:128]:bind all sockets
I:VENTILATOR:[__i:_ru:132]:open 8 ventilator-worker sockets
I:VENTILATOR:[__i:_ru:135]:start the sink
I:SINK:[__i:_ru:303]:ready 


and calling the server via:

from bert_serving.client import BertClient
bc = BertClient(port=18888, port_out=18889, timeout=10000)
bc.encode(['First do it', 'then do it right', 'then do it better'])

Nothing happens, it hangs there forever!
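For what it's worth, the server above binds -port 5555 and -port_out 5556, while the client connects to 18888/18889. Below is a minimal sketch of a client pointed at the ports the server actually opened (port values taken from the start-up command above; the ip is an assumption for a client running on the same machine):

from bert_serving.client import BertClient

# connect to the same ports the server was started with (5555 / 5556)
bc = BertClient(ip='localhost', port=5555, port_out=5556, timeout=10000)
vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
print(vecs.shape)  # expected (3, 768) for a 12-layer, 768-dimensional BERT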

nstfk avatar Mar 30 '19 21:03 nstfk

Observing a similar issue on an AWS p3 instance. The service freezes after receiving a request; it appears to hang on some threading issue.

I:VENTILATOR:[__i:_ru:216]:terminated!
Traceback (most recent call last):
  File "/home/hadoop/venv/bin/bert-serving-start", line 10, in <module>
    sys.exit(main())
  File "/home/hadoop/venv/local/lib/python3.6/dist-packages/bert_serving/server/cli/__init__.py", line 5, in main
    server.join()
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Still haven't figured out why, but apparently this problem is only transient and only happens at start-up time. If the first encoding task goes through, then everything else goes through.

wliu-sift avatar Apr 06 '19 15:04 wliu-sift

I'm having the same problem, but I can't make the first encoding task go through. I'm looking to use Colaboratory as a one-time thing for this project, just to generate a large number of sentence embeddings, since I have no access to a GPU.

I have tried setting the ignore_all_checks=True parameter when starting the client, but that doesn't work either (it gets stuck when I try to .encode).
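For reference, a minimal sketch of that client setup as I understand it (the timeout value is an arbitrary placeholder; with a timeout set, a hang should surface as a timeout error instead of blocking forever):

from bert_serving.client import BertClient

# skip the client/server handshake checks and give encode a bounded wait
bc = BertClient(ignore_all_checks=True, timeout=30000)  # timeout in milliseconds
print(bc.encode(['hello world']).shape)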

Any help would be really appreciated! Thanks.

FredericoCoelhoNunes avatar Jun 06 '19 17:06 FredericoCoelhoNunes

I've noticed the same issue; it was quite easy to recreate on a Google Cloud instance. Here are the details I was using, which may help reproduce:

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux (Debian 9.9); kernel: #1 SMP Debian 4.9.168-1+deb9u2 (2019-05-13); release: 4.9.0-9-amd64; platform: Linux-4.9.0-9-amd64-x86_64-with-debian-9.9

  • TensorFlow installed from (source or binary): binary (via pip); pip-installed packages: numpy 1.16.4, protobuf 3.8.0, tensorflow 1.13.1, tensorflow-estimator 1.13.0

  • TensorFlow version: 1.13.1

  • Python version: 3.5

  • bert-as-service version: 1.9.1

  • GPU model and memory:

  • CPU model and memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

After I start the instance and send it a few queries, all further client requests hang. I suspect a race condition in the server.

iislucas avatar Jun 14 '19 12:06 iislucas

I tried v1.8.1 and found the same issue: after a few queries, and a minute or so, the server becomes unresponsive, and neither the old client nor new clients ever get responses from the encode function.

The last logs from the server look like so:

I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:_ru:468]:job done       size: (1, 768)  client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back  size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:VENTILATOR:[__i:_ru:164]:new encode request       req id: 11      size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218'
I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'

Occasionally, on the client side, instead of just hanging, I see errors like this (the client is running locally using python 3.7, while server is using python 3.5):

Traceback (most recent call last):
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2328, in __call__
    return self.wsgi_app(environ, start_response)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2314, in wsgi_app
    response = self.handle_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1760, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "Users/iislucas/index-server/index_server.py", line 42, in embedding
    embedding = bc.encode([obj['text']])
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 202, in arg_wrapper
    return func(self, *args, **kwargs)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 283, in encode
    r = self._recv_ndarray(req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 166, in _recv_ndarray
    request_id, response = self._recv(wait_for_req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 160, in _recv
    raise e
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 150, in _recv
    request_id = int(response[-1])
ValueError: invalid literal for int() with base 10: b'{"shape":[1,768],"dtype":"float32","tokens":""}

And the corresponding log on the server is:

I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:_ru:468]:job done       size: (1, 768)  client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back  size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'

Are requests somehow getting out of sync?
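Not a fix, but a sketch of a crude guard around encode while this is investigated: with a timeout set on the client, a hung or out-of-sync request can be retried instead of blocking the Flask worker forever (ports, retry counts, and the exception handling are placeholders/assumptions, not a cure for the underlying sync problem):

import time
from bert_serving.client import BertClient

bc = BertClient(port=5555, port_out=5556, timeout=10000)

def encode_with_retry(texts, retries=3, backoff=2.0):
    # retry on a timeout or on the ValueError above, where a metadata frame
    # gets parsed as a request id; this only papers over the issue
    for attempt in range(retries):
        try:
            return bc.encode(texts)
        except (TimeoutError, ValueError):
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError('encode failed after %d attempts' % retries)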

iislucas avatar Jun 14 '19 13:06 iislucas

I have a similar issue. If I terminate the server, I also get the same exception from threading. As for the sample code that triggers this hang: if you cancel encode with Ctrl+C and rerun it, you can see the SINK send back the result of the first request, but the new encode hangs again.

4everlove avatar Jun 21 '19 05:06 4everlove

Could you folks check which version of libzmq your machine is using while encountering this issue?
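For reporting back, both the bound libzmq version and the pyzmq version can be read directly from the pyzmq package:

import zmq

print('libzmq version:', zmq.zmq_version())
print('pyzmq version:', zmq.pyzmq_version())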

4everlove avatar Jun 21 '19 16:06 4everlove

I'm having the same issue. Was anybody able to find a solution to this?

asankasan avatar Mar 03 '20 10:03 asankasan

My workaround is to wait for the server to fully boot up and add a delay of ~10 s. The issue never appears again.
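In code, that workaround is just a pause between starting the server and creating the first client; a sketch assuming ~10 s is enough on the machine in question (the model path is a placeholder):

import time
from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer
from bert_serving.client import BertClient

args = get_args_parser().parse_args(['-model_dir', '/path/to/bert_model/',
                                     '-num_worker', '1'])
server = BertServer(args)
server.start()

time.sleep(10)  # let the workers finish loading the graph before the first encode

bc = BertClient(timeout=10000)
bc.encode(['warm-up request'])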

wliu-sift avatar Mar 03 '20 16:03 wliu-sift

@wliu-sift Thanks for the response. To where do you add the delay? To the client?

asankasan avatar Mar 04 '20 04:03 asankasan

Hi team, I am also getting the same issue - do you have any update on this?

mjangid avatar May 05 '20 15:05 mjangid

Same issue here. Wondering if it's keeping sockets open - the socket list seems to grow pretty quickly and doesn't come back down. Here's the health check from right before it went down:

{"ckpt_name":"bert_model.ckpt","client":"1b8cb6cd-3b2f-4ade-80f8-e8eb01298c14","config_name":"bert_config.json","cors":"*","cpu":false,"device_map":[],"fixed_embed_length":false,"fp16":false,"gpu_memory_fraction":0.5,"graph_tmp_dir":null,"http_max_connect":10,"http_port":8080,"mask_cls_sep":false,"max_batch_size":256,"max_seq_len":25,"model_dir":"models/uncased_L-12_H-768_A-12","num_concurrent_socket":30,"num_process":17,"num_worker":15,"pooling_layer":[-2],"pooling_strategy":2,"port":5555,"port_out":5556,"prefetch_size":10,"priority_batch_size":16,"python_version":"3.6.3 (default, Jul 9 2019, 08:50:08) \n[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]","pyzmq_version":"19.0.1","server_current_time":"2020-05-18 13:32:43.888630","server_start_time":"2020-05-18 13:15:37.862617","server_version":"1.8.9","show_tokens_to_client":false,"statistic":{"avg_last_two_interval":167.3810899715,"avg_request_per_client":12.0,"avg_request_per_second":0.034329692637323224,"avg_size_per_request":2.0,"max_last_two_interval":517.0026954719999,"max_request_per_client":12,"max_request_per_second":0.10548447649600348,"max_size_per_request":3,"min_last_two_interval":9.480067903999952,"min_request_per_client":12,"min_request_per_second":0.0019342258923564133,"min_size_per_request":1,"num_active_client":0,"num_data_request":4,"num_max_last_two_interval":1,"num_max_request_per_client":1,"num_max_request_per_second":1,"num_max_size_per_request":1,"num_min_last_two_interval":1,"num_min_request_per_client":1,"num_min_request_per_second":1,"num_min_size_per_request":1,"num_sys_request":8,"num_total_client":1,"num_total_request":12,"num_total_seq":8},"status":200,"tensorflow_version":["1","11","0"],"tuned_model_dir":null,"ventilator -> worker":["ipc://tmp3fYAAK/socket","ipc://tmpG06X52/socket","ipc://tmpAZqmBl/socket","ipc://tmpYIBL6D/socket","ipc://tmp8EubCW/socket","ipc://tmpzm5B7e/socket","ipc://tmpZmr3Cx/socket","ipc://tmpAWuv8P/socket","ipc://tmpJWeYD8/socket","ipc://tmpPGVr9q/socket","ipc://tmpLelWEJ/socket","ipc://tmpBTtra2/socket","ipc://tmpfmwXFk/socket","ipc://tmpFQ0ubD/socket","ipc://tmpeKZ3GV/socket","ipc://tmp0nSDce/socket","ipc://tmpuJCeIw/socket","ipc://tmpYDhQdP/socket","ipc://tmppWMsJ7/socket","ipc://tmpQO75eq/socket","ipc://tmpLgnKKI/socket","ipc://tmpAKzpg1/socket","ipc://tmpDHC5Lj/socket","ipc://tmphWYMhC/socket","ipc://tmpPjGvNU/socket","ipc://tmpTikfjd/socket","ipc://tmpYdRZOv/socket","ipc://tmpZIfLkO/socket","ipc://tmpUovxQ6/socket","ipc://tmp8TDkmp/socket"],"ventilator <-> sink":"ipc://tmpcsCe5r/socket","verbose":false,"worker -> sink":"ipc://tmpKaNt2G/socket","xla":false,"zmq_version":"4.3.2"}

bigrig2212 avatar May 18 '20 13:05 bigrig2212

I had the same issue. It is perplexing that if I open 2 client programs, only one of them gets stuck. If I send query1 from client1, it gets stuck; then when I send query2 from client2, client1 receives the result and client2 gets stuck. Very similar to what 4everlove described in https://github.com/hanxiao/bert-as-service/issues/387.

Yuiard avatar Apr 16 '21 13:04 Yuiard