hgong-snap

Results 18 comments of hgong-snap

Not sure if there's some corner case [here](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L272). `strided_slice` nodes are added to [outputs_to_values](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L218) with `progress=True`, and then the `strided_slice` nodes are deleted [here](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L272). Then the next loop starts because...
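To make the concern concrete, here's a self-contained toy version of the loop pattern I mean (not the actual tf2onnx code; only the `outputs_to_values` dict and the `progress` flag mirror the linked `tf_utils.py`, while the node model and the "unneeded" cleanup rule are made up for illustration):

```python
# Toy sketch of the constant-folding loop pattern discussed above (illustrative only).
class Node:
    def __init__(self, name, op, inputs, output):
        self.name, self.op, self.inputs, self.output = name, op, inputs, output

def fold(nodes, outputs_to_values, unneeded):
    progress = True
    while progress:
        progress = False
        for node in nodes:
            foldable = all(i in outputs_to_values for i in node.inputs)
            if node.op == "StridedSlice" and foldable and node.output not in outputs_to_values:
                outputs_to_values[node.output] = 0   # pretend we computed the sliced value
                progress = True                      # analogous to tf_utils.py#L218
        for name in list(outputs_to_values):
            if name in unneeded:
                del outputs_to_values[name]          # analogous to tf_utils.py#L272
        # Corner case: if a strided_slice output is added above and then deleted
        # here as "unneeded", the next iteration re-adds it and sets progress=True
        # again, so the loop never terminates.

# e.g. this call would loop forever:
# fold([Node("ss", "StridedSlice", ["c"], "ss:0")], {"c": 1}, {"ss:0"})
```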

Another example of a similar thing: worker `W-9000-easyocr` died, and GPU usage dropped to almost half (2 workers; 1 worker died/got stuck, so only 1 worker is functioning). If I ssh into the...

With that said: 1. I can't tell whether a worker is in a dead/stuck state or not. 2. Even if I know a worker has died, I can't tell which...

Had quite a few similar issues recently, and I think in general all of the issues seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe from the client? or...

Hi @msaroufim here's the mar file and config file:

mar file: https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing

config.properties:
```
service_envelope=body
grpc_inference_port=17100
grpc_management_port=17101
inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202
default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
  "easyocr": {\...
```

For repro, I guess the best way is to cancel the request on the client side? Or set a short-enough gRPC timeout so that gRPC will cancel the request internally?
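Something like this rough client sketch is what I have in mind (assuming the Python stubs generated from TorchServe's `inference.proto`, i.e. `inference_pb2`/`inference_pb2_grpc`, the `grpc_inference_port=17100` from my config above, and a placeholder image file and `data` key):

```python
# Rough repro sketch: send a prediction and cancel it, or let it hit a short
# deadline, so the server sees a cancelled gRPC call.
import grpc
import inference_pb2        # assumed: generated from TorchServe's inference.proto
import inference_pb2_grpc   # assumed: generated from TorchServe's inference.proto

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("sample.jpg", "rb") as f:  # placeholder input image
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": f.read()})

# Option 1: client-side cancel shortly after sending.
future = stub.Predictions.future(request)
future.cancel()

# Option 2: short deadline so gRPC cancels the call internally.
try:
    stub.Predictions(request, timeout=0.05)
except grpc.RpcError as e:
    print("call ended with", e.code())  # expect DEADLINE_EXCEEDED / CANCELLED
```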

Hi @lxning @msaroufim, thanks for the quick fix. Unfortunately https://github.com/pytorch/serve/pull/1854 doesn't seem to fully mitigate the issue. I built the image from [latest master](https://github.com/pytorch/serve/commit/9a9c2f94020b5c7a2c05b4ec2450cabae5872703) with `./build_image.sh -g` [[reference](https://github.com/pytorch/serve/blob/master/docker/build_image.sh#L40)] and then deployed....

We currently implemented a workaround for GKE: if there's no worker available, just restart the pod (restart torchserve). As you can see, this error can happen very often (8pm-11pm): #workers go 2->1->0->restart->2->1->0,...
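Roughly the kind of check I mean (a simplified sketch, not our exact probe): hit the describe-model endpoint on the management port (17201 in my config above) and fail when no worker is READY, so the pod gets restarted:

```python
# Simplified sketch: exit non-zero when the model has no READY workers, so a
# liveness/exec probe can restart the pod (torchserve). Assumes the management
# port 17201 from my config and the describe-model endpoint GET /models/easyocr.
import json
import sys
import urllib.request

MANAGEMENT_URL = "http://127.0.0.1:17201/models/easyocr"

try:
    with urllib.request.urlopen(MANAGEMENT_URL, timeout=5) as resp:
        desc = json.load(resp)
except Exception as exc:
    print(f"management API unreachable: {exc}")
    sys.exit(1)

workers = desc[0].get("workers", []) if desc else []
ready = [w for w in workers if w.get("status") == "READY"]
print(f"{len(ready)} READY worker(s) out of {len(workers)}")
sys.exit(0 if ready else 1)
```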

@msaroufim @lxning I can successfully repro it on my local machine with the following setup

### setup
- in `config.properties`, I put min/max worker=2 (see the sketch after this list)
- start torchserve locally, waiting for it's...
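For reference, this is roughly how the min/max worker=2 part looks in `config.properties` (a sketch of the per-model models JSON; values other than `minWorkers`/`maxWorkers` are illustrative):

```
models={\
  "easyocr": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 1,\
      "maxBatchDelay": 100,\
      "responseTimeout": 120\
    }\
  }\
}
```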

@lxning I updated my script to accept another parameter `--sleep_time`, so that it can configure how long the client should wait/sleep between requests. New client script:

```python
import sys
import ...
```
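Since the listing cuts the script off, here's a minimal sketch along the same lines (not the full script; again assuming TorchServe's generated gRPC stubs, port 17100 from my config, and a placeholder image):

```python
# Minimal client sketch (illustrative, not the full script): send requests in a
# loop and sleep --sleep_time seconds between them.
import argparse
import time

import grpc
import inference_pb2        # assumed: generated from TorchServe's inference.proto
import inference_pb2_grpc   # assumed: generated from TorchServe's inference.proto

parser = argparse.ArgumentParser()
parser.add_argument("--sleep_time", type=float, default=0.0,
                    help="seconds to wait/sleep between requests")
parser.add_argument("--num_requests", type=int, default=100)
args = parser.parse_args()

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("sample.jpg", "rb") as f:  # placeholder input image
    payload = f.read()

for i in range(args.num_requests):
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": payload})
    try:
        stub.Predictions(request, timeout=1.0)
    except grpc.RpcError as e:
        print(f"request {i} failed: {e.code()}")
    time.sleep(args.sleep_time)  # wait between requests, per --sleep_time
```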