hgong-snap

Results 18 comments of hgong-snap

Not sure if there's some corner case [here](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L272). `strided_slice` nodes are added to [outputs_to_values](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L218) with `progress=True`, and then the `strided_slice` nodes are deleted [here](https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/tf_utils.py#L272). Then the next loop starts because...
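To make the concern concrete, here's a self-contained toy version of the loop pattern I mean (not the actual tf2onnx code; only the `outputs_to_values` dict and the `progress` flag mirror the linked `tf_utils.py`, while the node model and the "unneeded" cleanup rule are made up for illustration):

```python
# Toy sketch of the constant-folding loop pattern discussed above (illustrative only).
class Node:
    def __init__(self, name, op, inputs, output):
        self.name, self.op, self.inputs, self.output = name, op, inputs, output

def fold(nodes, outputs_to_values, unneeded):
    progress = True
    while progress:
        progress = False
        for node in nodes:
            foldable = all(i in outputs_to_values for i in node.inputs)
            if node.op == "StridedSlice" and foldable and node.output not in outputs_to_values:
                outputs_to_values[node.output] = 0   # pretend we computed the sliced value
                progress = True                      # analogous to tf_utils.py#L218
        for name in list(outputs_to_values):
            if name in unneeded:
                del outputs_to_values[name]          # analogous to tf_utils.py#L272
        # Corner case: if a strided_slice output is added above and then deleted
        # here as "unneeded", the next iteration re-adds it and sets progress=True
        # again, so the loop never terminates.

# e.g. this call would loop forever:
# fold([Node("ss", "StridedSlice", ["c"], "ss:0")], {"c": 1}, {"ss:0"})
```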

Another example of a similar thing: worker `W-9000-easyocr` died, and GPU usage dropped to almost half (2 workers; 1 worker died/got stuck, so only 1 worker is functioning). If I ssh into the...

With that said: 1. I can't tell whether a worker is in a dead/stuck state or not. 2. Even if I know a worker has died, I can't tell which...

Had quite a few similar issues recently, and I think in general all of the issues seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe from the client? or...

Hi @msaroufim here's the mar file and config file:

mar file: https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing

config.properties:
```
service_envelope=body
grpc_inference_port=17100
grpc_management_port=17101
inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202
default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
  "easyocr": {\...
```

For repro, I guess the best way is to cancel the request on the client side? Or set a short-enough gRPC timeout so that gRPC will cancel the request internally?
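Something like this rough client sketch is what I have in mind (assuming the Python stubs generated from TorchServe's `inference.proto`, i.e. `inference_pb2`/`inference_pb2_grpc`, the `grpc_inference_port=17100` from my config above, and a placeholder image file and `data` key):

```python
# Rough repro sketch: send a prediction and cancel it, or let it hit a short
# deadline, so the server sees a cancelled gRPC call.
import grpc
import inference_pb2        # assumed: generated from TorchServe's inference.proto
import inference_pb2_grpc   # assumed: generated from TorchServe's inference.proto

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("sample.jpg", "rb") as f:  # placeholder input image
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": f.read()})

# Option 1: client-side cancel shortly after sending.
future = stub.Predictions.future(request)
future.cancel()

# Option 2: short deadline so gRPC cancels the call internally.
try:
    stub.Predictions(request, timeout=0.05)
except grpc.RpcError as e:
    print("call ended with", e.code())  # expect DEADLINE_EXCEEDED / CANCELLED
```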

Hi @lxning @msaroufim, thanks for the quick fix. Unfortunately https://github.com/pytorch/serve/pull/1854 doesn't seem to fully mitigate the issue. I built the image from [latest master](https://github.com/pytorch/serve/commit/9a9c2f94020b5c7a2c05b4ec2450cabae5872703) with `./build_image.sh -g` [[reference](https://github.com/pytorch/serve/blob/master/docker/build_image.sh#L40)] and then deployed....

We currently implemented a workaround for GKE: if there's no worker available, just restart the pod (restart torchserve). As you can see, this error can happen very often (8pm-11pm): #workers go 2->1->0->restart->2->1->0,...
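Roughly the kind of check I mean (a simplified sketch, not our exact probe): hit the describe-model endpoint on the management port (17201 in my config above) and fail when no worker is READY, so the pod gets restarted:

```python
# Simplified sketch: exit non-zero when the model has no READY workers, so a
# liveness/exec probe can restart the pod (torchserve). Assumes the management
# port 17201 from my config and the describe-model endpoint GET /models/easyocr.
import json
import sys
import urllib.request

MANAGEMENT_URL = "http://127.0.0.1:17201/models/easyocr"

try:
    with urllib.request.urlopen(MANAGEMENT_URL, timeout=5) as resp:
        desc = json.load(resp)
except Exception as exc:
    print(f"management API unreachable: {exc}")
    sys.exit(1)

workers = desc[0].get("workers", []) if desc else []
ready = [w for w in workers if w.get("status") == "READY"]
print(f"{len(ready)} READY worker(s) out of {len(workers)}")
sys.exit(0 if ready else 1)
```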

@msaroufim @lxning I can successfully repro it on my local machine with the following setup

### setup
- in `config.properties`, I put min/max worker=2 (see the sketch after this list)
- start torchserve locally, waiting for it's...
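For reference, this is roughly how the min/max worker=2 part looks in `config.properties` (a sketch of the per-model models JSON; values other than `minWorkers`/`maxWorkers` are illustrative):

```
models={\
  "easyocr": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 1,\
      "maxBatchDelay": 100,\
      "responseTimeout": 120\
    }\
  }\
}
```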

@lxning I updated my script to accept another parameter `--sleep_time`, so that it can configure how long the client should wait/sleep between requests. New client script:

```python
import sys
import ...
```
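Since the listing cuts the script off, here's a minimal sketch along the same lines (not the full script; again assuming TorchServe's generated gRPC stubs, port 17100 from my config, and a placeholder image):

```python
# Minimal client sketch (illustrative, not the full script): send requests in a
# loop and sleep --sleep_time seconds between them.
import argparse
import time

import grpc
import inference_pb2        # assumed: generated from TorchServe's inference.proto
import inference_pb2_grpc   # assumed: generated from TorchServe's inference.proto

parser = argparse.ArgumentParser()
parser.add_argument("--sleep_time", type=float, default=0.0,
                    help="seconds to wait/sleep between requests")
parser.add_argument("--num_requests", type=int, default=100)
args = parser.parse_args()

channel = grpc.insecure_channel("localhost:17100")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("sample.jpg", "rb") as f:  # placeholder input image
    payload = f.read()

for i in range(args.num_requests):
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": payload})
    try:
        stub.Predictions(request, timeout=1.0)
    except grpc.RpcError as e:
        print(f"request {i} failed: {e.code()}")
    time.sleep(args.sleep_time)  # wait between requests, per --sleep_time
```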