zhaohongbo issues

Results: 7 issues of zhaohongbo

How to monitor heterogeneous devices, such as GPUs and FPGAs, on the Polyaxon interface?

How can I monitor heterogeneous devices, such as GPUs and FPGAs, on the Polyaxon web page? Monitoring information includes whether a device is in use, how many devices there are, etc. Thank you.

question

stale

Does OpenVINO backend inference only support synchronous mode? Why not asynchronous mode?

I looked at the code and found that the OpenVINO backend calls `Infer_request_.infer()`, which is synchronous. Can we support asynchronous mode?

Accumulate inference time with an ensemble model is way slower than the slowest individual

**Description**
I met the same problem at https://github.com/triton-inference-server/server/issues/3245

**Triton Information**
22.05

Are you using the Triton container or did you build it yourself?
Triton container

Here are my results, all...

question

Batch normalization

Hi, ResNet50 has a batch normalization layer, but I cannot find a BN kernel in the device client. Why? Thank you.

CopyPackedKernel is taking too long; how can it be optimized?

## Description
I have a model that uses a slice operator for feature crossing, but it turns out that the slice operator calls the CopyPackedKernel API, and it consumes a...

Runtime: Performance

Release: 7.x

triaged

LLaMA 7B model cannot use q2_k

I built llama.cpp from master and converted the https://huggingface.co/decapoda-research/llama-7b-hf model to ggml. I used this command:

```shell
CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q4_k_s.bin 3
```

It works, and I can get the quantized...

The A100 test performance did not match the official test results

I installed vllm with

```shell
pip install vllm
```

then started the server with this command:

```shell
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.api_server --model llama-7b-hf/ --swap-space 16 --disable-log-requests --port 9009
```

benchmark with...