zhaohongbo
How can heterogeneous devices, such as GPUs and FPGAs, be monitored on the Polyaxon web page? The monitoring information should include whether each device is in use, how many devices there are, etc. Thank you.
I looked at the code and found that the OpenVINO backend calls `Infer_request_.infer()`, which runs in synchronous mode. Can asynchronous mode be supported?
**Description** I ran into the same problem as in https://github.com/triton-inference-server/server/issues/3245 **Triton Information** 22.05 Are you using the Triton container or did you build it yourself? Triton container Here are my results, all...
Hi, ResNet50 has batch normalization layers, but I cannot find a BN kernel in the device client. Why? Thank you.
## Description I have a model that uses a slice operator for feature crossing, but it turns out that the slice operator calls the CopyPackedKernel API, and it consumes a...
I built llama.cpp from master and converted the https://huggingface.co/decapoda-research/llama-7b-hf model to ggml. I used this command: ```shell CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q4_k_s.bin 3 ``` It works, and I can get the quantized...
I installed vllm with ```shell pip install vllm ``` and then started the server with this command: ```shell CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.api_server --model llama-7b-hf/ --swap-space 16 --disable-log-requests --port 9009 ``` benchmark with...