
OpenAI API completions endpoint - Not working as expected

Open anandnandagiri opened this issue 1 year ago • 8 comments

I have downloaded the Llama 3.2 1B model from Hugging Face with optimum-cli:

optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct llama3.2-1b/1

Below are the files downloaded (screenshot of the exported files attached).

Note: I manually removed openvino_detokenizer.bin, openvino_detokenizer.xml, openvino_tokenizer.xml and openvino_tokenizer.bin to ensure there is only one .bin and one .xml file in the version 1 folder.

I ran Model Server with the command below, making sure the Windows WSL path is correct and passing the Docker parameters for the Intel Iris GPU:

docker run --rm -it -v %cd%/ovmodels/llama3.2-1b:/models/llama3.2-1b --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --model_path /models/llama3.2-1b --model_name llama3.2-1b --rest_port 8000

I ran the command below, which worked perfectly:

curl --request GET http://172.17.0.3:8000/v1/config

Below is the output:

{ "llama3.2-1b" : { "model_version_status": [ { "version": "1", "state": "AVAILABLE", "status": { "error_code": "OK", "error_message": "OK" } } ] }

But the curl command below for the OpenAI API completions endpoint did not work as expected:

curl http://172.17.0.3:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-1b", "prompt": "This is a test", "stream": false }'

It returns the error: {"error": "Model with requested name is not found"}

anandnandagiri avatar Oct 02 '24 00:10 anandnandagiri

Hello @anandnandagiri

You are trying to serve the model directly, with no continuous batching pipeline. In that scenario the model is exposed for single inference via the standard TFS/KServe APIs, with no text generation loop. To use the text generation flow via the OpenAI completions API, please refer to the Continuous Batching Demo. Just follow the steps and swap the model to Llama 3.2 1B.
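
For context, a rough sketch of what the continuous batching deployment looks like, assuming the config-based layout used later in this thread (the config file name and paths below are illustrative, not the demo verbatim):

# Serve via a config file that registers the MediaPipe graph (graph.pbtxt) for the model,
# instead of pointing --model_path at a bare IR. The tokenizer/detokenizer models exported
# by optimum-cli are used by the text generation pipeline, so keep them in place.
docker run --rm -it -v %cd%/ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --config_path /ovmodels/config.json --rest_port 8000

# With the graph loaded, the completions call from the question should be routed to it:
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-1b", "prompt": "This is a test", "stream": false}'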

dkalinowski avatar Oct 03 '24 13:10 dkalinowski

Thank You @dkalinowski

anandnandagiri avatar Oct 03 '24 13:10 anandnandagiri

@dkalinowski

I followed the steps in the Continuous Batching Demo. Below are the errors I am getting. I don't see any GPU resource issue (image attached for reference). I am testing on an Intel i7 11th-generation processor.

When I run the code directly on the GPU it works fine, but with Model Server it does not:

docker run --rm -it -v %cd%\ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --config_path ovmodels/model_config_list.json --rest_port 8000

[2024-10-03 14:13:45.386][1][serving][info][server.cpp:75] OpenVINO Model Server 2024.4.28219825c
[2024-10-03 14:13:45.386][1][serving][info][server.cpp:76] OpenVINO backend c3152d32c9c7
[2024-10-03 14:13:45.386][1][serving][info][pythoninterpretermodule.cpp:35] PythonInterpreterModule starting
Python version 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
[2024-10-03 14:13:45.559][1][serving][info][pythoninterpretermodule.cpp:46] PythonInterpreterModule started
[2024-10-03 14:13:45.766][1][modelmanager][info][modelmanager.cpp:125] Available devices for Open VINO: CPU, GPU
[2024-10-03 14:13:45.768][1][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9178
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:37] Will start 32 REST workers
[2024-10-03 14:13:45.776][1][serving][info][http_server.cpp:269] REST server listening on port 8000 with 32 threads
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-10-03 14:13:45.776][1][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-10-03 14:13:45.777][1][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-10-03 14:13:45.777][1][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-10-03 14:13:45.796][1][serving][info][mediapipegraphdefinition.cpp:419] MediapipeGraphDefinition initializing graph nodes
Inference requests aggregated statistic:
Paged attention % of inference execution: -nan
MatMul % of inference execution: -nan
Total inference execution secs: 0

[2024-10-03 14:15:05.783][1][serving][error][llmnoderesources.cpp:169] Error during llm node initialization for models_path: /ovmodels/llama3.2-1b/1 exception: Exception from src/inference/src/cpp/remote_context.cpp:68: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:201: [GPU] out of GPU resources

[2024-10-03 14:15:05.783][1][serving][error][mediapipegraphdefinition.cpp:468] Failed to process LLM node graph llama3.2-1b
[2024-10-03 14:15:05.783][1][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: llama3.2-1b state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent:
[2024-10-03 14:15:05.784][1][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-10-03 14:15:05.785][115][modelmanager][info][modelmanager.cpp:1087] Started cleaner thread
[2024-10-03 14:15:05.784][114][modelmanager][info][modelmanager.cpp:1068] Started model manager thread

(screenshot of GPU utilization, attached for reference)

anandnandagiri avatar Oct 03 '24 14:10 anandnandagiri

@anandnandagiri How much memory do you have assigned to WSL? From Linux there might be less memory available to the GPU. Try reducing the cache size in graph.pbtxt, which in the demo is set to 8 GB. Try a lower value like 4 or even less.
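
For reference, a minimal sketch of the relevant LLM node options in graph.pbtxt (field names match the graph posted later in this thread; the value is only an example and depends on how much memory the GPU actually gets under WSL):

node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
        models_path: "/ovmodels/llama3.2-1b/1",
        cache_size: 4,   # KV cache size in GB; lower this if the GPU runs out of resources
        device: "GPU"
    }
}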

dtrawins avatar Oct 04 '24 23:10 dtrawins

@dtrawins it worked well when I changed the cache size to 2 in graph.pbtxt.

Info: I am using WSL2 on Windows 10 Pro with no .wslconfig present, so it runs with the default configuration. I am using Docker Desktop to run Model Server (no separate Linux distro) through the command prompt.

Could you help with the following?

  1. Is there a link to documentation on graph.pbtxt? (To my surprise, if I remove --volume /usr/lib/wsl:/usr/lib/wsl, the GPU Model Server container does not run at all.)
  2. I am not able to run, or at least convert, text embedding models such as https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 to OpenVINO format for a vector store using "optimum-cli export openvino". Can optimum-cli convert it, and if so, does Model Server support it?

anandnandagiri avatar Oct 05 '24 20:10 anandnandagiri

@anandnandagiri The graph documentation can be found here: https://github.com/openvinotoolkit/model_server/blob/main/docs/llm/reference.md

The mount parameters you used are required to make the GPU accessible in the container on WSL. That is documented here: https://github.com/openvinotoolkit/model_server/blob/main/docs/accelerators.md#starting-a-docker-container-with-intel-integrated-gpu-intel-data-center-gpu-flex-series-and-intel-arc-gpu

Regarding embeddings, we just added support for the OpenAI API embeddings endpoint. You can check the demo: https://github.com/openvinotoolkit/model_server/blob/main/demos/embeddings/README.md It documents the export from HF to deploy the model in OVMS. The nomic-embed-text model should work fine.
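
As a hedged sketch of the export step for the model mentioned above (these are standard optimum-cli flags, but check the embeddings demo for the exact command it uses):

# nomic-embed-text-v1.5 ships custom modeling code, hence --trust-remote-code (assumption based on the HF model card)
optimum-cli export openvino --model nomic-ai/nomic-embed-text-v1.5 --task feature-extraction --trust-remote-code ovmodels/nomic-embed-text-v1.5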

dtrawins avatar Oct 10 '24 21:10 dtrawins

@dtrawins I have followed the embeddings demo. I see a few issues with configuring graph.pbtxt, config.json and subconfig.json with the Docker image. Did I miss anything?

config.json and folder structure (screenshot)

graph.pbtxt

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovmodels/llama3.2-1b/1",
          plugin_config: '{}',
          enable_prefix_caching: false
          cache_size: 4,
          block_size: 16,
          dynamic_split_fuse: false,
          max_num_seqs: 25,
          max_num_batched_tokens:2048,          
          device: "GPU"
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}

subconfig.json

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "Alibaba-NLP/gte-large-en-v1.5-embeddings",
            "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-embeddings"
        },
        {
            "name": "Alibaba-NLP/gte-large-en-v1.5-tokenizer",
            "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-tokenizer"
        }
    ]
}

I am using the GPU Docker image; below is the command:

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/config.json --rest_port 8000

I am getting the following error:

(screenshot of the error)

anandnandagiri avatar Oct 12 '24 10:10 anandnandagiri

@anandnandagiri I think the graph.pbtxt in your gte-large-en-v1.5 folder should contain the graph specific to embeddings; it looks like you copied the graph from the LLM pipeline. The graph file defines which calculators are applied and how they are connected. Try copying the one from the embeddings demo.
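
Roughly, the layout this implies, reconstructed from the subconfig.json posted above (names follow this thread, not the demo verbatim):

ovmodels/
├── config.json                      # top-level config passed via --config_path
└── gte-large-en-v1.5/
    ├── graph.pbtxt                  # embeddings graph from the demo (EmbeddingsCalculator), not the LLM graph
    ├── subconfig.json               # declares the *-embeddings and *-tokenizer servables
    └── models/
        ├── gte-large-en-v1.5-embeddings/1/
        └── gte-large-en-v1.5-tokenizer/1/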

dtrawins avatar Oct 17 '24 22:10 dtrawins

I am still getting errors, even though I used the graph.pbtxt from the embeddings demo:

(screenshot of the error)

anandnandagiri avatar Oct 21 '24 11:10 anandnandagiri

@anandnandagiri There was a recent simplification of the docker command in the demo that dropped the --cpu_extension parameter (https://github.com/openvinotoolkit/model_server/commit/d720af74fa644ac8a57fa351e84e37ba824e9b62), which I assume you followed. However, it requires building the image from the latest main branch. You could either rebuild the image or add the parameter --cpu_extension /ovms/lib/libopenvino_tokenizers.so to the docker run command.

dtrawins avatar Oct 21 '24 20:10 dtrawins

@dtrawins I followed all the steps mentioned above, but I renamed the models folder to gte-large-en-v1.5 to standardize it and to run multiple models. Below is a screenshot of the folder structure:

(screenshot of the folder structure)

Docker Command

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/configembed.json --rest_port 8000 --cpu_extension /ovms/lib/libopenvino_tokenizers.so

I see the error below (screenshot):

anandnandagiri avatar Oct 24 '24 03:10 anandnandagiri

@anandnandagiri The message "Unable to find Calculator EmbeddingsCalculator" suggests that you are using the latest release, which does not support embeddings yet. This work has not been released in the public Docker image yet; you need to build the Docker image from main.
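
A rough sketch of building the image from main (the exact make target and flags may differ between versions; treat this as an assumption and check the build-from-source instructions in the repository):

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make release_image GPU=1   # assumed invocation; the repo's build docs give the authoritative command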

atobiszei avatar Oct 25 '24 08:10 atobiszei

Embeddings are supported starting from the 2024.5 release.

dtrawins avatar Mar 03 '25 12:03 dtrawins