
[docs] A complete example of a basic string-processing model and of invoking it via `curl`

vadimkantorov opened this issue 2 years ago • 24 comments

I have the following echo-like models/modelA/1/model.py.

How can I call it from a command-line using curl?

curl -i -X POST localhost:8000/api/infer/modelA/1 -H "Content-Type: application/octet-stream" -H 'NV-InferRequest:batch_size: 1 input { name: "INPUT0" } output { name: "OUTPUT0" }' --data 'hello'

This gives HTTP 400.

I think I am not formatting the --data argument correctly. But it should not be very difficult for UTF-8 encoding, right?

Model code:

import json
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        auto_complete_model_config.add_input( {"name": "INPUT0",  "data_type": "TYPE_UINT8", "dims": [-1]})
        auto_complete_model_config.add_output({"name": "OUTPUT0", "data_type": "TYPE_UINT8", "dims": [-1]})
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def execute(self, requests):
        responses = []
        for request in requests:
            in_numpy = pb_utils.get_input_tensor_by_name(request, 'INPUT0').as_numpy()
            in_str = str(bytes(in_numpy), 'utf8')
            
            out_str = 'modelA:' + in_str
            out_numpy = np.frombuffer(bytes(out_str, 'utf8'), dtype = np.uint8)
            out_pb = pb_utils.Tensor('OUTPUT0', out_numpy)

            responses.append(pb_utils.InferenceResponse(output_tensors = [out_pb]))
        return responses

Currently I can successfully call it using:

import numpy as np
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient("localhost:8000")
model_name = 'modelA'

input_arr = np.frombuffer(bytes('hello', 'utf8'), dtype = np.uint8)
inputs = [httpclient.InferInput("INPUT0", input_arr.shape, "UINT8")]
inputs[0].set_data_from_numpy(input_arr, binary_data=True)

res = triton_client.infer(model_name=model_name, inputs=inputs)

output_arr = res.as_numpy('OUTPUT0')
output_str = str(bytes(output_arr), 'utf8')

print(output_str)

but I would like to use curl, as it's simpler for demonstration purposes and might be easier for debugging string-processing models.

vadimkantorov · Sep 20 '23

I somehow solved it: https://github.com/vadimkantorov/tritoninfererenceserverstringprocprimer, but it would be nice to include something like this (the simplest possible curl-calling of a string-processing model) as an example in the README

vadimkantorov · Sep 21 '23

Thanks for solving this and sharing your code! We just released a generate endpoint (documentation here) that should hopefully address this exact use case. I think you can use any input name instead of "text_input".

We are looking at options to make sending simple requests to Triton easier, so we will definitely be looking at this in the near future. If we do not do that soon, we can look at adding this as an example in our documentation.
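For instance (just a sketch with assumed names), for a model named my_model whose config declares a string input tensor named text_input and a string output tensor named text_output, a call can look like:

curl -X POST localhost:8000/v2/models/my_model/generate -d '{"text_input": "hello"}'
# the model's output tensors come back as fields of the response JSON, e.g. a "text_output" field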

the-david-oy · Oct 19 '23

@dyastremsky Does this generate endpoint support batch inference? E.g. providing a batch of text inputs and getting a batch of text outputs back?

It would also be nice to have natively supported JSON inputs (including a batch/array of JSON inputs). I think this is very common in post-processing. So it would be nice to have native support without having to manually serialize/deserialize from bytes/strings.

It's also important to have some extremely simple format for calling over HTTP. Currently the request must contain a prefix specifying the body length (required by gRPC, I guess?), which is not very elegant. It would be nice to support a mode without this, for prototyping and kicking off requests from the command line without having to calculate the length of the input string. It would also be nice to support a mode where this meta-info is specified in HTTP headers, since curl has nice, native ways of providing extra HTTP headers.

vadimkantorov · Oct 21 '23

@dyastremsky Could you please point to an example of model code that supports such a generate method? Is anything required in Triton's command-line options or build-from-source options to enable the generate extension? Should I pass --enable-generate to ./build.py?

Thank you :)

vadimkantorov · Oct 22 '23

Hi Vadim! Great questions. As far as getting the generate extension working, I think you just need to build off main or wait for the 23.10 release. CC: @GuanLuo

The generate endpoint is meant to simplify basic LLM inference and not meant to be run as a highly performant solution in production. So things like batching inputs on the client side are not supported and I don't expect we'll be adding more complicated use cases.

We do have in our queue to look into JSON support and easier client interaction. We also always encourage and appreciate contributions to the project.

the-david-oy · Oct 23 '23

Supporting batches may sometimes be useful from the coding/UX/uniformity standpoint, even if it doesn't increase perf per se. This can also be useful for some string post-processing.

vadimkantorov · Oct 23 '23

Thanks, Vadim! I received a clarification from the team that this should theoretically support batching if an array of inputs is provided for each batched input. However, we do not specifically test or document it. We have created a ticket to test and document batching with the generate endpoint to ensure it works and is available.

Ticket reference number: DLIS-5717.

the-david-oy · Oct 26 '23

Regarding the generate endpoint, it would also be great to have a model code example :) It seems to be missing from the docs PR you linked above.

vadimkantorov · Oct 29 '23

Noted, thanks for the suggestion Vadim!

the-david-oy · Oct 30 '23

@vadimkantorov Have you looked at the example in the inference protocol documentation? That is the example of sending an inference request via curl. Your example is trying to use the binary data extension, which requires an additional header, Inference-Header-Content-Length, for the server to parse the request.
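For reference, a minimal sketch of such a binary-data-extension request against the uint8 model at the top of this thread (modelA with INPUT0/OUTPUT0) can look roughly like this; the JSON request goes first in the body, the raw tensor bytes follow, and Inference-Header-Content-Length carries the byte length of the JSON part (details such as the Content-Type header may vary):

# request JSON (ASCII only here, so ${#json} equals its byte length)
json='{"inputs":[{"name":"INPUT0","shape":[5],"datatype":"UINT8","parameters":{"binary_data_size":5}}],"outputs":[{"name":"OUTPUT0","parameters":{"binary_data":false}}]}'
# body = JSON request immediately followed by the 5 raw bytes of "hello"
printf '%s%s' "$json" 'hello' > body.bin
curl -i -X POST localhost:8000/v2/models/modelA/infer \
    -H "Content-Type: application/octet-stream" \
    -H "Inference-Header-Content-Length: ${#json}" \
    --data-binary @body.bin
# with binary_data set to false for the output, OUTPUT0 comes back in the JSON
# response as an array of uint8 values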

If your model expects a string/bytes type instead of uint8, you don't need to put the integer representation of the string in the data field; you can put the string itself directly, i.e. ... "data": ["hello"]. And with the inference protocol, you can send a batched request by properly setting the shape and tensor data, i.e. "shape": [2, 1], "data": [["hello"], ["another text"]].
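For example, a sketch of such a batched request, assuming a hypothetical model named my_string_model that declares a BYTES (TYPE_STRING) input named INPUT0 with a batch dimension (dims [1] plus a non-zero max batch size):

curl -i -X POST localhost:8000/v2/models/my_string_model/infer \
    -H 'Content-Type: application/json' \
    --data '{"inputs":[{"name":"INPUT0","shape":[2,1],"datatype":"BYTES","data":[["hello"],["another text"]]}]}'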

As @dyastremsky mentioned, you may send batched requests via the generate endpoint if you orchestrate your model properly: if the batch dimension is the only dynamic dimension of all I/Os, the server will map JSON entries to model inputs in a batch. The inference protocol gives you more control when your use case is complex. The generate protocol is meant to provide a simple way to execute the model without needing to specify the tensor representation, which imposes restrictions and makes it difficult to use for other use cases.

GuanLuo · Oct 31 '23

For now, I managed to make it work via this binary data extension - so I am not blocked.

Have you looked at the example in the inference protocol documentation?

I did look at this, but I did not understand how to pass a string in the data field and how to read it back as a string in the model code (I think there was no complete code example).

As @dyastremsky mentioned, you may send batched requests via the generate endpoint

For string post-processing, this would be perfect, but again I could not find a complete model code example (implementing this generate endpoint) so far.

Then again, what would be nice is an example of the generate endpoint supporting batches of text inputs, and an endpoint supporting batches of arbitrary JSON inputs/outputs (maybe already supported via this Predict Protocol - Version 2? A complete example of model code + calling the endpoint with curl would be perfect).

vadimkantorov · Oct 31 '23

E.g. the generate endpoint docs stress using text_input as the field name, but the code in https://github.com/triton-inference-server/server/blob/9da513528e34a9b91a216cd7e1b668fee9fbbc92/qa/L0_http/generate_models/mock_llm/1/model.py uses a different field name, PROMPT. Are these field names standardized wrt the generate endpoint, or are they fully custom?

Also, in that code example the strings stay encoded as numpy tensors; what would be great is getting a regular Python string out of them, which would demonstrate how to do proper decoding (regarding utf-8/utf-16, etc.).

vadimkantorov · Nov 07 '23

PROMPT was what we were originally using for the vLLM tutorials. The field names should be fully custom, so you can use this endpoint with different models.
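As a sketch with hypothetical tensor names: assuming the JSON fields simply follow the model's own input/output tensor names, a model whose config declares an input PROMPT and an output RESPONSE would be called like

curl -X POST localhost:8000/v2/models/my_model/generate -d '{"PROMPT": "hello"}'
# and the response JSON would then carry a "RESPONSE" field instead of "text_output"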

Which code example? This endpoint is meant to be a quality-of-life improvement for easy testing over the command line. There is no intent to build it out as another avenue for speaking with Triton programmatically, as far as I know. However, I'll defer to @GuanLuo as to which of these features we want to look into building out.

the-david-oy · Nov 07 '23

Which code example?

Maybe I'm not well-versed in Triton lingo or docs structure (what exactly does an endpoint interface mean in Triton's context? That's why I thought it only supports a field named text_input). What I meant is a complete Python-backend code example implementing a generate-callable model which grabs a proper Python string from the incoming inference requests, transforms it somehow, and then serializes a Python string into an inference response.

vadimkantorov · Nov 07 '23

Maybe I basically lack understanding of how TYPE_STRING works and what encoding is assumed and used under the hood?

vadimkantorov · Dec 07 '23

So I created the following example, which seems to work. But it's still unclear how to pass a batch of examples in a single InferenceRequest, especially using the generate endpoint. I also discovered that no special command-line or build-time switch is needed to include the generate endpoint.

curl -i -X POST localhost:8000/v2/models/modelC/generate -d '{"text_input": "Hello"}'
#HTTP/1.1 200 OK
#Content-Type: application/json
#Transfer-Encoding: chunked
#{"model_name":"modelC","model_version":"1","text_output":"modelC: Hello World"}

curl -i -X POST localhost:8000/v2/models/modelC/infer --header 'Content-Type: application/json' --data-raw '{"inputs":[ { "name": "text_input", "shape": [1], "datatype": "BYTES", "data":  ["Hello"]  }  ] }'
#HTTP/1.1 200 OK
#Content-Type: application/json
#Content-Length: 140
#{"model_name":"modelC","model_version":"1","outputs":[{"name":"text_output","datatype":"BYTES","shape":[1],"data":["modelC: Hello World"]}]}
# cat models/modelC/1/model.py

import json
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.model_name = args['model_name']

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        auto_complete_model_config.add_input( {"name": "text_input",  "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.add_output({"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def execute(self, requests):
        responses = []
        for request in requests:
            in_numpy = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            assert np.object_ == in_numpy.dtype, 'in this demo, triton passes in a numpy array of size 1 with object_ dtype'
            assert 1 == len(in_numpy), 'in this demo, only supporting a single string per inference request'
            assert bytes == type(in_numpy.tolist()[0]), 'this object encapsulates a byte-array'
            str1 = in_numpy.tolist()[0].decode('utf-8')
            str2 = self.model_name + ': ' + str1 + ' World'

            out_numpy = np.array([str2.encode('utf-8')], dtype = np.object_)
            out_pb = pb_utils.Tensor("text_output", out_numpy)
            responses.append(pb_utils.InferenceResponse(output_tensors = [out_pb]))
        return responses

vadimkantorov · Dec 27 '23

But it's still not very clear which encoding will be used by bash/curl for sending the text, and how the accepted JSON byte blob will be passed to the model. Does it do any string transcoding? Is the byte array passed to the model "as is"? Does Triton ensure that the passed string is UTF-8? Is there any way for the user to specify in the input JSON which text encoding is used?

It would be nice to have these questions addressed in the documentation. The encoding question is very important for all non-English usage, as both UTF-8 and UTF-16/UTF-32 can be found in real life, especially if the data is coming from some third-party source (like storage / a database).
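One way I can probably probe this empirically with the modelC example above (just a sketch; curl should send the request body bytes as-is, which from my shell means UTF-8):

curl -s -X POST localhost:8000/v2/models/modelC/generate -d '{"text_input": "héllo мир"}'
# if text_output comes back with the same characters intact, the UTF-8 bytes survived the
# JSON -> BYTES tensor -> JSON round trip without transcoding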

vadimkantorov · Dec 27 '23

Hi Vadim! Apologies for dropping off on this.

@GuanLuo, can you provide more details on the encoding? It sounds like it may be helpful to expand the documentation a bit to explain this for non-English usage.

the-david-oy · Feb 20 '24

Yeah, currently TYPE_STRING makes you think that it's about actually passing strings, while it seems to simply represent a byte array. At the very least, it needs some clear explanation (as the word STRING evokes questions about encodings).

What's unclear is what happens if we have some Unicode symbols in the passed-in string in the JSON object, or some \u1234-like escaped code points?

And yep, it would help to simplify this scenario of passing actual Python strings in/out (and maybe setting the encoding in the model config as utf-8) for the simplest use cases.

vadimkantorov · Feb 20 '24

Thanks for clarifying! We filed a ticket to look into adding more details to the documentation.

Ref: 6189

the-david-oy · Feb 21 '24

@dyastremsky Does the generate endpoint support submitting a list/batch of requests in one go (via curl)? This would let it support an OpenAI-like API.

Currently, in https://github.com/vadimkantorov/tritoninfererenceserverstringprocprimer/ I made the generate endpoint work with a single dict request:

curl -i -X POST localhost:8000/v2/models/modelC/generate -d '{"text_input": "Hello"}'

Instead, I'd like to have:

curl -i -X POST localhost:8000/v2/models/modelC/generate -d '[{"text_input": "Hello1"},{"text_input": "Hello2"}]'

vadimkantorov · Jun 05 '24

Thanks, Vadim! I received a clarification from the team that this should theoretically support batching if an array of inputs is provided for each batched input.

@dyastremsky Does this mean that a transposed scheme is supported instead (though still not documented)? E.g.

curl -i -X POST localhost:8000/v2/models/modelC/generate -d '{"text_input": ["Hello1", "Hello2"]}'

In general, if we'd like to process custom/ad-hoc fields in the JSON request, is it possible to map them somehow to an InferenceRequest (e.g. even providing the whole JSON request as some serialized input, which I can decode and interpret on my own)?

vadimkantorov · Jun 05 '24

Thanks for the question, Vadim. Redirecting to @GuanLuo, who would know more about batching here.

the-david-oy · Jun 05 '24

@GuanLuo Would you please consider this?

sadransh · Jun 19 '24