transformer-deploy

HF pipeline based inference

Open kamalkraj opened this issue 2 years ago • 9 comments

Hi @pommedeterresautee,

Please let me know your thoughts on converting text classification to an HF pipeline as well, similar to the token classification and QA pipelines. I can work on this feature.

Thanks

kamalkraj avatar Jul 05 '22 19:07 kamalkraj

Hi @kamalkraj,

Thank you for your proposition.

In token classification and QA there is a mechanism to transform the scores output by the model into something a bit more actionable (extracting spans, etc.). It seems to me that classification is simpler: we can just reuse the scores directly. What do you think would be the reason for switching to a pipeline-based model?

Kind regards, Michaël

NB: I probably miss something obvious, as I have no experience with pipelines.

pommedeterresautee avatar Jul 05 '22 20:07 pommedeterresautee

Hi @pommedeterresautee,

If the model directly outputs raw scores, the client using the model also has to maintain the index-to-label mapping. Whenever the model behind the server changes, we need to make sure the label-to-index mapping stays in sync between client and server. If the server returns the actual label together with its score, integration is much easier and less error-prone.
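
For illustration (the model name below is just an example), here is the difference in practice: with raw scores the client has to keep its own copy of the mapping, while a pipeline resolves the label from the model config on the server side.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Raw scores: the caller has to know that index 0 == NEGATIVE and 1 == POSITIVE,
# i.e. it must keep a copy of model.config.id2label in sync with the server.
inputs = tokenizer("I love this movie", return_tensors="pt")
predicted_index = model(**inputs).logits.argmax(-1).item()
print(model.config.id2label[predicted_index])

# Pipeline: label and score come back together, nothing to keep in sync client-side.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("I love this movie"))  # [{'label': 'POSITIVE', 'score': ...}]
```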

Thanks

kamalkraj avatar Jul 05 '22 20:07 kamalkraj

Currently, this lib only supports single-sentence classification. We could also add support for sentence-pair models trained on data like https://huggingface.co/datasets/snli
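
As a sketch of what sentence-pair support could look like (the model below is just an example of an NLI model trained on SNLI/MNLI-style data), the HF text-classification pipeline already accepts text pairs:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")  # example model
result = nli({"text": "A man is playing a guitar.",
              "text_pair": "A person is making music."})
print(result)  # e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```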

kamalkraj avatar Jul 05 '22 20:07 kamalkraj

Hi @kamalkraj, to keep you updated, we are thinking of writing our own CUDA kernels and running them from PyTorch directly (without any ONNX / TRT in between), hoping to reach decent performance (at least close to ONNX Runtime's). If this works (which is not guaranteed at all), we would no longer need to convert models from one framework to another.

Btw, what do you think of such an approach (if it works and is totally transparent to the final user, i.e. pip install XXX and then optimize(model))? Would it be an issue for your use cases not to have an ONNX/TRT plan artefact? How do you balance ease of use and performance?

pommedeterresautee avatar Jul 18 '22 06:07 pommedeterresautee

Hi @pommedeterresautee,

By your own CUDA kernels, do you mean something like DeepSpeed?

kamalkraj avatar Jul 18 '22 10:07 kamalkraj

Yes, but much simpler to use (a user should even be able to compose their own fused kernel without knowing CUDA) and, if possible, less monolithic (not layer-wide). Also closer to vanilla PyTorch: basically some fused kernels, with the original code replaced through FX. Still, same spirit as DeepSpeed inference and TorchDynamo: stay in Python during the productionization step.
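
To make that concrete, here is a minimal sketch of the FX-rewriting idea (the `optimize` entry point and `fused_gelu` below are hypothetical placeholders, not an existing API): trace the model, then swap selected call sites for fused implementations.

```python
import torch
import torch.fx as fx
from torch import nn


def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a hand-written fused CUDA kernel; calls the stock
    # implementation so the sketch stays runnable.
    return torch.nn.functional.gelu(x)


def optimize(model: nn.Module) -> nn.Module:
    # Hypothetical entry point matching the "pip install XXX then optimize(model)" idea.
    graph_module = fx.symbolic_trace(model)
    for node in graph_module.graph.nodes:
        # Replace every functional GELU call site with the fused kernel.
        if node.op == "call_function" and node.target is torch.nn.functional.gelu:
            node.target = fused_gelu
    graph_module.graph.lint()
    graph_module.recompile()
    return graph_module


class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return torch.nn.functional.gelu(self.linear(x))


optimized = optimize(TinyBlock())
print(optimized.code)  # the GELU call site now points at fused_gelu
```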

pommedeterresautee avatar Jul 18 '22 10:07 pommedeterresautee

Okay. One more question: after optimization, how is the model integrated? Through Triton Inference Server, or by direct integration into the program?

kamalkraj avatar Jul 18 '22 12:07 kamalkraj

What do you mean by direct integration? For inference, whatever the user wants: maybe Triton through BLS, TorchServe, or Ray Serve (never tested). The point is to be as light and invisible as possible.
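
In case it helps, a minimal sketch of the "Triton through BLS" option: a Python-backend model that forwards the request to another deployed model and post-processes its output. The model and tensor names here ("transformer_model", "input_ids", "logits") are only placeholders.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")

            # Business Logic Scripting: call the optimized model from Python code.
            infer_request = pb_utils.InferenceRequest(
                model_name="transformer_model",
                requested_output_names=["logits"],
                inputs=[input_ids],
            )
            infer_response = infer_request.exec()
            logits = pb_utils.get_output_tensor_by_name(infer_response, "logits")

            # Post-processing (e.g. mapping argmax to a label) would go here.
            responses.append(pb_utils.InferenceResponse(output_tensors=[logits]))
        return responses
```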

pommedeterresautee avatar Jul 18 '22 12:07 pommedeterresautee

Thank you for the clarification. But I don't see why somebody would write a custom CUDA kernel to achieve performance similar to a model exported with torch.onnx.export. I know the ONNX export has limitations, but most of the time it works fine. Do you have any specific model or use case in mind?
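
For reference, a minimal example of the standard torch.onnx.export path being compared against (model name only illustrative), with dynamic axes so batch size and sequence length stay flexible:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

dummy = tokenizer("export me", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)
```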

kamalkraj avatar Jul 19 '22 06:07 kamalkraj