transformer-deploy
HF pipeline based inference
Hi @pommedeterresautee,
Please let me know your thoughts on converting text classification to an HF pipeline as well, similar to the token classification and QA pipelines. I can work on this feature.
Thanks
Hi @kamalkraj,
Thank you for your proposition.
In token classification and QA there is a mechanism to transform the scores output by the model into something a bit more actionable (extract spans, etc.). It seems to me that classification is a bit simpler: we can just reuse the scores directly. What do you think would be the reason for switching to a pipeline-based model?
Kind regards, Michaël
NB: I am probably missing something obvious, as I have no experience with pipelines.
Hi @pommedeterresautee,
If the model directly outputs scores, the client who uses this model also needs to maintain the index-to-label mapping. Whenever we change the model on the server, we also need to ensure the index-to-label mapping between client and server stays in sync. If the server/model output is the actual label with a score, it will be much easier to integrate and less error-prone.
Thanks
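For illustration, here is a minimal sketch of the pipeline behaviour described above (the checkpoint name is only an example, not part of transformer-deploy): the id2label mapping stored in the model config is applied on the model side, so the client receives labels rather than raw indices.

```python
# Minimal sketch: the text-classification pipeline reads id2label from the model
# config and returns human-readable labels, so the client never handles indices.
# The checkpoint name below is only an example.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("This library is great"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```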
Currently, this lib only supports single-sentence classification. We could also add support for models trained on datasets like https://huggingface.co/datasets/snli
Hi @kamalkraj, to keep you updated, we are thinking about writing our own CUDA kernels and running them in PyTorch directly (without any ONNX / TRT in between), hoping to reach decent performance (at least close to ONNX Runtime's). If this works (which is not guaranteed at all), we would no longer need to convert models from one framework to another.
Btw, what do you think of such an approach (if it works and is totally transparent to the final user, like `pip install XXX` and then `optimize(model)`)? Would it be an issue for your use cases to not have an ONNX/TRT plan artifact? How do you balance ease of use and performance?
Yes, but much simpler to use (even the user, if they want to, should be able to compose their own fused kernel without knowing CUDA) and, if possible, less monolithic (not layer-wide). Also more vanilla PyTorch (basically some fused kernels, with the original code replaced via FX). Still... the same spirit as DeepSpeed Inference and TorchDynamo -> stay in Python during the deployment step.
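To make the idea concrete, here is a minimal, purely illustrative sketch of an `optimize(model)` step along these lines: FX traces the model and swaps a vanilla module for a "fused" replacement while staying in plain PyTorch. `TinyModel`, `FusedLayerNorm`, and the replacement logic are hypothetical placeholders, not transformer-deploy code, and no real fused kernel is involved.

```python
# Illustrative sketch only: use torch.fx to replace nn.LayerNorm call sites with a
# placeholder "fused" module, staying entirely in PyTorch (no ONNX/TRT artefact).
import torch
import torch.fx
import torch.nn as nn


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(self.norm(x))


class FusedLayerNorm(nn.Module):
    """Stand-in for an optimized kernel; a real implementation would copy the
    original module's weights and call a custom fused CUDA kernel."""

    def __init__(self, normalized_shape):
        super().__init__()
        self.inner = nn.LayerNorm(normalized_shape)

    def forward(self, x):
        return self.inner(x)


def optimize(model: nn.Module) -> nn.Module:
    """Trace the model with FX and rewrite every nn.LayerNorm call site."""
    graph_module = torch.fx.symbolic_trace(model)
    for node in graph_module.graph.nodes:
        if node.op == "call_module":
            submodule = graph_module.get_submodule(node.target)
            if isinstance(submodule, nn.LayerNorm):
                fused_name = node.target + "_fused"
                graph_module.add_submodule(
                    fused_name, FusedLayerNorm(submodule.normalized_shape)
                )
                node.target = fused_name  # point the call site at the fused module
    graph_module.graph.lint()
    graph_module.recompile()
    return graph_module


optimized = optimize(TinyModel())
print(optimized(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```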
Okay. One more question: after optimization, how is the model integrated? Using Triton Inference Server, or direct integration into the program?
What do you mean by direct integration?
For inference, whatever the user wants: maybe Triton through BLS, TorchServe, or Ray Serve (never tested).
The point is to be as light and invisible as possible.
Thank you for the clarification.
But I don't know why somebody would write a custom CUDA kernel to achieve performance similar to a model exported with `torch.onnx.export`. I know the ONNX export has limitations, but most of the time it works fine.
Do you have any specific model or use case?
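For reference, the `torch.onnx.export` path mentioned above looks roughly like this. This is a minimal sketch; the checkpoint name, axis names, and opset version are illustrative choices, not something prescribed by transformer-deploy.

```python
# Minimal sketch of exporting a HF sequence classification model with
# torch.onnx.export. Checkpoint, axes, and opset are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

inputs = tokenizer("ONNX export example", return_tensors="pt")

torch.onnx.export(
    model,
    args=(inputs["input_ids"], inputs["attention_mask"]),
    f="classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)
```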