dialog-nlu
Support TensorRT conversion and serving feature
I realized that TensorFlow Lite does not support inference on an Nvidia GPU. My device is an Nvidia Jetson Xavier. My current inference runs the unoptimized transformers model on the GPU, and it is faster than inference with the TF Lite model on the CPU.
After some research, I found two kinds of model optimization for this, TensorRT and TF-TRT. I made several attempts to convert the fine-tuned transformers model to TensorRT, but I could not get it to work. It would be great if dialog-nlu supported TensorRT conversion and serving.
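For reference, the route I was attempting looks roughly like the standard TF-TRT conversion (just a minimal sketch, not a dialog-nlu feature; the paths are placeholders, and it assumes the fine-tuned model is first exported as a TensorFlow SavedModel on a TensorFlow build with TensorRT support, as in the JetPack builds):

import tensorflow as tf

# Convert an existing SavedModel with TF-TRT, requesting FP16 precision.
params = tf.experimental.tensorrt.ConversionParams(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(
    input_saved_model_dir="path/to/saved_model",   # placeholder path
    conversion_params=params,
)
converter.convert()
converter.save("path/to/saved_model_trt")          # placeholder output path

# The converted SavedModel is then loaded and served like any other.
trt_model = tf.saved_model.load("path/to/saved_model_trt")
infer = trt_model.signatures["serving_default"]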
Hi @redrussianarmy, thank you for sharing your experience. I'll give it a try and let you know.
TFLite doesn't support serving on PC GPUs, but it does support mobile GPUs. I don't know whether it supports the GPUs of all edge devices.
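For context, attaching a GPU delegate to the TFLite interpreter in Python looks roughly like this (only a sketch; the delegate library name is an assumption, and a prebuilt delegate is generally not shipped for PC or Jetson, which is why serving falls back to the CPU there):

import tensorflow as tf

try:
    # The delegate shared library must be built for the target platform;
    # "libtensorflowlite_gpu_delegate.so" is an assumed name, not part of the pip TF package.
    gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
    interpreter = tf.lite.Interpreter(
        model_path="model.tflite",                 # placeholder model path
        experimental_delegates=[gpu_delegate],
    )
except (ValueError, OSError):
    # No usable GPU delegate: fall back to the CPU interpreter (the usual case on PC / Jetson).
    interpreter = tf.lite.Interpreter(model_path="model.tflite")

interpreter.allocate_tensors()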
One question that came to my mind:
Did you try mixing transformers with the layer_pruning feature and TFLite conversion with hybrid_quantization? Something like this:
from dialognlu import TransformerNLU  # import path assumed from the library README

k_layers_to_prune = 4  # try different values

config = {
    # ... other config entries ...
    "layer_pruning": {
        "strategy": "top",
        "k": k_layers_to_prune
    }
}

# Train with the top k transformer layers pruned, then save with a
# hybrid-quantized TFLite conversion.
nlu = TransformerNLU.from_config(config)
nlu.train(train_dataset, val_dataset, epochs, batch_size)
nlu.save(save_path, save_tflite=True, conversion_mode="hybrid_quantization")

# Load the quantized model and predict.
nlu = TransformerNLU.load(model_path, quantized=True, num_process=4)
utterance = "add sabrina salerno to the grime instrumentals playlist"
result = nlu.predict(utterance)
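Layer pruning with the "top" strategy drops the top k transformer layers before fine-tuning, and hybrid quantization stores the weights as 8-bit integers while keeping activations in float, so the resulting TFLite model is smaller and usually faster on the CPU.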
Hi @MahmoudWahdan, thank you for your quick reply.
I have tried mixing transformers with the layer_pruning feature and TFLite conversion with hybrid_quantization as you suggested. Unfortunately, the result is the same: prediction still does not run on the GPU of the Nvidia Jetson Xavier.
I am looking forward to the new TensorRT conversion feature :)
Hi @redrussianarmy Sure, this is something new that I'll try, and it will certainly be useful. I'll keep you updated.