ONNX compatible models
Is your feature/enhancement request related to a problem? Please describe. ONNX support is a frequently requested feature; several issues mention it (https://github.com/flairNLP/flair/issues/2625, https://github.com/flairNLP/flair/issues/2451, https://github.com/flairNLP/flair/issues/2317, https://github.com/flairNLP/flair/issues/1936, https://github.com/flairNLP/flair/issues/1423, https://github.com/flairNLP/flair/issues/999), so I think there is a strong desire in the community for it. I suppose the usual ONNX compatibility would also make the models compatible with torch.jit (https://github.com/flairNLP/flair/issues/2528) or AWS Neuron (https://github.com/flairNLP/flair/issues/2443).
ONNX provides large improvements in production readiness: it creates a static computational graph that can be quantized and optimized for specific hardware, see https://onnxruntime.ai/docs/performance/tune-performance.html (it claims up to 17x speedups).
Describe the solution you'd like I'd suggest an iterative progression, as multiple architecture changes are required:
- split the `forward`/`forward_pass` methods, such that all models have a method `_prepare_tensors` which converts all DataPoints to tensors and a `forward` which takes in tensors and outputs tensors (e.g. for the SequenceTagger the forward has the signature `def forward(self, sentence_tensor: torch.Tensor, lengths: torch.LongTensor)` and returns a single tensor `scores`). This change allows conversion to ONNX models; however, the surrounding logic (like decoding CRF scores, filling in sentence results, extracting tensors) won't be included. Also, embeddings won't be part of the ONNX model.
- create the same `forward`/`_prepare_tensors` architecture for embeddings, such that those can be converted too. This would allow converting embeddings to ONNX models, but again without the logic.
- change the architecture so that, for both embeddings and models, the logic part (creating inputs, adding outputs to data points) and the pytorch part are split, such that the pytorch model part can be replaced by a converted ONNX model.
- create an end-to-end model wrapper, such that embeddings & model together can be converted to a single ONNX model and used as such.
Notice that this would be 4 different PRs, probably all of them very large, and each should be tested a lot before moving on to the next one. I would offer to do the first one and then see how much effort this is / how much time I have for this. A rough sketch of the first step is shown below.
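To make the first step concrete, here is a minimal, self-contained sketch of the proposed `forward`/`_prepare_tensors` split. All class and helper names are made up, and the embeddings are faked with random tensors; the actual refactoring may look different:

```python
import torch


class TaggerSketch(torch.nn.Module):
    """Hypothetical model illustrating the forward/_prepare_tensors split."""

    def __init__(self, embedding_dim: int = 8, num_tags: int = 5):
        super().__init__()
        self.linear = torch.nn.Linear(embedding_dim, num_tags)

    def _prepare_tensors(self, sentences):
        # All DataPoint -> tensor logic lives here (embedding lookup, padding);
        # this part is NOT traced into the ONNX graph. Embeddings are faked
        # with random tensors purely for illustration.
        lengths = torch.LongTensor([len(s.split()) for s in sentences])
        sentence_tensor = torch.randn(len(sentences), int(lengths.max()), self.linear.in_features)
        return sentence_tensor, lengths

    def forward(self, sentence_tensor: torch.Tensor, lengths: torch.LongTensor) -> torch.Tensor:
        # Pure tensors in, pure tensors out: this is what torch.onnx.export traces.
        scores = self.linear(sentence_tensor)
        # zero out the padded positions using the lengths
        mask = torch.arange(sentence_tensor.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)
        return scores * mask.unsqueeze(-1)


model = TaggerSketch()
tensors = model._prepare_tensors(["Pla Gon", "Xu"])
torch.onnx.export(
    model,
    tensors,
    "tagger_sketch.onnx",
    input_names=["sentence_tensor", "lengths"],
    output_names=["scores"],
    dynamic_axes={"sentence_tensor": {0: "batch", 1: "tokens"}, "lengths": {0: "batch"}},
)
```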
This would be very useful. Do you have any idea how large a piece of work this is (my gut feel is very)?
I can see if we can help with some of the work - I'll be honest, this wouldn't be my own speciality.
The first part is almost finished: https://github.com/flairNLP/flair/pull/2643 is ready for review.
That one was surprisingly straightforward: first think of how to refactor one model, then apply the same to all other models (as it is mostly the same). Only the lemmatization model (encoder-decoder architecture) has increased complexity.
The hardest part is deciding what kind of refactoring to apply; there it might already be helpful to just discuss/brainstorm how to do it.
I have some thoughts on the open tasks:
- The TransformerEmbeddings will likely be a bigger piece, maybe the flair (pooled) embeddings too; one would convert the lengths and indices to LongTensors to ensure everything is convertible. Also, I think it would make sense to change the architecture so that the Sentence stores the full embedding vector for the whole sequence, instead of the tokens storing their individual embeddings. That way, the forward method of the embeddings could return already-padded sequences, and embeddings.embed could return the raw tensors. We could make `_prepare_tensors` return a dictionary `{embedding_name: tensor}`, so stacked embeddings have an easy way to handle them separated by embedding (see the sketch after this list).
- This one troubles me a lot: the new architecture should be such that you don't need to load the pytorch weights if you use the onnx model, and vice versa. This could be done by splitting the class into two classes (logic vs. model), however it should also stay easy to implement new models, and splitting them up might make it too complicated.
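For the stacked-embeddings point above, a rough sketch of the dictionary idea (all class names and the namespacing scheme are hypothetical):

```python
import torch
from typing import Dict, List


class EmbeddingsSketch:
    """Hypothetical base: subclasses convert sentences to named tensors."""

    name: str

    def _prepare_tensors(self, sentences) -> Dict[str, torch.Tensor]:
        raise NotImplementedError


class StackedEmbeddingsSketch(EmbeddingsSketch):
    def __init__(self, embeddings: List[EmbeddingsSketch]):
        self.embeddings = embeddings

    def _prepare_tensors(self, sentences) -> Dict[str, torch.Tensor]:
        tensors: Dict[str, torch.Tensor] = {}
        for embedding in self.embeddings:
            # namespace each sub-embedding's tensors by its name, so the
            # stacked embedding can route them back to the right forward()
            for key, value in embedding._prepare_tensors(sentences).items():
                tensors[f"{embedding.name}.{key}"] = value
        return tensors
```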
Is this code refactoring only for making Flair models compatible with ONNX? Or is it possible to quantize Flair models without using ONNX, before the code is refactored?
As long as you are not using the flair embeddings with flair version < 0.11, you can apply dynamic quantisation to all flair models that run on CPU. However, you cannot store them, due to the way the embeddings are stored.
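For reference, a minimal sketch of dynamic quantisation on a CPU-loaded tagger (assuming flair >= 0.11; the model name is just an example):

```python
import torch
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/upos-multi-fast")

# swap Linear/LSTM submodules for dynamically quantized int8 versions
quantized = torch.quantization.quantize_dynamic(
    tagger, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
)

sentence = Sentence("Dynamic quantisation keeps CPU inference fast .")
quantized.predict(sentence)
print(sentence)
```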
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
ping to revive issue as it isn't dead
@helpmefindaname Were you able to finish this? I did an export of the German flair model to a single ONNX model ~2 years ago but need the English version too. Did you make any further progress on this?
I created a script for ONNX export for the de-pos model; it is running just fine on the onnxruntime on .NET. I will test whether I can also get an export for Core ML to work. In case anyone needs it, you can find it here: https://github.com/edoust/flair/commits/master
I created single-file ONNX models from `de-pos`, `flair/upos-multi`, and `flair/upos-multi-fast` that work with variable batch and sentence sizes.
Basically it first computes the forward and backward embeddings, selects the right embedding tensors from the total embeddings using the `forwardIndices` and `backwardIndices`, then concatenates the selected tensors and "stripes" them into the final `sentence_tensor` using the `striping` input:
| Input | Shape | Example Shape | Example | Description |
|---|---|---|---|---|
| `forward` | characters x sentences | (9,2) | [mapped with char table] | The mapped character input sequence for the forward language model |
| `forwardIndices` | total_tokens | (4) | [6,14,5,17] | The indices of the embeddings to take from the full embedding tensor |
| `backward` | characters x sentences | (9,2) | [mapped with char table] | Same as `forward`, for the backward language model |
| `backwardIndices` | total_tokens | (4) | [14,6,5,17] | Same as `forwardIndices`, for the backward embeddings |
| `striping` | total_embeddings | (8) | [0,4,1,5,2,6,3,7] | Used to generate the sentence tensor from the concatenated forward and backward embedding tensors |
| `characterLengths` | sentences | (2) | [9,4] | Required for keeping dynamic shapes right |
| `lengths` | sentences | (2) | [2,1] | Required for keeping dynamic shapes right |
The above example values are given for the two short sentences `Pla Gon` and `Xu`.
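To make the `striping` input concrete, here is a small worked example of how such an index interleaves the concatenated embeddings. The values are the ones from the table; the interpretation is my reading of the export script:

```python
import torch

# 4 forward and 4 backward token embeddings, concatenated along dim 0:
# rows are [f0, f1, f2, f3, b0, b1, b2, b3]
fwd = torch.arange(4).unsqueeze(1).float()         # stands in for f0..f3
bwd = (10 + torch.arange(4)).unsqueeze(1).float()  # stands in for b0..b3
concat = torch.cat([fwd, bwd], dim=0)

striping = torch.tensor([0, 4, 1, 5, 2, 6, 3, 7])
interleaved = concat[striping]  # rows become [f0, b0, f1, b1, f2, b2, f3, b7 -> f3, b3]
print(interleaved.squeeze())    # tensor([ 0., 10.,  1., 11.,  2., 12.,  3., 13.])
```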
@alanakbik @helpmefindaname Does this make sense to you, or is there an easier/better way to achieve a single ONNX model export that includes the embeddings? Did I miss anything? Any feedback would be appreciated.
This is the visual model representation:
Hi @edoust, sorry for the late answer.
I think it will take a long time to finish this. So far, the models can be exported without embeddings, and the transformer embeddings themselves can be exported. The way I want to integrate the ONNX export is that you can use `torch.onnx.export` and then use the exported model within the flair library. This requires quite some architectural changes, and I am currently not sure how best to handle them.
For the use case that you want to export it to another language (and therefore have to recreate the input/output handling code anyway), I would say that your script looks quite solid.
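As an illustration, consuming such an export from Python with onnxruntime could look roughly like this. The file name and dtypes are assumptions; the input names and example values are taken from the table above:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("de-pos.onnx")
scores = session.run(
    None,  # return all model outputs
    {
        "forward": np.zeros((9, 2), dtype=np.int64),   # mapped char ids
        "forwardIndices": np.array([6, 14, 5, 17], dtype=np.int64),
        "backward": np.zeros((9, 2), dtype=np.int64),  # mapped char ids
        "backwardIndices": np.array([14, 6, 5, 17], dtype=np.int64),
        "striping": np.array([0, 4, 1, 5, 2, 6, 3, 7], dtype=np.int64),
        "characterLengths": np.array([9, 4], dtype=np.int64),
        "lengths": np.array([2, 1], dtype=np.int64),
    },
)
```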
The only thing is that I wonder whether the `striping` is really necessary? Shouldn't it be possible to concatenate the embeddings on the embedding dimension at line https://github.com/flairNLP/flair/compare/master...edoust:flair:master#diff-2cdd6b2846dd6d89526228ebe147fc75f9b0aa7c999593a4ee32db2ae142adfdR74 ?
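For illustration, the alternative would be something like this (shapes are made up):

```python
import torch

# made-up shapes: (total_tokens, hidden) per direction
fwd = torch.randn(4, 2048)
bwd = torch.randn(4, 2048)

# concatenate along the embedding dimension instead of interleaving
# rows with a striping index:
token_embeddings = torch.cat([fwd, bwd], dim=-1)  # shape (4, 4096)
```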
Hi @helpmefindaname, thanks for the reply. You are right, the `striping` is not necessary, thanks for that :)
Regarding the ONNX export, I think it would be great to have the option to create single-file ONNX model exports from various Flair models (combining embeddings and tagging model); otherwise, including such a Flair model in any app takes a lot of effort. Having such an option would make integration into (native) non-Python apps/services much easier.
Hi all, I'm interested in this as well, to speed up Flair inference. Do you have any measurements of performance of some models? I'd be interested in GPU vs. CPU vanilla vs. CPU ONNX/TorchScript.
Hi @jonashaag, I did some evaluation for the TransformerEmbeddings in this PR. Notice that the times heavily depend on your devices: a cheap CPU will be way slower than a strong CPU, and the same goes for GPUs. In the end, you have to evaluate it yourself on your hardware.
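If you want to measure on your own hardware, a minimal timing sketch (model name, corpus, and batch size are just examples):

```python
import time

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/upos-multi-fast")
sentences = [Sentence("This is a test sentence .") for _ in range(100)]

start = time.perf_counter()
tagger.predict(sentences, mini_batch_size=32)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s for {len(sentences)} sentences")
```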
I can’t find any numbers there. Can you please point me to them?
sorry, wrong PR, I meant this one: https://github.com/flairNLP/flair/pull/2739