
ONNX compatible models

Open helpmefindaname opened this issue 2 years ago • 11 comments

Is your feature/enhancement request related to a problem? Please describe. ONNX support is a frequently requested feature; several issues mention it (https://github.com/flairNLP/flair/issues/2625, https://github.com/flairNLP/flair/issues/2451, https://github.com/flairNLP/flair/issues/2317, https://github.com/flairNLP/flair/issues/1936, https://github.com/flairNLP/flair/issues/1423, https://github.com/flairNLP/flair/issues/999), so I think there is a strong desire in the community to support it. I suppose the usual ONNX compatibility would also make the models compatible with torch.jit (https://github.com/flairNLP/flair/issues/2528) or AWS Neuron (https://github.com/flairNLP/flair/issues/2443).

ONNX brings large improvements in terms of production readiness: it creates a static computational graph that can be quantized and optimized for specific hardware, see https://onnxruntime.ai/docs/performance/tune-performance.html (it claims up to 17x speed-ups).

Describe the solution you'd like I'd suggest an iterative progression, as multiple architecture changes are required:

  1. Split the forward/forward_pass methods so that every model has a method _prepare_tensors, which converts all DataPoints to tensors, and a forward, which takes tensors as input and outputs tensors (e.g. for the SequenceTagger, the forward has the signature def forward(self, sentence_tensor: torch.Tensor, lengths: torch.LongTensor) and returns a single scores tensor); see the sketch after this list. This change allows conversion to ONNX models, but the surrounding logic (like decoding CRF scores, filling in sentence results, extracting tensors) won't be part of them. Also, embeddings won't be part of the ONNX model.
  2. Create the same forward/_prepare_tensors architecture for the embeddings, so that those can be converted too. This would allow converting embeddings to ONNX models, but again without the logic.
  3. Change the architecture so that, for both embeddings and models, the logic part (creating inputs, adding outputs to data points) and the PyTorch part are split, such that the PyTorch model part can be replaced by a converted ONNX model.
  4. Create an end-to-end model wrapper, so that the embeddings and the model can be converted to a single ONNX model and used as such.
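
To make step 1 concrete, here is a minimal sketch of the intended split; the class, dimensions, and file name are made up for illustration, and the real SequenceTagger is of course far more involved:

```python
import torch

class TaggerSketch(torch.nn.Module):
    """Heavily simplified stand-in for a flair sequence labelling model."""

    def __init__(self, embedding_dim: int = 8, num_tags: int = 5):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.linear = torch.nn.Linear(embedding_dim, num_tags)

    def _prepare_tensors(self, sentences):
        # logic part: turn DataPoints into padded tensors; plain Python,
        # so it stays outside the ONNX graph
        lengths = torch.LongTensor([len(s) for s in sentences])
        sentence_tensor = torch.zeros(len(sentences), int(lengths.max()), self.embedding_dim)
        return sentence_tensor, lengths

    def forward(self, sentence_tensor: torch.Tensor, lengths: torch.LongTensor) -> torch.Tensor:
        # pure tensor-in/tensor-out part: this is what torch.onnx.export traces
        scores = self.linear(sentence_tensor)
        mask = torch.arange(scores.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)
        return scores * mask.unsqueeze(-1)

# only the forward part ends up in the ONNX file
model = TaggerSketch()
dummy = (torch.zeros(2, 4, 8), torch.LongTensor([4, 3]))
torch.onnx.export(
    model, dummy, "tagger_sketch.onnx",
    input_names=["sentence_tensor", "lengths"],
    output_names=["scores"],
    dynamic_axes={"sentence_tensor": {0: "batch", 1: "tokens"},
                  "lengths": {0: "batch"},
                  "scores": {0: "batch", 1: "tokens"}},
)
```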

Notice that this would be 4 different PRs, probably all of them very large, and each should be tested a lot before moving on to the next one. I would offer to do the first one and then see how much effort this is / how much time I have for this.

helpmefindaname avatar Feb 20 '22 17:02 helpmefindaname

This would be very useful. Do you have any idea how large a piece of work this is (my gut feeling is: very)?

I can see if we can help with some of the work - I'll be honest, this wouldn't be my own speciality.

The first part is almost finished: https://github.com/flairNLP/flair/pull/2643 is ready for review.

That one was surprisingly straightforward: first think about how to refactor one model, then apply the same to all other models (as it is mostly the same). Only the lemmatization model (encoder-decoder architecture) has increased complexity.

The hardest part is deciding what kind of refactoring to apply; there, it might already be helpful to just discuss/brainstorm how to do it.

I have some thoughts on the open tasks:

  1. The TransformerEmbeddings will likely be a bigger piece, and maybe the flair (pooled) embeddings too; one would convert the lengths and indices to LongTensors to ensure everything is convertible. Also, I think it would make sense to change the architecture so that the Sentence stores the full embedding vector for the whole sequence, instead of the tokens storing their individual embeddings. That way, the forward method of the embeddings could return the already padded sequences, and embeddings.embed could return the raw tensors. We could make _prepare_tensors return a dictionary {embedding_name: tensor} so stacked embeddings have an easy way to handle them separately per embedding (see the sketch after this list).
  2. This one troubles me a lot: the new architecture should make sure that you don't need to load the PyTorch weights if you use the ONNX model, and vice versa. This could be done by splitting each class into two (logic vs. model), but it should also stay easy to implement new models, and splitting them up might make that too complicated.
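
To illustrate the dict idea from point 1, a small sketch with made-up classes (flair's actual embeddings API looks different):

```python
from typing import Dict, List
import torch

class EmbeddingSketch:
    """Made-up sub-embedding: just a name and a tensor-preparation step."""

    def __init__(self, name: str, dim: int):
        self.name, self.dim = name, dim

    def _prepare_tensors(self, sentences: List[List[str]]) -> torch.Tensor:
        # pad all sentences to the longest one; a real embedding would also
        # look up character/token ids here
        max_len = max(len(s) for s in sentences)
        return torch.zeros(len(sentences), max_len, self.dim)

class StackedEmbeddingsSketch:
    """Returns one tensor per sub-embedding, keyed by embedding name."""

    def __init__(self, embeddings: List[EmbeddingSketch]):
        self.embeddings = embeddings

    def _prepare_tensors(self, sentences: List[List[str]]) -> Dict[str, torch.Tensor]:
        # each sub-embedding (or its ONNX replacement) gets exactly its input
        return {e.name: e._prepare_tensors(sentences) for e in self.embeddings}

stacked = StackedEmbeddingsSketch([EmbeddingSketch("glove", 100), EmbeddingSketch("flair", 2048)])
tensors = stacked._prepare_tensors([["Berlin", "is", "nice"], ["Hello"]])
print({name: t.shape for name, t in tensors.items()})
```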

helpmefindaname avatar Mar 09 '22 15:03 helpmefindaname

Is this code refactoring only for making Flair models compatible with ONNX? Or is it possible to quantize Flair models without the use of ONNX, before the code is refactored?

aytugkaya avatar Mar 28 '22 15:03 aytugkaya

As long as you are not using the flair embeddings with a flair version < 0.11, you can apply dynamic quantisation to all flair models that run on CPU. However, you cannot store the quantised models, due to the way embeddings are stored.
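
For illustration, a minimal sketch of such dynamic quantisation (the model name is just an example, and as said, storing the quantised model won't work):

```python
import torch
from flair.data import Sentence
from flair.models import SequenceTagger

# load any flair model and quantise its Linear/LSTM layers to int8 (CPU only)
tagger = SequenceTagger.load("flair/ner-english-fast")
tagger = torch.quantization.quantize_dynamic(
    tagger, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
)

sentence = Sentence("George Washington went to Washington .")
tagger.predict(sentence)
print(sentence)
```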

helpmefindaname avatar Mar 29 '22 18:03 helpmefindaname

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 30 '22 19:07 stale[bot]

ping to revive issue as it isn't dead

helpmefindaname avatar Aug 09 '22 17:08 helpmefindaname

@helpmefindaname Were you able to finish this? I did an export of the German flair model to a single ONNX file ~2 years ago, but need the English version too. Did you make any further progress in this matter?

edoust avatar Sep 11 '22 14:09 edoust

I created a script for ONNX export of the de-pos model; it runs just fine on the onnxruntime on .NET. I will test whether I can also get an export for Core ML to work. In case anyone needs it, you can find it here: https://github.com/edoust/flair/commits/master

edoust avatar Sep 11 '22 22:09 edoust

I created single-file ONNX models from de-pos, flair/upos-multi, and flair/upos-multi-fast that work with variable batch and sentence sizes.

Basically, it first computes the forward and backward embeddings and selects the right embedding tensors from the full embeddings using the forwardIndices and backwardIndices inputs. Then it concatenates the selected tensors and "stripes" them into the final sentence_tensor using the striping input:

| Input | Shape | Example Shape | Example | Description |
| --- | --- | --- | --- | --- |
| forward | characters x sentences | (9,2) | [mapped with char table] | The mapped character input sequence for the forward language model |
| forwardIndices | total_tokens | (4) | [6,14,5,17] | The indices of the embeddings to take from the full embedding tensor |
| backward | characters x sentences | (9,2) | [mapped with char table] | Analogous to forward, for the backward language model |
| backwardIndices | total_tokens | (4) | [14,6,5,17] | Analogous to forwardIndices, for the backward language model |
| striping | total_embeddings | (8) | [0,4,1,5,2,6,3,7] | Used to generate the sentence tensor from the concatenated forward and backward embedding tensors |
| characterLengths | sentences | (2) | [9,4] | Required for keeping dynamic shapes right |
| lengths | sentences | (2) | [2,1] | Required for keeping dynamic shapes right |

The above example values are given for the two short sentences "Pla Gon" and "Xu".
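
For reference, this is how the inputs could be fed from Python via onnxruntime; the file name, the all-zero character ids, and the int64 dtypes are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("de-pos.onnx")  # hypothetical export file
outputs = session.run(None, {
    "forward": np.zeros((9, 2), dtype=np.int64),  # characters mapped with the char table
    "forwardIndices": np.array([6, 14, 5, 17], dtype=np.int64),
    "backward": np.zeros((9, 2), dtype=np.int64),
    "backwardIndices": np.array([14, 6, 5, 17], dtype=np.int64),
    "striping": np.array([0, 4, 1, 5, 2, 6, 3, 7], dtype=np.int64),
    "characterLengths": np.array([9, 4], dtype=np.int64),
    "lengths": np.array([2, 1], dtype=np.int64),
})
print(outputs[0].shape)  # tag scores
```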

@alanakbik @helpmefindaname Does this make sense to you, or is there an easier/better way to achieve a single ONNX model export that includes the embeddings? Did I miss anything? Any feedback would be appreciated

This is the visual model representation (image: de-pos-onnx model graph).

edoust avatar Sep 13 '22 17:09 edoust

Hi @edoust , sorry for the late answer

I think it will take a long time to finish this. So far, the models can be exported without embeddings, and the transformer embeddings themselves can be exported. The way I want to integrate the ONNX export is that you can use torch.onnx.export and then use the exported model within the flair library. For this, quite some architectural changes are required, and I am currently not sure how to handle them best.

For the use case that you want to export to another language (and therefore have to recreate the input/output handling code anyway), I would say that your script looks quite solid. The only thing I wonder about is whether the striping is really necessary: shouldn't it be possible to concatenate the embeddings on the embedding dimension at line https://github.com/flairNLP/flair/compare/master...edoust:flair:master#diff-2cdd6b2846dd6d89526228ebe147fc75f9b0aa7c999593a4ee32db2ae142adfdR74 ?
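
A toy illustration of the two options (4 tokens, embedding size 3; random tensors stand in for the forward/backward embeddings):

```python
import torch

fwd = torch.randn(4, 3)  # forward embeddings, one row per token
bwd = torch.randn(4, 3)  # backward embeddings, one row per token

# striping: stack all 8 embeddings, then interleave with [0,4,1,5,2,6,3,7]
striping = torch.tensor([0, 4, 1, 5, 2, 6, 3, 7])
striped = torch.cat([fwd, bwd], dim=0)[striping]  # shape (8, 3)

# simpler alternative: concatenate along the embedding dimension
concatenated = torch.cat([fwd, bwd], dim=-1)  # shape (4, 6)

# both carry the same values: striping plus a reshape equals the direct concat
assert torch.equal(striped.reshape(4, 6), concatenated)
```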

helpmefindaname avatar Sep 14 '22 17:09 helpmefindaname

Hi @helpmefindaname

thanks for the reply; you are right, the striping is not necessary, thanks for that :)

Regarding the ONNX export, I think it would be great to have the possibility to create single-file ONNX model exports from various Flair models (combining embeddings and tagging model); otherwise, including such a Flair model in any app takes very high effort. Having that option would make the integration into (native) non-Python apps/services much easier.

edoust avatar Sep 16 '22 08:09 edoust

Hi all, I'm interested in this as well, to speed up Flair inference. Do you have any measurements of performance of some models? I'd be interested in GPU vs. CPU vanilla vs. CPU ONNX/TorchScript.

jonashaag avatar Oct 01 '22 19:10 jonashaag

Hi @jonashaag, I did some evaluation for the TransformerEmbeddings in this PR. Notice that the timings heavily depend on your hardware: a cheap CPU will be way slower than a strong CPU, and the same goes for GPUs. In the end, you have to evaluate it yourself for your hardware.
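
For anyone who wants to measure it, a minimal timing sketch (the model name and batch size are just examples):

```python
import time
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-fast")
sentences = [Sentence("Berlin is the capital of Germany .") for _ in range(64)]

start = time.perf_counter()
tagger.predict(sentences, mini_batch_size=32)
print(f"{time.perf_counter() - start:.2f}s for {len(sentences)} sentences")
```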

helpmefindaname avatar Oct 02 '22 21:10 helpmefindaname

I can’t find any numbers there. Can you please point me to them?

jonashaag avatar Oct 03 '22 06:10 jonashaag

sorry, wrong PR, I meant this one: https://github.com/flairNLP/flair/pull/2739

helpmefindaname avatar Oct 05 '22 21:10 helpmefindaname

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 19:03 stale[bot]