cdQA
BERT model prediction is slow, consider a more practical implementation
There is a new approach from Alibaba described here: ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST
A Deep Cascade Model for Multi-Document Reading Comprehension
It seems faster, with better scalability. I'm not sure there is code available.
Hi @alex-movila
Thank you for recommending this implementation!
Speed and scalability are indeed very important to use cdQA in production. At this stage, our primary focus is to make end-to-end question answering easy for our users but we might need to dig deeper into this new approach in the future, probably as soon as Alibaba has released some code (I couldn't find it yet).
We'll follow their updates on the topic closely 😉
There are some other tips here to make BERT better suited for production: https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/
Thanks for the tips @alex-movila !
However, I am afraid they are not compatible with cdQA for two reasons:
- These tips are for the TF version of BERT, while we use the PyTorch version provided by Hugging Face.
- The bert-as-service project has one main feature: mapping a variable-length sentence to a fixed-length vector. That is a good feature for Information Retrieval (it could be used for the Retriever component of cdQA, for example; for now we use a TF-IDF approach, see the sketch after this list). But it is not useful for the Question Answering part of the pipeline (i.e. the Reader), where we apply the BertForQuestionAnswering model.
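To make the split concrete, here is a minimal, self-contained sketch of the TF-IDF retrieval step using plain scikit-learn rather than cdQA's internal classes; the documents and the query are made up for illustration:

```python
# Minimal sketch of the retrieval idea cdQA currently uses (TF-IDF),
# written with plain scikit-learn instead of cdQA's internal classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "BERT is a transformer-based language model released by Google.",
    "TF-IDF scores terms by their frequency weighted against document frequency.",
    "cdQA combines a retriever and a reader into a single QA pipeline.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)  # one TF-IDF row per document

query = "What does the cdQA pipeline combine?"
query_vec = vectorizer.transform([query])

# Rank documents by cosine similarity to the query and keep the best one;
# the Reader (BertForQuestionAnswering) would then extract the answer span from it.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
print(documents[scores.argmax()])
```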
I will take a deeper look at it to see if I can extract some useful ideas for our package.
I think the idea of quantization + pruning could be useful to make BERT smaller. Also, production needs to handle concurrent requests and should load the model only once at initialization.
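For reference, here is a minimal sketch of the quantization idea using PyTorch's dynamic quantization API; the checkpoint name below is just a placeholder (in practice it would be cdQA's fine-tuned SQuAD reader), and the actual speed/accuracy trade-off would still need to be measured:

```python
# Sketch: shrink the Linear layers of a BERT QA model to int8 for CPU inference.
# "bert-base-uncased" is only a placeholder checkpoint for illustration.
import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,              # model whose layers get replaced
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,
)

# quantized_model is a drop-in replacement for the original model on CPU.
print(quantized_model)
```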
Here is another idea, and I think this one should be easy: implement the reader with AllenNLP: https://allenai.github.io/bi-att-flow/ https://allennlp.org/models
It seems like 2 lines of code... I will try it myself, though the problem remains that it was not trained on non-answerable questions.
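For what it's worth, the "2 lines" would look roughly like this with AllenNLP's pretrained BiDAF model; the archive URL and the best_span_str output key are what AllenNLP documented around that time and may have changed since, so treat them as assumptions:

```python
# Hedged sketch of a BiDAF-based reader via AllenNLP's Predictor API.
from allennlp.predictors.predictor import Predictor

# Pretrained BiDAF archive as published by AllenNLP (URL may have moved since).
predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)

result = predictor.predict(
    passage="cdQA is an end-to-end closed-domain question answering system.",
    question="What is cdQA?",
)
print(result["best_span_str"])
```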
Related: here is another paper, for the XLM model which is available at Hugging Face: Large Memory Layers with Product Keys (https://arxiv.org/abs/1907.05242). It "outperforms a baseline transformer model with 24 layers, while being twice faster at inference time".
Fwiw, using QAPipeline(reader='models/bert_qa_vGPU-sklearn.joblib', predict_batch_size=128, verbose_logging=True).to('cuda') gives me very reasonable inference time over a pretty large set of documents.
Obviously this might not be feasible for everyone, I'm just running this in Colab.
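For anyone who wants to reproduce this, the snippet below is roughly what that looks like end to end, assuming the QAPipeline usage from cdQA's README (fit_retriever on a dataframe of documents, then predict); the CSV path and the query are placeholders:

```python
# Sketch of GPU inference with cdQA's QAPipeline, as described above.
import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

# Placeholder corpus: a dataframe with a "paragraphs" column of text lists.
df = pd.read_csv("data/my_corpus.csv", converters={"paragraphs": literal_eval})

cdqa_pipeline = QAPipeline(
    reader="models/bert_qa_vGPU-sklearn.joblib",
    predict_batch_size=128,
    verbose_logging=True,
).to("cuda")                        # move the BERT reader to the GPU

cdqa_pipeline.fit_retriever(df=df)  # build the TF-IDF retriever over the corpus
prediction = cdqa_pipeline.predict(query="Who maintains cdQA?")
print(prediction)
```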
Well we must consider production where we have 1000 users doing inference concurrently. Also not everyone has GPU.
I think if we can achieve #209 and make cdQA modular (having possibility to chose different retriever/reader impementations) hopefully everyone will be able to have a solution given his/her specific constraints.
Well we must consider production where we have 1000 users doing inference concurrently. Also not everyone has GPU.
For sure on the first point, it obviously doesn't scale. Just wanted to point out that it does work on some level.
That said, I think it might be reasonable to expect the availability of a GPU, even if not a large one. Don't know much about deploying these models, but would inference on CPU really be feasible?
but would inference on CPU really be feasible?
Currently, inference on CPU is feasible but slow (about 10 to 20s per inference depending on the question and the CPU).
We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.
If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃
Would DistilBERT be one way to achieve faster inference? Huggingface have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering
In their blog post they quote 60% speedup in inference time compared to BERT, and that's on a CPU.
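As a rough illustration (not cdQA's integration), the SQuAD-fine-tuned DistilBERT checkpoint can be tried directly with a recent version of transformers to compare reader latency on CPU; the question and context below are made up:

```python
# Hedged sketch: answer extraction with distilbert-base-uncased-distilled-squad,
# using a recent transformers API. Question/context are placeholder examples.
import torch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(name)
model = DistilBertForQuestionAnswering.from_pretrained(name)
model.eval()

question = "What makes DistilBERT faster than BERT?"
context = ("DistilBERT keeps most of BERT's accuracy while using half the "
           "layers, which makes inference noticeably faster.")

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end token positions and decode that span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```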
Now we have ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations
Would DistilBERT be one way to achieve faster inference? Huggingface have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering
In their blog post they quote 60% speedup in inference time compared to BERT, and that's on a CPU.
Yes, it's in our plans to add support for DistilBERT and other Transformer based models as well. We will be adding them soon.
There's also TinyBERT: https://arxiv.org/abs/1909.10351
If they ever publish the model, we can add it to cdQA too.
Thanks for the information @andrelmfarias - is the general procedure for adding new models the one listed in this PR?
Google just released "PAWS: Paraphrase Adversaries from Word Scrambling"; maybe it can benefit cdQA as well?
https://github.com/google-research-datasets/paws
Any news?
I've been testing cdQA for production, but it takes about 7 seconds to answer a question, which is OK but kind of annoying for the user.
Can I add a faster model, like DistilBERT?
Do you know where the bottleneck is? Is it the QA inference with BERT that is slow? What about the retriever step?
I know you are using TF-IDF, but have you experimented with other representations at all?
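One quick way to find out is to profile a single end-to-end prediction; here is a hedged sketch with the standard-library profiler, assuming a cdqa_pipeline object built as in the earlier example:

```python
# Profile one prediction to see whether the retriever or the BERT reader dominates.
# Assumes `cdqa_pipeline` was built as in the earlier QAPipeline example.
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
cdqa_pipeline.predict(query="How long does one prediction take?")
profiler.disable()

# The 20 most time-consuming calls; retriever vs. reader code should stand out.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```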
@andrelmfarias
We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.
If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃
Certainly not everyone will be interested in using GPU for inference, but I am!
I have been experimenting with some pretrained models from the NVIDIA NeMo (neural modules) framework - they are based on Hugging Face pretrained BERT checkpoints but optimized for fine-tuning and inference on GPUs. Since they are based on the Hugging Face models and use a Pytorch backend, I wonder if they could be easily plugged in to cdQA as a Reader?
Any idea where to start if I want to try this? I have a couple engineers who might be interested in making this happen.