cdQA
BERT model prediction is slow, consider a more practical implementation
There is a new approach from Alibaba described here: ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST
A Deep Cascade Model for Multi-Document Reading Comprehension
It seems faster, with better scalability. I'm not sure there is code available.
Hi @alex-movila
Thank you for recommending this implementation!
Speed and scalability are indeed very important to use cdQA in production. At this stage, our primary focus is to make end-to-end question answering easy for our users but we might need to dig deeper into this new approach in the future, probably as soon as Alibaba has released some code (I couldn't find it yet).
We'll follow their updates on the topic closely 😉
There are some other tips here to make BERT better suited for production: https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/
Thanks for the tips @alex-movila !
However, I am afraid they are not compatible with cdQA for two reasons:
- These tips are for the TF version of BERT, while we use the PyTorch version provided by Hugging Face.
- The bert-as-service project has one main feature: mapping a variable-length sentence to a fixed-length vector. That is a good feature for Information Retrieval (it could be used for the Retriever component of cdQA, for example; for now we use a TF-IDF approach, see the sketch after this list). But it is not useful for the Question Answering part of the pipeline (i.e. the Reader), where we apply the BertForQuestionAnswering model.
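To make the split concrete, here is a minimal, self-contained sketch of the TF-IDF retrieval step using plain scikit-learn rather than cdQA's internal classes; the documents and the query are made up for illustration:

```python
# Minimal sketch of the retrieval idea cdQA currently uses (TF-IDF),
# written with plain scikit-learn instead of cdQA's internal classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "BERT is a transformer-based language model released by Google.",
    "TF-IDF scores terms by their frequency weighted against document frequency.",
    "cdQA combines a retriever and a reader into a single QA pipeline.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)  # one TF-IDF row per document

query = "What does the cdQA pipeline combine?"
query_vec = vectorizer.transform([query])

# Rank documents by cosine similarity to the query and keep the best one;
# the Reader (BertForQuestionAnswering) would then extract the answer span from it.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
print(documents[scores.argmax()])
```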
I will take a deeper look at it to see if I can extract some useful ideas for our package.
I think the idea of quantization + pruning could be useful to make BERT smaller. Also, production needs to handle concurrent requests and should load the model only once at initialization.
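For reference, here is a minimal sketch of the quantization idea using PyTorch's dynamic quantization API; the checkpoint name below is just a placeholder (in practice it would be cdQA's fine-tuned SQuAD reader), and the actual speed/accuracy trade-off would still need to be measured:

```python
# Sketch: shrink the Linear layers of a BERT QA model to int8 for CPU inference.
# "bert-base-uncased" is only a placeholder checkpoint for illustration.
import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,              # model whose layers get replaced
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,
)

# quantized_model is a drop-in replacement for the original model on CPU.
print(quantized_model)
```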
Here is another idea, and I think this one should be easy: implement the reader with AllenNLP: https://allenai.github.io/bi-att-flow/ https://allennlp.org/models
It seems like 2 lines of code... I will try it myself, though the problem remains that it was not trained on non-answerable questions.
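For what it's worth, the "2 lines" would look roughly like this with AllenNLP's pretrained BiDAF model; the archive URL and the best_span_str output key are what AllenNLP documented around that time and may have changed since, so treat them as assumptions:

```python
# Hedged sketch of a BiDAF-based reader via AllenNLP's Predictor API.
from allennlp.predictors.predictor import Predictor

# Pretrained BiDAF archive as published by AllenNLP (URL may have moved since).
predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)

result = predictor.predict(
    passage="cdQA is an end-to-end closed-domain question answering system.",
    question="What is cdQA?",
)
print(result["best_span_str"])
```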
Related: here is another paper, for the XLM model which is available at Hugging Face: Large Memory Layers with Product Keys (https://arxiv.org/abs/1907.05242). It "outperforms a baseline transformer model with 24 layers, while being twice faster at inference time".
Fwiw, using QAPipeline(reader='models/bert_qa_vGPU-sklearn.joblib', predict_batch_size=128, verbose_logging=True).to('cuda') gives me very reasonable inference time over a pretty large set of documents.
Obviously this might not be feasible for everyone, I'm just running this in Colab.
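For anyone who wants to reproduce this, the snippet below is roughly what that looks like end to end, assuming the QAPipeline usage from cdQA's README (fit_retriever on a dataframe of documents, then predict); the CSV path and the query are placeholders:

```python
# Sketch of GPU inference with cdQA's QAPipeline, as described above.
import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

# Placeholder corpus: a dataframe with a "paragraphs" column of text lists.
df = pd.read_csv("data/my_corpus.csv", converters={"paragraphs": literal_eval})

cdqa_pipeline = QAPipeline(
    reader="models/bert_qa_vGPU-sklearn.joblib",
    predict_batch_size=128,
    verbose_logging=True,
).to("cuda")                        # move the BERT reader to the GPU

cdqa_pipeline.fit_retriever(df=df)  # build the TF-IDF retriever over the corpus
prediction = cdqa_pipeline.predict(query="Who maintains cdQA?")
print(prediction)
```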
Well we must consider production where we have 1000 users doing inference concurrently. Also not everyone has GPU.
I think if we can achieve #209 and make cdQA modular (having possibility to chose different retriever/reader impementations) hopefully everyone will be able to have a solution given his/her specific constraints.
Well we must consider production where we have 1000 users doing inference concurrently. Also not everyone has GPU.
For sure on the first point, it obviously doesn't scale. Just wanted to point out that it does work on some level.
That said, I think it might be reasonable to expect the availability of a GPU, even if not a large one. Don't know much about deploying these models, but would inference on CPU really be feasible?
but would inference on CPU really be feasible?
Currently, inference on CPU is feasible but slow (about 10 to 20s per inference depending on the question and the CPU).
We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.
If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃
Would DistilBERT be one way to achieve faster inference? Huggingface have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering
In their blog post they quote 60% speedup in inference time compared to BERT, and that's on a CPU.
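As a rough illustration (not cdQA's integration), the SQuAD-fine-tuned DistilBERT checkpoint can be tried directly with a recent version of transformers to compare reader latency on CPU; the question and context below are made up:

```python
# Hedged sketch: answer extraction with distilbert-base-uncased-distilled-squad,
# using a recent transformers API. Question/context are placeholder examples.
import torch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(name)
model = DistilBertForQuestionAnswering.from_pretrained(name)
model.eval()

question = "What makes DistilBERT faster than BERT?"
context = ("DistilBERT keeps most of BERT's accuracy while using half the "
           "layers, which makes inference noticeably faster.")

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end token positions and decode that span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```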
Now we have ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations
Would DistilBERT be one way to achieve faster inference? Huggingface have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering
In their blog post they quote 60% speedup in inference time compared to BERT, and that's on a CPU.
Yes, it's in our plans to add support for DistilBERT and other Transformer based models as well. We will be adding them soon.
There's also TinyBERT: https://arxiv.org/abs/1909.10351
If they ever publish the model, we can add it to cdQA too.
Thanks for the information @andrelmfarias - is the general procedure for adding new models the one listed in this PR?
Google just released "PAWS: Paraphrase Adversaries from Word Scrambling"; maybe it can benefit cdQA as well?
https://github.com/google-research-datasets/paws
Any news?
I've been testing cdQA for production, but it takes about 7 seconds to answer a question, which is OK but kind of annoying for the user.
Can I add a faster model, like DistilBERT?
Do you know where the bottleneck is? Is it the QA inference with BERT that is slow? What about the retriever step?
I know you are using TF-IDF, but have you experimented with other representations at all?
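One quick way to find out is to profile a single end-to-end prediction; here is a hedged sketch with the standard-library profiler, assuming a cdqa_pipeline object built as in the earlier example:

```python
# Profile one prediction to see whether the retriever or the BERT reader dominates.
# Assumes `cdqa_pipeline` was built as in the earlier QAPipeline example.
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
cdqa_pipeline.predict(query="How long does one prediction take?")
profiler.disable()

# The 20 most time-consuming calls; retriever vs. reader code should stand out.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```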
@andrelmfarias
We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.
If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃
Certainly not everyone will be interested in using GPU for inference, but I am!
I have been experimenting with some pretrained models from the NVIDIA NeMo (neural modules) framework - they are based on Hugging Face pretrained BERT checkpoints but optimized for fine-tuning and inference on GPUs. Since they are based on the Hugging Face models and use a Pytorch backend, I wonder if they could be easily plugged in to cdQA as a Reader?
Any idea where to start if I want to try this? I have a couple engineers who might be interested in making this happen.