cdQA
How to use SQuAD for a Chinese closed-domain QA task
I have three questions. First: Can I directly use SQuAD for a Chinese closed-domain QA task?
Second: If the first is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset that has the same format as SQuAD?
PS: If I end up using the second solution, where can I look for a Chinese dataset in the same format as SQuAD?
Hi, answering your questions:
First: Can I directly use SQuAD for a Chinese closed-domain QA task?
I don't fully understand the question. SQuAD is an English QA dataset, so you would need a Chinese QA dataset instead. Maybe you are asking whether you can use BERT for Chinese? If so, you should try multilingual BERT, although I am not sure about its performance.
Second: If the first is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset that has the same format as SQuAD?
To fine-tune a BERT model on a Chinese dataset, I advise you to use the run_squad.py example in Hugging Face's repository with the bert-base-multilingual-(un)cased version.
PS: If I end up using the second solution, where can I look for a Chinese dataset in the same format as SQuAD?
Unfortunately, I don't have an answer to this question 😞
Hi @weinixuehao
You can use cdQA in Chinese, but it requires some additional work. The idea is to:
- Find a SQuAD-like dataset in Chinese. It should have the same JSON schema as SQuAD. For example, you could use the DuReader QA dataset released by Baidu, but you might need to convert it to SQuAD format.
- Use our notebook to train the reader on your Chinese SQuAD-like dataset. You should instantiate the BERT classes with the Chinese pre-trained language model bert-base-chinese, then fine-tune on your Chinese SQuAD-like dataset.
- Once your reader is built, you can couple it with a retriever that is adapted to the Chinese language (Chinese tokenizer, Chinese stopwords, etc.).
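To make the first step in the list above more concrete, here is a minimal sketch of what "the same JSON schema as SQuAD" means in practice. It converts flat question/answer records into a SQuAD v1.1-style dictionary. The function name `to_squad` and the input keys `title`, `context`, `question` and `answer` are hypothetical and chosen only for illustration; they are not DuReader's actual field names, so a real conversion would first need to map DuReader's records onto something like this.

```python
import json


def to_squad(records, version="1.1"):
    """Convert flat QA records into a SQuAD-v1.1-style dictionary.

    Each record is a dict with the hypothetical keys
    'title', 'context', 'question' and 'answer'.
    Returns (squad_dict, number_of_skipped_records).
    """
    articles = {}  # title -> {context -> list of question/answer entries}
    skipped = 0
    for i, rec in enumerate(records):
        answer = rec["answer"]
        # SQuAD stores the character offset of the answer inside the context.
        start = rec["context"].find(answer) if answer else -1
        if start == -1:
            # The answer is empty or not a literal span of the context.
            skipped += 1
            continue
        paragraphs = articles.setdefault(rec["title"], {})
        qas = paragraphs.setdefault(rec["context"], [])
        qas.append({
            "id": str(i),
            "question": rec["question"],
            "answers": [{"text": answer, "answer_start": start}],
        })
    data = [
        {
            "title": title,
            "paragraphs": [
                {"context": ctx, "qas": qas} for ctx, qas in paragraphs.items()
            ],
        }
        for title, paragraphs in articles.items()
    ]
    return {"version": version, "data": data}, skipped


# Toy records, made up for this example.
records = [
    {"title": "北京", "context": "北京是中华人民共和国的首都。",
     "question": "中华人民共和国的首都是哪里?", "answer": "北京"},
    {"title": "北京", "context": "北京是中华人民共和国的首都。",
     "question": "中国最大的城市是哪个?", "answer": "上海"},  # answer not in context
]
squad, skipped = to_squad(records)
print(json.dumps(squad, ensure_ascii=False, indent=2))
print("skipped:", skipped)
```

Records whose answer is not a literal substring of the context are skipped here, because SQuAD-style readers expect `answer_start` to point at an exact span. Whether that is acceptable for a given dataset is a judgment call.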
Then you should be able to do closed-domain QA on your own Chinese documents.
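As a rough illustration of what "a retriever that is adapted to the Chinese language" could look like, the sketch below ranks documents with TF-IDF over character bigrams, treating a few stopword characters and all punctuation as separators. This is not cdQA's own retriever: the class name `CharNgramRetriever`, the `tokenize` helper and the tiny stopword set are assumptions made for this example, and a real setup would more likely use a proper Chinese word segmenter and a fuller stopword list.

```python
import math
from collections import Counter

# Tiny illustrative stopword set (assumption); real lists are much longer.
STOPWORDS = {"的", "了", "是", "在", "吗", "呢"}


def tokenize(text, n=2):
    """Split text into character n-grams.

    Stopword characters and non-alphanumeric characters (punctuation,
    whitespace) act as separators, so n-grams never span across them.
    Runs shorter than n are kept as a single token.
    """
    tokens, run = [], []
    for ch in text + " ":  # the trailing separator flushes the last run
        if ch.isalnum() and ch not in STOPWORDS:
            run.append(ch)
            continue
        if run:
            if len(run) < n:
                tokens.append("".join(run))
            else:
                tokens.extend(
                    "".join(run[i:i + n]) for i in range(len(run) - n + 1)
                )
            run = []
    return tokens


class CharNgramRetriever:
    """Hypothetical TF-IDF retriever over character n-grams."""

    def __init__(self, documents):
        self.documents = list(documents)
        counts = [Counter(tokenize(d)) for d in self.documents]
        n_docs = len(self.documents)
        doc_freq = Counter(tok for c in counts for tok in c)
        # Smoothed inverse document frequency.
        self.idf = {
            tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()
        }
        self.doc_vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts):
        # L2-normalised TF-IDF vector; unseen tokens are ignored.
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def retrieve(self, query, top_k=1):
        """Return the top_k (document, cosine score) pairs for the query."""
        q = self._vectorize(Counter(tokenize(query)))
        scores = [
            sum(w * d.get(t, 0.0) for t, w in q.items())
            for d in self.doc_vectors
        ]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [(self.documents[i], scores[i]) for i in ranked[:top_k]]


docs = [
    "北京是中华人民共和国的首都。",
    "上海是中国最大的城市之一。",
    "长江是中国最长的河流。",
]
retriever = CharNgramRetriever(docs)
best_doc, score = retriever.retrieve("中国最长的河流是哪条?")[0]
print(best_doc, round(score, 3))
```

Character n-grams avoid the need for a word segmenter, at the cost of noisier tokens. The paragraph returned by a retriever like this would then be passed to the fine-tuned reader.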
@andrelmfarias @fmikaelian Thanks for answering my question! This is what I need.
Hi @fmikaelian, the SQuAD dataset (around 30 MB) is much smaller than the DuReader dataset (around 1 to 2 GB per file). Do I need to convert the whole DuReader dataset to a SQuAD-like dataset for training? Converting and training on all of it may take a long time.