
Ingestion utility for unstructured text documents

Open rstrahan opened this issue 6 years ago • 2 comments

Make it easier/quicker to populate QnABot using information contained in existing text documents. Minimize need for manual curation. For example, create an 'ingestion utility' that a) Separates each paragraph in an unstructured text document to create separate QnA items. b) Uses Amazon Comprehend to discover entities and topics - generate automatic questions for each item using extracted keywords. c) Uploads full document to S3, and incorporates link to full document into answer for each generated item.
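A minimal sketch of steps a) to c) above, not an existing QnABot utility. It assumes the full document has already been uploaded and that its URL is passed in as `doc_url`. It also assumes the output should look like `{"qna": [{"qid", "q", "a"}]}`, which is an assumed import shape rather than a confirmed schema. The names `split_paragraphs`, `naive_keywords`, `comprehend_keywords` and `build_items` are hypothetical. `naive_keywords` is an offline stand-in so the sketch runs without AWS. `comprehend_keywords` shows how the boto3 Comprehend `detect_key_phrases` call could be swapped in.

```python
import hashlib
import json
import re
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines, collapse inner whitespace, drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def naive_keywords(paragraph: str, limit: int = 3) -> List[str]:
    """Offline stand-in for keyword extraction: longest distinct words first."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", paragraph)}
    return sorted(words, key=lambda w: (-len(w), w))[:limit]


def comprehend_keywords(paragraph: str, limit: int = 3) -> List[str]:
    """Key phrases from Amazon Comprehend (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the offline path has no AWS dependency

    client = boto3.client("comprehend")
    response = client.detect_key_phrases(Text=paragraph, LanguageCode="en")
    return [phrase["Text"] for phrase in response["KeyPhrases"][:limit]]


def build_items(
    text: str,
    doc_url: str,
    extract: Callable[[str], List[str]] = naive_keywords,
) -> Dict[str, list]:
    """Turn each paragraph into one QnA item with generated questions."""
    items = []
    for index, paragraph in enumerate(split_paragraphs(text), start=1):
        keywords = extract(paragraph)
        if not keywords:
            continue  # nothing to build a question from, leave for manual curation
        digest = hashlib.sha1(paragraph.encode("utf-8")).hexdigest()[:8]
        items.append({
            "qid": f"ingest.{index}.{digest}",  # hypothetical id scheme
            "q": [f"Tell me about {keyword}" for keyword in keywords],
            "a": f"{paragraph}\n\nFull document: {doc_url}",
        })
    return {"qna": items}


if __name__ == "__main__":
    sample = "Amazon Comprehend finds entities.\n\nQnABot stores answers."
    print(json.dumps(build_items(sample, "https://example.com/doc.txt"), indent=2))
```

Passing `extract=comprehend_keywords` would replace the naive word picker with Comprehend key phrases. The question template ("Tell me about ...") is only a placeholder and would likely still need some curation.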

rstrahan avatar Jan 02 '18 13:01 rstrahan

This article came out today about automated ingestion of a corpus and production of the resulting question/answer matches: http://money.cnn.com/2018/01/15/technology/reading-robot-alibaba-microsoft-stanford/index.html. How do we get our hands on that model and feed its results into QnABot?

bigrig2212 avatar Jan 16 '18 01:01 bigrig2212

That is interesting. I could not find a description of the model they used (they probably just have not published it yet). I did find the website for the Stanford Question Answering Dataset (SQuAD): https://rajpurkar.github.io/SQuAD-explorer/. Some of the other models do have a paper describing them.

However, my guess is that these models would require MASSIVE datasets to perform well. Some QnABot uses do have the data that is required, i.e. simple FAQ bots.

Related to #59: another approach is that, instead of ingesting data into Elasticsearch in our format, we change the search mechanism.

JohnCalhoun avatar Jan 16 '18 14:01 JohnCalhoun