What are the requirements for the input text?

Open indrajithi opened this issue 6 years ago • 5 comments

To generate new questions, the following command is used, where config-trans specifies the input text:

$> th translate.lua -model model/<model file name> -config config-trans

1) What are the requirements for the input text? (stop words or any other requirements)

2) For the sentence-level model: what qualifies as a good sentence for generating a question?

3) For the paragraph-level model: on what basis should I split the paragraph into sentences?

For the sentence-level model, I tried to generate questions by splitting the input text into sentences on "." and giving the resulting text file's path in the src field of config-trans.

For the paragraph-level model, I did the same "."-based splitting for the src file, and for the par file I repeated the whole paragraph once for each sentence in src.

Input text: One of the most basic techniques of molecular biology to study protein function is molecular cloning. In this technique, DNA coding for a protein of interest is cloned using polymerase chain reaction (PCR), and/or restriction enzymes into a plasmid ( expression vector). A vector has 3 distinctive features: an origin of replication, a multiple cloning site (MCS), and a selective marker usually antibiotic resistance. Located upstream of the multiple cloning site are the promoter regions and the transcription start site which regulate the expression of cloned gene.

src: One of the most basic techniques of molecular biology to study protein function is molecular cloning. In this technique, DNA coding for a protein of interest is cloned using polymerase chain reaction (PCR), and/or restriction enzymes into a plasmid ( expression vector). A vector has 3 distinctive features: an origin of replication, a multiple cloning site (MCS), and a selective marker usually antibiotic resistance. Located upstream of the multiple cloning site are the promoter regions and the transcription start site which regulate the expression of cloned gene.
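For reference, the preparation I describe above can be sketched in Python roughly as follows. This is only a minimal sketch of what I did, not something the repository documents: the naive split on ".", the one-sentence-per-line layout, and the function names `split_on_period` and `write_src_and_par` are all my own assumptions.

```python
from pathlib import Path
from typing import List


def split_on_period(paragraph: str) -> List[str]:
    """Naively split a paragraph into sentences on '.'.

    Assumption: the text has no abbreviations or decimal numbers,
    which this simple rule would split incorrectly.
    """
    pieces = [piece.strip() for piece in paragraph.split(".")]
    # Drop empty pieces and put the period back on each sentence.
    return [piece + "." for piece in pieces if piece]


def write_src_and_par(paragraph: str, src_path: str, par_path: str) -> int:
    """Write one sentence per line to src_path and, for each sentence,
    one copy of the whole paragraph to par_path. Returns the line count."""
    sentences = split_on_period(paragraph)
    # Collapse whitespace so the paragraph occupies exactly one line.
    one_line_paragraph = " ".join(paragraph.split())
    Path(src_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    Path(par_path).write_text(
        "\n".join([one_line_paragraph] * len(sentences)) + "\n",
        encoding="utf-8",
    )
    return len(sentences)
```

With the input text above this gives four lines in each file, so line i of the par file is the paragraph that contains line i of the src file.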

Where am I going wrong?

Note: From the paper: "DirectIn is an intuitive yet meaningful baseline in which the longest sub-sentence of the sentence is directly taken as the predicted question. To split the sentence into sub-sentences, we use a set of splitters, i.e., {“?”, “!”, “,”, “.”, “;”}."
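As I read it, that baseline amounts to something like the Python sketch below. Measuring length in whitespace-separated tokens is my assumption, since the quoted passage does not say whether "longest" is counted in words or characters, and `direct_in` is a hypothetical name rather than anything from the paper's code.

```python
import re

# The splitter set quoted from the paper: ? ! , . ;
SPLITTER_PATTERN = re.compile(r"[?!,.;]")


def direct_in(sentence: str) -> str:
    """Return the longest sub-sentence of `sentence`.

    Assumption: length is the number of whitespace-separated tokens;
    on ties the earliest sub-sentence wins.
    """
    parts = [part.strip() for part in SPLITTER_PATTERN.split(sentence)]
    parts = [part for part in parts if part]
    return max(parts, key=lambda part: len(part.split()), default="")
```

For the second sentence of my input text, this would pick the middle clause between the two commas, because it has the most tokens.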

indrajithi • Jun 05 '18 10:06

Hi, were you able to get around this? If we could find out how they converted the 'nqg/raw/' files to 'nqg/processed', we would know how to prepare our own input.

roshansridhar • Dec 12 '18 23:12

What's the format of the text file that needs to be replaced in the paragraph/preprocess_embedding.sh file?

Can I use this file? glove.840B.300d.txt

Or is there a way to generate an embedding text file on my own?

SundeepPidugu • Apr 12 '19 07:04

--embedding ../../archive/embeddings/glove.840B.300d.txt needs to be replaced with the location of your pre-trained word-vector file glove.840B.300d.txt, which can be downloaded from here. You can train your own word-vector model, but the pre-trained GloVe vectors were trained on the Common Crawl dataset, which I think is pretty good.
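On the format question: as far as I know, the GloVe text files are plain text with one entry per line, the token followed by its vector components separated by single spaces (300 values per line for this file). Below is a minimal Python sketch for checking that a file follows that layout. I am assuming this is the layout the preprocessing script expects, which I have not confirmed from its source, and `parse_glove_line` and `load_glove` are hypothetical helper names.

```python
from typing import Dict, List, Optional, Tuple


def parse_glove_line(line: str, dim: int = 300) -> Tuple[str, List[float]]:
    """Parse one 'token v1 v2 ... v_dim' line.

    Splitting from the right keeps tokens that themselves contain
    spaces intact (a few such tokens occur in glove.840B.300d.txt).
    """
    parts = line.rstrip("\r\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError("expected a token followed by %d values" % dim)
    return parts[0], [float(value) for value in parts[1:]]


def load_glove(path: str, dim: int = 300,
               limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Load up to `limit` entries from a GloVe-style text file."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            token, vector = parse_glove_line(line, dim)
            vectors[token] = vector
    return vectors
```

Calling `load_glove(path, limit=1000)` on the first lines is a cheap sanity check before running the full preprocessing. A file of your own vectors would need to parse cleanly the same way, with `dim` matching the embedding size the model is configured for.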

indrajithi • Apr 12 '19 07:04

@indrajithi were you able to give your own custom data as input?

suresh96458 • Feb 03 '20 11:02

@xinyadu can you explain how to provide custom input data for predictions? We have no idea how to convert raw data into the format used in the processed folder.

suresh96458 • Mar 19 '20 06:03