cnn-dailymail
Making Dataset for Bengali Language
I have two files in Bengali, article.txt and summary.txt. How can I convert them to the corresponding train.bin, val.bin, and test.bin? I couldn't work out how to process my Bengali corpus for this summarization pipeline. Thanks in advance.
Hi @PrithwirajRizu, your story should look like this:
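# Read both files as UTF-8 so the Bengali text is decoded correctly
with open('article.txt', 'r', encoding='utf-8') as f:
    article = f.read()
with open('summary.txt', 'r', encoding='utf-8') as f:
    summary = f.read()
story = article + '\n\n' + '@highlight' + '\n' + summary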
Then follow this to generate train or test data.
I guess each sentence of the summary should be on a separate line, with each one preceded by its own "@highlight" tag.
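In case it helps, here is a minimal sketch of that per-sentence layout. It assumes the summary already has one sentence per line. The helper name `build_story` is hypothetical and not part of this repo, and the blank lines around each "@highlight" are an assumption modeled on the original CNN/DailyMail .story files.

```python
def build_story(article: str, summary: str) -> str:
    """Join an article and its summary into one story-style string.

    Hypothetical helper: assumes the summary holds one sentence per line.
    """
    # Keep only non-empty summary lines; each one becomes a highlight.
    highlights = [line.strip() for line in summary.splitlines() if line.strip()]

    parts = [article.strip()]
    for sentence in highlights:
        # Each highlight gets its own "@highlight" tag line before it.
        parts.append("@highlight")
        parts.append(sentence)

    # Separate every block with a blank line (assumed layout).
    return "\n\n".join(parts) + "\n"
```

You would pass in the contents of article.txt and summary.txt (read with encoding='utf-8') and write the returned string to a story file, also as UTF-8.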