cnn-dailymail

Making Dataset for Bengali Language

Open PrithwirajRizu opened this issue 5 years ago • 2 comments

I have two files in Bengali: article.txt and summary.txt. How can I convert them into the corresponding train.bin, val.bin, and test.bin files? I can't work out how to process my Bengali corpus for this summarization pipeline. Thanks in advance.

PrithwirajRizu, Jul 16 '19 06:07

Hi @PrithwirajRizu, your story file should look like this:

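# Read both files as UTF-8 so the Bengali text decodes correctly on any platform.
with open('article.txt', 'r', encoding='utf-8') as f:
    article = f.read()
with open('summary.txt', 'r', encoding='utf-8') as f:
    summary = f.read()

# A story is the article, a blank line, then the summary under an '@highlight' tag.
story = article + '\n\n' + '@highlight' + '\n' + summary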

Then follow this to generate train or test data.

sagorbrur, Sep 10 '19 20:09

I guess each sentence of the summary should be on a separate line, with the sentences separated by the "@highlight" tag.

senjed, Jun 16 '20 04:06
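
Putting the two replies together, here is a minimal sketch of how the story files might be produced. It rests on assumptions that the thread does not confirm: that article.txt and summary.txt are line-aligned (line i of one belongs to line i of the other), that Bengali sentences end with a danda (।), "?" or "!", and that each summary sentence goes under its own '@highlight' tag. The names split_sentences, make_story, write_stories and the .story file naming are hypothetical helpers for illustration, not part of the repository.

```python
import re
from pathlib import Path

# Assumption: a sentence ends with a danda, '?' or '!' followed by whitespace.
SENTENCE_END = re.compile(r'(?<=[।?!])\s+')


def split_sentences(text):
    """Split text into sentences after a danda, question mark or exclamation mark."""
    return [s.strip() for s in SENTENCE_END.split(text.strip()) if s.strip()]


def make_story(article, summary):
    """Return the article followed by one '@highlight' block per summary sentence."""
    parts = [article.strip()]
    for sentence in split_sentences(summary):
        parts.append('@highlight')
        parts.append(sentence)
    return '\n\n'.join(parts) + '\n'


def write_stories(article_path, summary_path, out_dir):
    """Write one .story file per line-aligned (article, summary) pair.

    Pairs where either line is blank are skipped. Returns the number written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(article_path, encoding='utf-8') as fa, \
         open(summary_path, encoding='utf-8') as fs:
        pairs = [(a, s) for a, s in zip(fa, fs) if a.strip() and s.strip()]
    for i, (article, summary) in enumerate(pairs):
        story = make_story(article, summary)
        (out / f'{i:06d}.story').write_text(story, encoding='utf-8')
    return len(pairs)
```

Calling write_stories('article.txt', 'summary.txt', 'stories') would then fill a stories directory that can be handed to the preprocessing step mentioned above. If your files each hold a single article instead, call make_story on the full file contents directly.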