BERT-pytorch

the format of input

Open eveliao opened this issue 7 years ago • 4 comments

You mentioned that

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

and gave an example:

Welcome to the \t the jungle\n I can stay \t here all night\n

However, the example is actually ONE sentence in one line. Should it be:

Welcome to the jungle \t I can stay here all night\n

(assuming these two sentences are consecutive in the broader context)
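
For reference, the two layouts being compared can be sketched as follows. This assumes each corpus line is split on its single tab into a (first, second) pair; `parse_corpus_line` is a hypothetical helper for illustration, not a function taken from the repository.

```python
from typing import Tuple


def parse_corpus_line(line: str) -> Tuple[str, str]:
    """Split one corpus line on its single tab separator into two text segments."""
    # Hypothetical helper: assumes exactly one tab per line.
    first, second = line.rstrip("\n").split("\t")
    return first.strip(), second.strip()


# README-style example: each line holds two pieces of ONE sentence.
readme_style = "Welcome to the \t the jungle\n"
# Proposed alternative: each line holds TWO consecutive sentences.
proposed_style = "Welcome to the jungle \t I can stay here all night\n"

print(parse_corpus_line(readme_style))
print(parse_corpus_line(proposed_style))
```

Both layouts parse the same way; the only difference is whether the two segments are halves of one sentence or two full sentences.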

eveliao • Oct 29 '18 11:10

I mean it could be two pieces of one sentence, not necessarily two real sentences. It doesn't matter whether a line holds two sentences or one sentence split in two. This example comes from the original paper, so you can choose whichever you want 👍

codertimo • Oct 29 '18 13:10

I am interested in next sentence prediction. If every line of the input data holds two consecutive sentences, how can the model randomly end up with 50% consecutive pairs and 50% non-consecutive pairs?

JustinLin610 • Dec 11 '18 05:12
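
One common way to get such a 50/50 split is to do it at data-loading time rather than in the corpus file: for each line, keep the real second segment about half the time, and otherwise swap in the second segment of a randomly chosen line. Below is a minimal sketch under that assumption. `sample_nsp_pair` and the in-memory `corpus` list are hypothetical names, and this is not claimed to be the repository's exact code.

```python
import random
from typing import List, Tuple


def sample_nsp_pair(
    corpus: List[Tuple[str, str]], index: int, rng: random.Random
) -> Tuple[str, str, int]:
    """Return (first, second, is_next) for one training example."""
    first, second = corpus[index]
    if rng.random() < 0.5:
        # Keep the true continuation: label 1 ("is next").
        return first, second, 1
    # Otherwise pair with the second segment of a random line: label 0.
    # Note: the random line may occasionally be the same line; a stricter
    # sampler would re-draw in that case.
    random_second = corpus[rng.randrange(len(corpus))][1]
    return first, random_second, 0


corpus = [
    ("Welcome to the jungle", "I can stay here all night"),
    ("The cat sat down", "It fell asleep quickly"),
    ("It started to rain", "We went back inside"),
]
rng = random.Random(0)
print(sample_nsp_pair(corpus, 0, rng))
```

With this approach the corpus itself only needs to contain consecutive pairs; the negative examples are produced by the sampler.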

I also think it would be better to mention that there need to be spaces around '\t'; otherwise we will end up with more vocabulary entries.

andy-yangz • Dec 11 '18 06:12
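
To illustrate the concern, here is a small sketch that assumes the vocabulary builder deletes the tab character and then splits on whitespace. That preprocessing is an assumption and is not confirmed anywhere in this thread; `count_tokens` is a hypothetical helper.

```python
from collections import Counter


def count_tokens(line: str) -> Counter:
    """Count whitespace-separated tokens after removing newline and tab characters."""
    # Assumed preprocessing: the tab is deleted, not replaced by a space.
    return Counter(line.replace("\n", "").replace("\t", "").split())


with_spaces = "Welcome to the jungle \t I can stay here all night\n"
without_spaces = "Welcome to the jungle\tI can stay here all night\n"

print(sorted(count_tokens(with_spaces)))
print(sorted(count_tokens(without_spaces)))  # contains the fused token "jungleI"
```

Under that assumption, omitting the spaces fuses the last word of the first segment with the first word of the second, which adds spurious entries to the vocabulary. If the tab were instead treated as ordinary whitespace, the spaces would make no difference.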

Yeah, this is clear then.

I mean it could be two pieces of one sentence, not necessarily two real sentences. It doesn't matter whether a line holds two sentences or one sentence split in two. This example comes from the original paper, so you can choose whichever you want 👍

PandasPan • Apr 24 '21 01:04