BERT-pytorch
the format of input
You mentioned that
NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator
and gave an example:
Welcome to the \t the jungle\n I can stay \t here all night\n
However, the example is actually ONE sentence in one line. Should it be:
Welcome to the jungle \t I can stay here all night\n
(suppose these two sentences are continuous in the broader context)
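For illustration, here is a minimal sketch of how one line in this format could be split into its two parts. The function name `parse_line` is hypothetical and is not taken from the BERT-pytorch code; the sketch only assumes that each line contains exactly one tab separator, as the NOTICE describes.

```python
def parse_line(line):
    """Split one corpus line into (first, second) on the tab separator.

    Hypothetical helper: assumes exactly one tab per line.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError("expected exactly one tab separator per line")
    first, second = parts
    # Strip surrounding whitespace so spaces next to the tab do not matter.
    return first.strip(), second.strip()


if __name__ == "__main__":
    print(parse_line("Welcome to the jungle \t I can stay here all night\n"))
```

Both the example from the README and the proposed two-sentence line pass through this function the same way, since it only looks at the tab.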
I mean the two parts can be two pieces of one sentence rather than real, complete sentences. It doesn't matter whether a line holds two sentences or one split sentence. This example comes from the original paper, so you can choose whichever you want 👍
I am interested in the prediction of next sentence. If the input data are all continuous sentences, how can the model randomly select 50% for the continuous and 50% for the discontinuous?
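One common way to get such a split, sketched here under assumptions: keep the true second part about half the time, and otherwise replace it with the second part of a different, randomly chosen line. This thread does not confirm that BERT-pytorch does it this way; `sample_pair` and its parameters are hypothetical names used only for illustration.

```python
import random


def sample_pair(pairs, index, rng=random):
    """Return (first, second, is_next) for the pair at `index`.

    Hypothetical sketch: with probability 0.5 the real continuation is
    kept (is_next = 1); otherwise the second part is taken from another
    line of the corpus (is_next = 0).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two lines to draw a negative example")
    first, second = pairs[index]
    if rng.random() < 0.5:
        return first, second, 1
    # Pick a different line uniformly at random.
    other = rng.randrange(len(pairs) - 1)
    if other >= index:
        other += 1
    return first, pairs[other][1], 0


if __name__ == "__main__":
    corpus = [
        ("Welcome to the jungle", "I can stay here all night"),
        ("first part", "second part"),
        ("another start", "another end"),
    ]
    print(sample_pair(corpus, 0, random.Random(0)))
```

With this kind of sampling the corpus itself only needs continuous pairs; the discontinuous half is produced when examples are drawn.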
I also think it would be better to mention that spaces are needed around '\t'; otherwise we will end up with more vocabulary entries.
Yeah, this is clear then.