
Tutorial 6: byte pair encoding


As far as I know, byte pair encoding (BPE) is used in the original Transformer model. Why don't you use it?

brand17 (May 01 '20)

It's true: BPE is used in the original Transformer and is now quite common in NLP. I didn't use it because I mainly wanted the tutorials to focus on the models themselves rather than on the pre-processing.

I believe torchtext has an interface to SentencePiece, which should allow BPE to be used easily. I'll look into adding it to the tutorials.
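For reference, here's a minimal sketch of training and applying a BPE model with the sentencepiece library directly (rather than the torchtext wrapper mentioned above). The corpus filename, model prefix, and vocabulary size are illustrative placeholders, not values from the tutorials:

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
# Writes bpe.model and bpe.vocab to disk. "train.en" and the vocab
# size of 8000 are hypothetical choices for this example.
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="bpe",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and tokenize a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
pieces = sp.encode("the quick brown fox", out_type=str)
print(pieces)  # exact merges depend on the training corpus
```

The resulting tokenizer could then stand in for spaCy in the tutorials' `Field` definitions, since it just maps a string to a list of tokens.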

bentrevett (May 01 '20)