Cannot load vocabulary for SentencePieceBPETokenizer
I'm doing some experiments with `SentencePieceBPETokenizer` and `BertWordPieceTokenizer`. I cannot load the vocabulary from the SentencePiece vocabulary model `sentencepiece.bpe.model`:
```js
const { promisify } = require("util");
const { SentencePieceBPETokenizer, BertWordPieceTokenizer } = require("tokenizers");

// Wrap the callback-based encode/decode methods in promises
const encoder = (tokenizer) => promisify(tokenizer.encode.bind(tokenizer));
const decoder = (tokenizer) => promisify(tokenizer.decode.bind(tokenizer));

const spTokenizer = await SentencePieceBPETokenizer.fromOptions({ vocabFile: "./sentencepiece.bpe.model" });
const encode = encoder(spTokenizer);
const decode = decoder(spTokenizer);

const encoded = await encode("Who is John?", "John is a teacher");
console.log(encoded.getIds());
const decoded = await decode(encoded.getIds(), true);
console.log(decoded);
```
In this case I get empty output from `encoded.getIds()`.
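I suspect the file format: my understanding is that `SentencePieceBPETokenizer.fromOptions` reads a plain-text BPE vocabulary plus a merges file, not the binary SentencePiece protobuf. A hedged sketch, where `vocab.json` and `merges.txt` are assumed exports of the same model (not files I actually have):

```js
// Hedged sketch: load from a JSON vocab plus a merges file instead of the
// binary .model protobuf. vocab.json and merges.txt are assumed file names,
// not files shipped with the original model.
const spBpe = await SentencePieceBPETokenizer.fromOptions({
  vocabFile: "./vocab.json",
  mergesFile: "./merges.txt",
});
const encodeBpe = encoder(spBpe);
const checkIds = (await encodeBpe("Who is John?")).getIds();
console.log(checkIds); // non-empty ids would confirm the file format was the problem
```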
If I instead try `BertWordPieceTokenizer` with the plain-text vocabulary file `vocab.txt`, like this:
```js
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });
const encode = encoder(wordPieceTokenizer);
const decode = decoder(wordPieceTokenizer);

const encoded = await encode("Hello, y'all! How are you 😁 ?");
console.log(encoded.getIds());
console.log(encoded.getTokens());
const decoded = await decode(encoded.getIds(), true);
console.log(decoded);
```
I get the ids and tokens:
```
[
   101, 19082,  117,
   194,   112, 1155,
   106,  1293, 1132,
  1128,   100,  136,
   102
]
[
  '[CLS]', 'hello', ',',
  'y',     "'",     'all',
  '!',     'how',   'are',
  'you',   '[UNK]', '?',
  '[SEP]'
]
```
but the decoded string is still empty.
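As a temporary workaround, a readable string can be rebuilt from the tokens directly; a minimal sketch, assuming standard WordPiece conventions (`##` marks a continuation of the previous word, bracketed entries are special tokens):

```js
// Hedged fallback: manual WordPiece detokenization.
// Drops [CLS]/[SEP]/[PAD], glues "##" continuations to the previous token,
// and keeps [UNK] as a visible placeholder.
const detokenize = (tokens) =>
  tokens
    .filter((t) => !["[CLS]", "[SEP]", "[PAD]"].includes(t))
    .map((t) => (t.startsWith("##") ? t.slice(2) : " " + t))
    .join("")
    .trim();

console.log(detokenize(encoded.getTokens())); // hello , y ' all ! how are you [UNK] ?
```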
It seems to be related to https://github.com/huggingface/tokenizers/issues/291.
I have found a way to make it work here.
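Independently of that, a quick sanity check is to bypass `promisify` and call `decode` through its raw callback form (an error-first callback is assumed here, since that is what `promisify` expects):

```js
// Hedged sanity check: if this prints text while the promisified call does
// not, the problem sits in the wrapper rather than in the binding itself.
wordPieceTokenizer.decode(encoded.getIds(), true, (err, text) => {
  if (err) return console.error("decode failed:", err);
  console.log("decoded:", text);
});
```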