Adding new tokens to a pretrained ASR vocab
Is it possible to add new tokens to the tokeniser of a pretrained model? Say I want to fine-tune on some new data which contains a few new tokens, e.g. punctuation or numbers: can I add these to the decoder vocabulary?
Here I don't want to change the entire vocabulary, so the initial decoder weights remain the same and I can reuse them; for the new tokens, the decoder weights can be randomly initialised and learned during fine-tuning.
The only other way I knew of to add new tokens was to replace the entire tokeniser, but then I am unable to reuse the decoder.
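A minimal PyTorch sketch of that partial weight reuse, assuming the decoder ends in a plain linear projection over the vocabulary (the layer names and sizes below are made up for illustration, not NeMo's actual decoder structure):

import torch
import torch.nn as nn

hidden, old_vocab, new_vocab = 512, 128, 130   # made-up sizes: 2 new tokens added

old_proj = nn.Linear(hidden, old_vocab)        # stands in for the pretrained decoder projection
new_proj = nn.Linear(hidden, new_vocab)        # fresh layer, randomly initialised

with torch.no_grad():
    # Reuse the pretrained weights for the original tokens...
    new_proj.weight[:old_vocab].copy_(old_proj.weight)
    new_proj.bias[:old_vocab].copy_(old_proj.bias)
    # ...while the rows for the new tokens keep their random init
    # and are learned during fine-tuning.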
It is not currently possible to partially update the tokenizer; SentencePiece itself does not support it. It does support adding extra tokens when the tokenizer is first built, but no modification after that step is possible.
You could build a new tokenizer with the exact same number of tokens as the base model in order to load the weights for a good init, but you'll lose the performance of the original model.
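For completeness, a rough sketch of that same-size workaround using NeMo's ASR models; the model name and tokenizer directory are placeholders, and the exact call signatures may differ across NeMo versions:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

# Keep a copy of the pretrained decoder weights before swapping tokenizers.
old_decoder_state = {k: v.clone() for k, v in asr_model.decoder.state_dict().items()}

# Swap in a new tokenizer that was built with *exactly* the same vocab size.
asr_model.change_vocabulary(new_tokenizer_dir="path/to/new_tokenizer", new_tokenizer_type="bpe")

# Because the vocab size is unchanged, the old weights still fit and give a good init,
# but the token ids now mean different things, so expect to lose the original accuracy.
asr_model.decoder.load_state_dict(old_decoder_state)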
@evilc3 Yes, it is possible to add new tokens into a SentencePiece tokenizer. Here is some code:
# Installation
# 1. pip install sentencepiece
# 2. pip install protobuf
#
# 3. To generate sentencepiece_model_pb2.py
# 3.1 git clone https://github.com/google/sentencepiece.git
# 3.2 cd sentencepiece/src
# 3.3 protoc --python_out=. sentencepiece_model.proto
import sys
import sentencepiece_model_pb2 as model
import sentencepiece as spm
if __name__ == "__main__":
    mp = model.ModelProto()
    model_file = sys.argv[1]
    out_file = sys.argv[2]

    # New user-defined symbols and the positions at which to insert them.
    symbols = [{"name": "<sep>", "id": 1}]

    mp.ParseFromString(open(model_file, 'rb').read())
    print(f'Original model pieces: {len(mp.pieces)}')

    for sym in symbols:
        name = sym["name"]
        i = sym["id"]
        new_sym = mp.SentencePiece()
        new_sym.piece = name
        new_sym.score = 0.0  # default score for USER_DEFINED
        new_sym.type = 4     # type value for USER_DEFINED
        mp.pieces.insert(i, new_sym)  # position after default control symbols ("<unk>", "<s>", "</s>")
        print(f'added {name}...')

    print(f'New model pieces: {len(mp.pieces)}')

    with open(out_file, 'wb') as f:
        f.write(mp.SerializeToString())

    # Quick sanity check: round-trip a sentence through the patched model.
    text = "<sep> this is a test"
    sp = spm.SentencePieceProcessor(model_file=out_file)
    encoded_ids = sp.encode(text)
    decoded_text = sp.decode(encoded_ids)
    print(f"[text]: {text}")
    print(f"[encoded_ids]: {encoded_ids}")
    print(f"[decoded_text]: {decoded_text}")
Cool hack of the underlying SentencePiece library. I guess I can correct my statement to "there is no sanctioned, official way offered by SentencePiece to add tokens to a prebuilt tokenizer".
While it is a fun exercise in modifying SentencePiece, please note that NeMo will not support such modifications. You're welcome to write your own wrapping logic, though.
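If you do go the wrapping route, one possible shape is a thin class around the patched .model file; the class and method names below are purely illustrative and not NeMo's official TokenizerSpec interface:

import sentencepiece as spm

class PatchedSentencePieceTokenizer:
    """Hypothetical wrapper exposing the calls an ASR pipeline typically needs."""

    def __init__(self, model_path: str):
        self.sp = spm.SentencePieceProcessor(model_file=model_path)

    @property
    def vocab_size(self) -> int:
        return self.sp.get_piece_size()

    def text_to_ids(self, text: str):
        return self.sp.encode(text)

    def ids_to_text(self, ids):
        return self.sp.decode(ids)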
@naymaraq thanks will try it out.