
Adding new tokens to a pretrained ASR vocab

Open evilc3 opened this issue 2 years ago • 2 comments

Is it possible to add new tokens to the tokenizer of a pretrained model? Say I want to fine-tune on some new data that contains a few new tokens, e.g. punctuation or numbers. Can I add these to the decoder vocabulary?

Here I don't want to change the entire vocabulary, so the initial decoder weights stay the same and I can reuse them; for the new tokens, the decoder weights can be randomly initialised and learned during fine-tuning.

The only other way I knew to add new tokens was to replace the entire tokenizer, but then I am unable to reuse the decoder.

evilc3 avatar Aug 08 '22 11:08 evilc3

It is not currently possible to partially update the tokenizer; SentencePiece itself does not support it. It does support adding extra tokens when the tokenizer is first built, but no modification is possible after that step.
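For example, extra tokens can be reserved at build time with the user_defined_symbols option (a minimal sketch; the corpus path, vocab size and symbol names below are placeholders):

import sentencepiece as spm

# New symbols can only be reserved when the tokenizer is trained, via user_defined_symbols.
spm.SentencePieceTrainer.train(
    input="train_text.txt",                # placeholder: plain-text training corpus
    model_prefix="tokenizer_with_extras",  # writes tokenizer_with_extras.model / .vocab
    vocab_size=1024,                       # must match what the model's decoder expects
    user_defined_symbols=["<sep>"],        # the extra tokens to reserve
)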

You could train a new tokenizer with the exact same number of tokens as the base model, so the old decoder weights still load as a good init, but you'll lose the performance of the original model.
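Roughly something like this (a sketch, assuming a BPE-based CTC model; the model name and tokenizer directory are placeholders, and change_vocabulary re-initialises the decoder, which is why the old weights are saved and restored):

import nemo.collections.asr as nemo_asr

# Placeholder model name; the new tokenizer in new_tokenizer_dir/ must have the
# SAME vocab size as the original for the shapes below to line up.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

old_decoder_state = model.decoder.state_dict()  # keep the original decoder weights

model.change_vocabulary(new_tokenizer_dir="new_tokenizer_dir", new_tokenizer_type="bpe")

# The shapes match because the vocab size is unchanged, so this loads cleanly,
# but the token-id-to-weight mapping no longer corresponds to the new tokenizer,
# so treat it purely as a warm start before fine-tuning.
model.decoder.load_state_dict(old_decoder_state)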

titu1994 avatar Aug 08 '22 20:08 titu1994

@evilc3 Yes, it is possible to add new tokens to a SentencePiece tokenizer. Here is some code.

# Installation
# 1. pip install sentencepiece
# 2. pip install protobuf
# 
# 3. To install sentencepiece_model_pb2
#   3.1 git clone https://github.com/google/sentencepiece.git
#   3.2 cd sentencepiece/src
#   3.3 protoc --python_out=. sentencepiece_model.proto

import sys
import sentencepiece_model_pb2 as model
import sentencepiece as spm

if __name__ == "__main__":

    mp = model.ModelProto()
    model_file = sys.argv[1]  # path to the existing SentencePiece .model file
    out_file = sys.argv[2]    # path to write the patched .model file

    # Tokens to add, each with the piece id at which it should be inserted.
    symbols = [{"name": "<sep>",
                "id": 1}]

    with open(model_file, 'rb') as f:
        mp.ParseFromString(f.read())

    print(f'Original model pieces: {len(mp.pieces)}')

    for sym in symbols:
        name = sym["name"]
        i = sym["id"]

        new_sym = mp.SentencePiece()
        new_sym.piece = name
        new_sym.score = 0.0  # default score for USER_DEFINED pieces
        new_sym.type = 4     # 4 == USER_DEFINED in the SentencePiece proto enum
        mp.pieces.insert(i, new_sym)  # insert at the requested id (1 puts it right after "<unk>"); later piece ids shift up by one
        print(f'added {name}...')

    print(f'New model pieces: {len(mp.pieces)}')

    
    with open(out_file, 'wb') as f:
        f.write(mp.SerializeToString())


    text = "<sep> this is a test"
    sp = spm.SentencePieceProcessor(model_file=out_file)
    encoded_ids = sp.encode(text)
    decoded_text = sp.decode(encoded_ids)
    print(f"[text]: {text}")
    print(f"[encoded_ids]: {encoded_ids}")
    print(f"[decoded_text]: {decoded_text}")

naymaraq avatar Aug 10 '22 07:08 naymaraq

Cool hack of the underlying SentencePiece library. Guess I can correct my statement to: "there is no sanctioned, official way offered by SentencePiece to add tokens to a prebuilt tokenizer".

While it is a fun exercise in modifying SentencePiece, please note that NeMo will not support such modifications. You're welcome to write your own wrapping logic though.
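If you do write such wrapping logic, the missing piece on the model side is growing the decoder by one output at the same index where the token was inserted, copying the old rows and randomly initialising the new one, which is essentially what was asked above. A very rough sketch for a BPE-based CTC model (the model name, the decoder layout with decoder_layers[0] as a 1x1 Conv1d, and the insertion index are all assumptions, and this is not an officially supported path):

import torch
import nemo.collections.asr as nemo_asr

# Placeholder model name; any BPE-based CTC model whose decoder is a single 1x1 Conv1d
# (feat_in -> vocab + blank) should look similar.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")

old_conv = model.decoder.decoder_layers[0]  # assumption about the ConvASRDecoder layout
old_w = old_conv.weight.data                # shape: (vocab + blank, feat_in, 1)
old_b = old_conv.bias.data

insert_at = 1  # must match the id used when patching the tokenizer above
new_w_row = old_w.new_empty(1, old_w.shape[1], old_w.shape[2]).normal_(std=old_w.std().item())
new_b_row = old_b.new_zeros(1)

# Build a decoder with one extra output; old rows keep their weights, the new row is random.
new_conv = torch.nn.Conv1d(old_w.shape[1], old_w.shape[0] + 1, kernel_size=1)
new_conv.weight.data = torch.cat([old_w[:insert_at], new_w_row, old_w[insert_at:]], dim=0)
new_conv.bias.data = torch.cat([old_b[:insert_at], new_b_row, old_b[insert_at:]], dim=0)
model.decoder.decoder_layers[0] = new_conv

# The unsupported part: the model's tokenizer, decoder config, CTC loss and WER metric
# still assume the old vocab size and would all need to be updated by hand.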

titu1994 avatar Aug 10 '22 09:08 titu1994

@naymaraq thanks, will try it out.

evilc3 avatar Aug 10 '22 10:08 evilc3