
get_tokenizer in utils.py needs to be cleaned up

Open henryj18 opened this issue 2 years ago • 6 comments

    def get_tokenizer(tokenizer_name):
        if "t5" in tokenizer_name:  # rankgen
            tokenizer = T5Tokenizer.from_pretrained(tokenizer_name, truncation_side="left")
        else:
            tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        if "galactica" in tokenizer_name:
            tokenizer.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
        return tokenizer

The line if "galactica" in tokenizer_name: will not be executed; either it should be removed, or the "if ... else" branch logic needs to be rewritten according to the original design.

henryj18 avatar Jan 29 '23 04:01 henryj18

On second thought, it is not accurate to say that the line will never be executed, but the logic still needs to be cleaned up so that only one branch runs.

henryj18 avatar Jan 29 '23 13:01 henryj18

Perhaps just add an "elif"? So "if ... elif ... else"
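A minimal sketch of what that could look like (the special-token values are taken from the existing galactica branch):

    def get_tokenizer(tokenizer_name):
        if "t5" in tokenizer_name:  # rankgen
            tokenizer = T5Tokenizer.from_pretrained(tokenizer_name, truncation_side="left")
        elif "galactica" in tokenizer_name:
            tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
            tokenizer.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
        else:
            tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        return tokenizer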

henryj18 avatar Jan 29 '23 23:01 henryj18

Hey @andreaskoepf and @henryj18, I'm new around here and would like to give this issue a go. I was thinking of implementing some light configuration classes for the different model classes, as well as explicit model types [t5, galactica, etc.], to avoid the magic strings.

Do you have any advice or thoughts on how I should proceed?

ghost avatar Jan 30 '23 18:01 ghost

> Hey @andreaskoepf and @henryj18, I'm new around here and would like to give this issue a go. I was thinking of implementing some light configuration classes for the different model classes, as well as explicit model types [t5, galactica, etc.], to avoid the magic strings.
>
> Do you have any advice or thoughts on how I should proceed?

The original issue is just to fix a flaw in the "if ... else" plus the extra "if ...", so it should be a straightforward change to an "if ... elif ... else".

Could you elaborate a bit on your "light configuration classes" and how they relate to the original issue? Thx

henryj18 avatar Jan 30 '23 18:01 henryj18

Hey, yes sure. Regarding the configuration classes, I mean having data structures of tokenizer configuration, so that new model types could be added through additional configuration entries rather than by changing the logic of the get_tokenizer function. I think this would also solve the 'galactica' branching problem.

instead of

def get_tokenizer(tokenizer_name):
    if "t5" in tokenizer_name:  # rankgen
        tokenizer = T5Tokenizer.from_pretrained(tokenizer_name, truncation_side="left")
    else:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    if "galactica" in tokenizer_name:
        tokenizer.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    return tokenizer

something like

    # dictionaries or dataclasses with defaults
    tokenizer_configs = {
        't5': {'class': T5Tokenizer, 'kwargs': {'truncation_side': 'left'}},
        'galactica': {'class': AutoTokenizer, 'kwargs': {}, 'special_tokens': {"pad_token": "<pad>", "eos_token": "</s>"}},
        ...
    }

    def get_tokenizer(tokenizer_name):
        tokenizer_config = ...  # match tokenizer_name to a known model class
        tokenizer = tokenizer_config['class'].from_pretrained(tokenizer_name, **tokenizer_config['kwargs'])
        if 'special_tokens' in tokenizer_config:  # or hasattr for a dataclass
            tokenizer.add_special_tokens(tokenizer_config['special_tokens'])
        return tokenizer

This change resolves the original branching issue and also has a nice side effect of separating model-specific data from the logic of get_tokenizer.
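For the matching step, one hypothetical option (assuming the same substring matching as the original function, with a plain AutoTokenizer fallback for unknown names; find_tokenizer_config is a new helper, not existing code):

    from transformers import AutoTokenizer

    # hypothetical helper: look up a config by substring match on the name
    def find_tokenizer_config(tokenizer_name):
        for key, config in tokenizer_configs.items():
            if key in tokenizer_name:
                return config
        # unknown model: fall back to a plain AutoTokenizer with no extras
        return {'class': AutoTokenizer, 'kwargs': {}}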

ghost avatar Jan 30 '23 19:01 ghost

Changed account to this username

jackapbutler avatar Feb 01 '23 11:02 jackapbutler

Great job! In retrospect, I suggest we streamline the process of opening/managing GitHub issues in the future. We should avoid opening another issue (in this case 1262) just to fix the exact same problem as an existing one (in this case 990); instead, the original issue (990) should be worked on. That would reduce the complexity of managing GitHub issues and potentially save some repeated work.

henryj18 avatar Feb 12 '23 14:02 henryj18