Open-Assistant
get_tokenizer in utils.py needs to be cleaned up
def get_tokenizer(tokenizer_name):
    if "t5" in tokenizer_name:  # rankgen
        tokenizer = T5Tokenizer.from_pretrained(tokenizer_name, truncation_side="left")
    else:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    if "galactica" in tokenizer_name:
        tokenizer.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    return tokenizer
The line if "galactica" in tokenizer_name: sits outside the if/else, so either it should be removed, or the branch logic needs to be rewritten to match the original design.
On second thought, it is not accurate to say that line will never be executed, but the logic still needs to be cleaned up so that only one branch runs.
Perhaps just add an "elif"? So "if ... elif ... else".
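A minimal sketch of that if/elif/else restructuring. The stub classes and the helper name select_tokenizer_setup are illustrative stand-ins (the real function would call from_pretrained on the transformers classes); they just make the branch selection easy to exercise without downloading a model:

```python
# Stubs standing in for transformers' T5Tokenizer / AutoTokenizer,
# so the branching can be illustrated without loading real models.
class T5Tokenizer:
    pass

class AutoTokenizer:
    pass

def select_tokenizer_setup(tokenizer_name):
    """Return (tokenizer_class, kwargs, special_tokens) for a model name.

    Mirrors the proposed if/elif/else: exactly one branch runs.
    """
    if "t5" in tokenizer_name:  # rankgen
        return T5Tokenizer, {"truncation_side": "left"}, None
    elif "galactica" in tokenizer_name:
        return AutoTokenizer, {}, {"pad_token": "<pad>", "eos_token": "</s>"}
    else:
        return AutoTokenizer, {}, None
```

With this shape, the galactica special tokens are attached in the same branch that picks its tokenizer class, so no name can fall into two branches.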
Hey @andreaskoepf and @henryj18, I'm new around here and would like to give this issue a go. I was thinking of implementing some light configuration classes for the different model classes, as well as explicit model types [t5, galactica, etc.] to avoid the magic strings.
Do you have any advice or thoughts on how I should proceed?
The original issue is just to fix a flaw in the "if ... else" plus the extra "if ...", so it should be a straightforward change to "if ... elif ... else".
Could you elaborate a bit on your "light configuration classes" and how they relate to the original issue? Thx
Hey, yes, sure. Regarding the configuration classes, I mean having data structures of tokenizer configuration, so that new model types could be added through additional configuration entries rather than by changing the logic of the get_tokenizer function. I think this would also solve the 'galactica' branching problem.
instead of
def get_tokenizer(tokenizer_name):
    if "t5" in tokenizer_name:  # rankgen
        tokenizer = T5Tokenizer.from_pretrained(tokenizer_name, truncation_side="left")
    else:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    if "galactica" in tokenizer_name:
        tokenizer.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    return tokenizer
something like
# dictionaries or dataclasses with defaults
tokeniser_configs = {
    't5': {'class': T5Tokenizer, 'kwargs': {'truncation_side': 'left'}},
    'galactica': {'class': AutoTokenizer, 'kwargs': {}, 'special_tokens': {"pad_token": "<pad>", "eos_token": "</s>"}},
    ...
}

def get_tokenizer(tokenizer_name):
    tokenizer_config = ...  # match tokenizer_name to known model class
    tokenizer = tokenizer_config['class'].from_pretrained(tokenizer_name, **tokenizer_config['kwargs'])
    if "special_tokens" in tokenizer_config:  # or hasattr for dataclass
        tokenizer.add_special_tokens(tokenizer_config['special_tokens'])
    return tokenizer
This change resolves the original branching issue and also has the nice side effect of separating model-specific data from the logic of get_tokenizer.
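Here is a runnable sketch of that config-driven approach, under stated assumptions: FakeTokenizer and its subclasses are hypothetical stand-ins for the transformers classes (so no model download is needed), and the substring-match dispatch is one possible way to fill in the elided matching step:

```python
# Hypothetical stand-ins for the transformers tokenizer classes, so the
# config-driven dispatch can run without downloading a model.
class FakeTokenizer:
    def __init__(self):
        self.special_tokens = {}

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        tok = cls()
        tok.name, tok.kwargs = name, kwargs
        return tok

    def add_special_tokens(self, tokens):
        self.special_tokens.update(tokens)

class T5Tokenizer(FakeTokenizer):
    pass

class AutoTokenizer(FakeTokenizer):
    pass

# Data-driven config as proposed above: adding a model type means adding
# an entry here, not editing get_tokenizer itself.
tokenizer_configs = {
    "t5": {"class": T5Tokenizer, "kwargs": {"truncation_side": "left"}},
    "galactica": {"class": AutoTokenizer, "kwargs": {},
                  "special_tokens": {"pad_token": "<pad>", "eos_token": "</s>"}},
}
DEFAULT_CONFIG = {"class": AutoTokenizer, "kwargs": {}}

def get_tokenizer(tokenizer_name):
    # Substring match against known model types, falling back to AutoTokenizer.
    config = next((cfg for key, cfg in tokenizer_configs.items()
                   if key in tokenizer_name), DEFAULT_CONFIG)
    tokenizer = config["class"].from_pretrained(tokenizer_name, **config["kwargs"])
    if "special_tokens" in config:
        tokenizer.add_special_tokens(config["special_tokens"])
    return tokenizer
```

For example, get_tokenizer("facebook/galactica-1.3b") would pick AutoTokenizer and attach the pad/eos special tokens, while any unrecognized name falls through to the AutoTokenizer default with no special tokens.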
Changed account to this username
Great job! In retrospect, I suggest we streamline how we open and manage GitHub issues in the future. We should avoid opening a new issue (in this case 1262) just to fix the exact same problem as an existing one (in this case 990); instead, the original issue (990) should be worked on. That would reduce the complexity of managing GitHub issues and potentially save some repeated work.