ml-mdm
ml-mdm copied to clipboard
Create test cases for the Tokenizer
We should create a test case that loads a file, creates a tokenizer https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py and then we assert that the tokenizer was created
working on this with @pedroborgescruz
Hi, @luke-carlson! We started working on this issue today. We made progress, but would like to ask you some questions before continuing:
(1) What is the difference between a .vocab file (as defined in the parameter of https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/factory.py#124) and a token_file (as defined in the parameter of the functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#45)? (2) We noticed there are different "modes" for each of the tokenizer functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py. For the function read_dictionary(token_file) in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#78, would it expect to receive any .vocab file that's not either of type BERT or t5, e.g. imagenet.vocab? (3) We have started to write our tests for the tokenizer file. However, we are unsure of how to run the assert tests. We thought of two different cases and were wondering if you could shed some light. Which case below, if any, do you think would be the appropriate way to assert if the tokenizer was created?
-
CASE 1: We compare factory.py's create_tokenizer() to tokenizer.py's output. Ex.: assert create_tokenizer() == read_dictionary_bert(). In this case, we assume token_file and vocab_file refer to the same file. Consequently, we can compare their outputs for the final assert.
-
CASE 2: We run the output of factory, and then pass its output as the token file to tokenizer.py's functions. Then, we need something else to assert against. Ex.: assert read_dictionary_bert(create_tokenizer(.vocab)) == ? In this case, we assume that token_file and vocab_file do not refer to the same file. In this case, we are not sure what to compare for the final assert.
Hey Pedro
- looks like we ended up using two different names to refer to the same thing — we should probably rename them to both use
vocab_fileas the variable so that it matches the cli. - just those vocab options for now
- the first case is a reasonable one to make, that'll already increase code coverage of the tokenizer.py file
PR: https://github.com/apple/ml-mdm/pull/41 @luke-carlson