ml-mdm Create test cases for the Tokenizer

We should create a test case that loads a file, creates a tokenizer https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py and then we assert that the tokenizer was created

Nov 01 '24 14:11 luke-carlson

working on this with @pedroborgescruz

Nov 02 '24 15:11 aaliyahnl

Hi, @luke-carlson! We started working on this issue today. We made progress, but would like to ask you some questions before continuing:

(1) What is the difference between a .vocab file (as defined in the parameter of https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/factory.py#124) and a token_file (as defined in the parameter of the functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#45)? (2) We noticed there are different "modes" for each of the tokenizer functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py. For the function read_dictionary(token_file) in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#78, would it expect to receive any .vocab file that's not either of type BERT or t5, e.g. imagenet.vocab? (3) We have started to write our tests for the tokenizer file. However, we are unsure of how to run the assert tests. We thought of two different cases and were wondering if you could shed some light. Which case below, if any, do you think would be the appropriate way to assert if the tokenizer was created?

CASE 1: We compare factory.py's create_tokenizer() to tokenizer.py's output. Ex.: assert create_tokenizer() == read_dictionary_bert(). In this case, we assume token_file and vocab_file refer to the same file. Consequently, we can compare their outputs for the final assert.
CASE 2: We run the output of factory, and then pass its output as the token file to tokenizer.py's functions. Then, we need something else to assert against. Ex.: assert read_dictionary_bert(create_tokenizer(.vocab)) == ? In this case, we assume that token_file and vocab_file do not refer to the same file. In this case, we are not sure what to compare for the final assert.

Nov 07 '24 21:11 pedroborgescruz

Hey Pedro

looks like we ended up using two different names to refer to the same thing — we should probably rename them to both use vocab_file as the variable so that it matches the cli.
just those vocab options for now
the first case is a reasonable one to make, that'll already increase code coverage of the tokenizer.py file

Nov 13 '24 13:11 luke-carlson

PR: https://github.com/apple/ml-mdm/pull/41 @luke-carlson

Nov 16 '24 19:11 pedroborgescruz

ml-mdm ml-mdm copied to clipboard

Create test cases for the Tokenizer

ml-mdm
ml-mdm copied to clipboard