character-bert
Word-level padding vs Character-level padding
Hi @helboukkouri
The maximum number of characters per word is set to 50, so a word with 5 characters gets padded to a length of 50.
For this character-level padding, each padded position is filled with the value 260.
Then, to make every sentence in a batch the same length, the sentences are padded with whole words. For example, to pad a sentence of 5 words to length 8, three PAD tokens are added. Each PAD token is also 50 characters long, but here every character position gets a padding value of ZERO.
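To make the question concrete, here is a minimal sketch of the two kinds of padding as I understand them. It is not the repository's code: the helper names `pad_word` and `pad_sentence` are hypothetical, the character ids are plain UTF-8 bytes, and any begin/end-of-word markers the real implementation may add are left out.

```python
from typing import List

MAX_WORD_LENGTH = 50  # maximum characters per word, as described above
CHAR_PAD_ID = 260     # fills unused character slots inside a real word
WORD_PAD_ID = 0       # fills every slot of a word-level PAD token


def pad_word(char_ids: List[int]) -> List[int]:
    """Character-level padding (hypothetical helper): extend a real word to 50 ids."""
    ids = char_ids[:MAX_WORD_LENGTH]
    return ids + [CHAR_PAD_ID] * (MAX_WORD_LENGTH - len(ids))


def pad_sentence(words: List[List[int]], target_len: int) -> List[List[int]]:
    """Word-level padding (hypothetical helper): append all-zero PAD words."""
    n_missing = max(0, target_len - len(words))
    return words + [[WORD_PAD_ID] * MAX_WORD_LENGTH for _ in range(n_missing)]


# A sentence of 5 words, each 5 characters long, padded to 8 words.
sentence = ["hello", "world", "these", "words", "match"]
char_ids = [pad_word(list(w.encode("utf-8"))) for w in sentence]
padded = pad_sentence(char_ids, target_len=8)

print(len(padded), len(padded[0]))  # number of words, characters per word
print(padded[0][:7])                # real characters followed by 260s
print(padded[5][:7])                # a PAD word: all zeros
```

So a real word ends in a run of 260s, while a PAD word is 50 zeros.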
Why are you using two different types of padding?
Another thing: after converting each word to ids, you add 1 to each id. This happens in the file character-bert/utils/character_cnn.py, line 125, in the function def convert_word_to_char_ids(self, word: str) -> List[int]:, and the comment there says # +1 one for masking.
What is the reason for adding 1?
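Here is my current guess at what the shift is for, written as a small sketch. This is only an assumption on my side, not something I found confirmed in the repository: if word-level PAD tokens are all zeros, then shifting every real id up by one would guarantee that 0 never appears inside a real word, so a mask could be computed by checking for non-zero ids. The helper `shift_for_masking` and the mask rule below are hypothetical.

```python
from typing import List

MAX_WORD_LENGTH = 50
CHAR_PAD_ID = 260  # padding value before the shift, as described above


def shift_for_masking(char_ids: List[int]) -> List[int]:
    # Hypothetical helper mirroring the "+1" step: every id moves up by one.
    return [c + 1 for c in char_ids]


# A real word whose raw ids include 0, padded to the full word length.
raw_word = [0, 104, 105] + [CHAR_PAD_ID] * (MAX_WORD_LENGTH - 3)
shifted_word = shift_for_masking(raw_word)

# A word-level PAD token is not shifted and stays all zeros.
pad_token = [0] * MAX_WORD_LENGTH

# Assumed mask rule: a word counts as real if any of its ids is non-zero.
mask = [any(c != 0 for c in w) for w in (shifted_word, pad_token)]
print(mask)
```

If this reading is right, the character-level padding value would become 261 after the shift, and 0 would be reserved for PAD words only. Is that the intended behaviour?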
Thanks in advance.