
Word-level padding vs Character-level padding

Open IstiaqAnsari opened this issue 3 years ago • 0 comments

Hi @helboukkouri,

The maximum number of characters in a word is set to 50, so a word with 5 characters gets padded to 50. For this padding, a value of 260 is used for each padded character position. Then, to make every sentence in a batch the same length, the sentences are padded with words. For example, to pad a sentence of 5 words to length 8, three PAD tokens are added. Each PAD token is also 50 characters long, but here every character gets a padding value of zero. Why are you using two different types of padding?

Another thing: after converting each word to ids, you add 1 to each id (in the file character-bert/utils/character_cnn.py, line 125, in the function `def convert_word_to_char_ids(self, word: str) -> List[int]:`), and the comment says `# +1 one for masking`. What is the reason for adding 1?

Thanks in advance.
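To make the question concrete, here is a minimal sketch of the behaviour I am describing. It is not the repository's code: `word_to_char_ids` and `pad_sentence` are hypothetical, simplified stand-ins, and I am assuming the `+1` shift is applied after the in-word padding (so the in-word padding value ends up as 261). Any special begin/end-of-word markers the real mapper may add are left out.

```python
from typing import List

MAX_WORD_LENGTH = 50  # from the issue: every word is padded to 50 characters
CHAR_PAD_ID = 260     # from the issue: padding value used inside a word
WORD_PAD_ID = 0       # from the issue: value used for whole PAD tokens


def word_to_char_ids(word: str) -> List[int]:
    """Hypothetical, simplified stand-in for convert_word_to_char_ids."""
    encoded = word.encode("utf-8", "ignore")[:MAX_WORD_LENGTH]
    char_ids = [CHAR_PAD_ID] * MAX_WORD_LENGTH
    for k, byte in enumerate(encoded):
        char_ids[k] = byte
    # Assumption: shift every id by one so that 0 never occurs in a real word.
    return [c + 1 for c in char_ids]


def pad_sentence(words: List[str], target_len: int) -> List[List[int]]:
    """Pad a sentence to target_len words using all-zero PAD words."""
    rows = [word_to_char_ids(w) for w in words]
    pad_word = [WORD_PAD_ID] * MAX_WORD_LENGTH
    return rows + [pad_word] * (target_len - len(rows))


# A 5-word sentence padded to 8 words, as in the example above.
batch_row = pad_sentence(["a", "tiny", "test", "sentence", "here"], target_len=8)

# With the shift, a row is a PAD word exactly when all of its ids are 0.
word_mask = [any(c > 0 for c in row) for row in batch_row]
```

In this sketch `word_mask` is `True` for the five real words and `False` for the three PAD words, which is my guess at what "for masking" refers to. Is that the intended purpose?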

IstiaqAnsari · Nov 18 '21 11:11