about constructing data

Open NLPCode opened this issue 4 years ago • 3 comments

I think there is an error at line 444 of generate_training_data.py. It should be: tokens = tokenizer.tokenize(line)

NLPCode avatar Dec 01 '20 02:12 NLPCode

Thanks for pointing this out! We will check it.

dreasysnail avatar Dec 02 '20 07:12 dreasysnail

Thanks for pointing this out. Because the POS part requires words rather than subwords, we take a shortcut here and use split instead of the tokenizer, which avoids having to match indices between words and subwords. We will try to correct this in a later version.

guoyinwang avatar Dec 17 '20 05:12 guoyinwang

@dreasysnail @guoyinwang Suppose we split the text by WORD when preparing the training data, but during training I want to encode the text with subwords. How do we align the text pairs for training?

etrigger avatar Jun 09 '21 04:06 etrigger
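
One possible way to handle the alignment asked about above is to tokenize each whitespace-split word separately and record which word every subword came from. This is only a sketch and not the repository's method: `align_words_to_subwords` and `toy_subword_tokenize` are hypothetical names, the toy tokenizer is a stand-in for a real subword tokenizer, and the POS-style labels are made-up example data.

```python
from typing import Callable, List, Tuple


def toy_subword_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer.

    Cuts a word into chunks of at most `max_len` characters and marks
    continuation pieces with a WordPiece-style "##" prefix.
    """
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_words_to_subwords(
    line: str,
    word_labels: List[str],
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[List[str], List[int], List[str]]:
    """Map word-level labels onto subword tokens.

    Returns the subword tokens, the index of the source word for each
    subword, and the word label repeated for each of its subwords.
    """
    words = line.split()  # same word-level split used during data preparation
    if len(words) != len(word_labels):
        raise ValueError("need exactly one label per whitespace-split word")

    subwords: List[str] = []
    word_ids: List[int] = []
    subword_labels: List[str] = []
    for idx, (word, label) in enumerate(zip(words, word_labels)):
        pieces = tokenize_word(word)
        subwords.extend(pieces)
        word_ids.extend([idx] * len(pieces))
        subword_labels.extend([label] * len(pieces))
    return subwords, word_ids, subword_labels


# Made-up example sentence and labels, for illustration only.
subwords, word_ids, labels = align_words_to_subwords(
    "tokenization is fun", ["NOUN", "VERB", "ADJ"], toy_subword_tokenize
)
print(subwords)
print(word_ids)
print(labels)
```

With a real tokenizer, its per-word `tokenize` call could be passed as `tokenize_word`. Note that some subword tokenizers produce different pieces when a word is tokenized on its own than when it appears inside a full line, so the result should be checked against the encoding actually used in training.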