POINTER about constructing data

Open NLPCode opened this issue 4 years ago • 3 comments

I think there is an error at line 444 of generate_training_data.py. It should be: tokens = tokenizer.tokenize(line)

Dec 01 '20 02:12 NLPCode

Thanks for pointing this out! We will check it.

Dec 02 '20 07:12 dreasysnail

Thanks for pointing this out. Because the POS part requires words rather than subwords, we take a shortcut here and use split instead of the tokenizer, which avoids having to match indices between words and subwords. We will try to correct this in a later version.

Dec 17 '20 05:12 guoyinwang

@dreasysnail @guoyinwang Suppose we split the text into words when preparing the training data, but during training I want to encode the text with subwords. How do we align the text pairs for training?

Jun 09 '21 04:06 etrigger
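
For illustration only, here is a minimal sketch of one common way to match indices between words and subwords: split on whitespace first, tokenize each word separately, and record which word every subword came from. This is not taken from generate_training_data.py. The toy vocabulary, the toy_wordpiece function, and the helper names are all hypothetical stand-ins; in practice the per-word call would be the model's own tokenizer.

```python
# Hypothetical toy vocabulary; a real setup would use the model's own tokenizer.
TOY_VOCAB = {"play", "##ing", "the", "guitar", "##s"}


def toy_wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word becomes unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def align_words_to_subwords(line, tokenize_word=toy_wordpiece):
    """Split on whitespace, then tokenize word by word.

    Returns the words, the flat subword list, and for every subword
    the index of the word it came from.
    """
    words = line.split()
    subwords = []
    word_ids = []
    for idx, word in enumerate(words):
        pieces = tokenize_word(word)
        subwords.extend(pieces)
        word_ids.extend([idx] * len(pieces))
    return words, subwords, word_ids


def project_word_labels(word_labels, word_ids):
    """Copy each word-level label (e.g. a POS tag) onto its subwords."""
    return [word_labels[i] for i in word_ids]


words, subwords, word_ids = align_words_to_subwords("playing the guitars")
pos_per_subword = project_word_labels(["VERB", "DET", "NOUN"], word_ids)
print(subwords)         # ['play', '##ing', 'the', 'guitar', '##s']
print(word_ids)         # [0, 0, 1, 2, 2]
print(pos_per_subword)  # ['VERB', 'VERB', 'DET', 'NOUN', 'NOUN']
```

With word_ids in hand, any word-level annotation produced during data preparation can be carried over to the subword sequence used for training. Whether every subword should inherit its word's label, or only the first one, is a design choice this sketch does not settle.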