
How to send pre-tokenized text (multiple paragraphs, each with multiple sentences) to UDPipe for tagging and parsing?

Open fishfree opened this issue 6 months ago • 1 comments

The UDPipe Chinese model is very poor at tokenization. I need to manually separate a document into multiple paragraphs, then iteratively separate each paragraph into multiple sentences, then iteratively tokenize each sentence with jiebaR. After that I need to feed the result into udpipe for tagging and parsing. I read the official documentation and tried many things, but had no luck. I'm not familiar with R.

Many thanks!

fishfree avatar Jul 07 '25 23:07 fishfree

If your text is already tokenized: see https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
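
As a rough illustration of the approach described in that vignette, the sketch below passes pre-tokenized text to `udpipe_annotate()` with `tokenizer = "vertical"`, where each token sits on its own line and an empty line ends a sentence. The example tokens, the `doc1_par%d` naming scheme, and the helpers `to_vertical()` and `parse_pretokenized()` are hypothetical and not part of udpipe. Treating each paragraph as its own input string is an assumption made here so that `doc_id` records the paragraph.

```r
# Sketch: feed pre-tokenized text to udpipe using the "vertical" input format.

# Turn a list of sentences (each a character vector of tokens) into one string:
# one token per line, sentences separated by an empty line.
to_vertical <- function(sentences) {
  lines <- vapply(sentences, paste, character(1), collapse = "\n")
  paste(lines, collapse = "\n\n")
}

# Hypothetical document: a list of paragraphs, each a list of sentences,
# each sentence a character vector of tokens (e.g. produced with jiebaR).
doc <- list(
  list(c("我", "喜欢", "自然", "语言", "处理", "。"),
       c("它", "很", "有趣", "。")),
  list(c("今天", "天气", "很", "好", "。"))
)

# One input string per paragraph; the names become the doc_id values.
txt <- vapply(doc, to_vertical, character(1))
names(txt) <- sprintf("doc1_par%d", seq_along(txt))

# Tag and parse without re-tokenizing. model_file is the path to a
# downloaded .udpipe model file (assumed to exist on disk).
parse_pretokenized <- function(txt, model_file) {
  model <- udpipe::udpipe_load_model(model_file)
  anno <- udpipe::udpipe_annotate(model, x = txt, doc_id = names(txt),
                                  tokenizer = "vertical")
  as.data.frame(anno)
}
```

Assuming a model was fetched with something like `udpipe_download_model(language = "chinese-gsd")`, its `file_model` entry could be passed as `model_file`. The returned data frame should then contain one row per supplied token, with columns such as `doc_id`, `sentence_id`, `token`, `upos`, `head_token_id`, and `dep_rel`.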


jwijffels avatar Jul 14 '25 21:07 jwijffels