Match regular expression with code comment

Open lpla opened this issue 5 years ago • 3 comments

This is supposed to expect a '<p>' to separate documents, but the regex looks for any tag, creating issues in non-escaped texts.
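To make the mismatch concrete, here is a minimal sketch in Python (not the script's own language). It assumes the permissive check has roughly the shape `^<.+>$`, which is an illustration rather than a quote from the source, and compares it with a hypothetical stricter pattern that accepts only a bare '<p>' line. The function names are made up for this example.

```python
import re

# Assumption: the permissive check treats any line that starts with '<'
# and ends with '>' as a document separator.
ANY_TAG_LINE = re.compile(r"^<.+>$")

# Hypothetical stricter check: only a line that is exactly <p> (or <P>).
P_TAG_LINE = re.compile(r"^<[Pp]>$")


def is_separator_loose(line: str) -> bool:
    """Return True if the line looks like any tag-delimited line."""
    return ANY_TAG_LINE.match(line.strip()) is not None


def is_separator_strict(line: str) -> bool:
    """Return True only if the line is a bare <p> marker."""
    return P_TAG_LINE.match(line.strip()) is not None


if __name__ == "__main__":
    samples = [
        "<p>",
        "<b>Hello world.</b>",
        "<x> is greater than <y>",
        "Plain text.",
    ]
    for line in samples:
        print(f"{line!r}: loose={is_separator_loose(line)} "
              f"strict={is_separator_strict(line)}")
```

Under these assumptions, a non-escaped line such as `<x> is greater than <y>` is treated as a separator by the loose pattern even though it is ordinary text, while the strict pattern only accepts the '<p>' marker.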

lpla, Nov 06 '19 09:11

Ping @phikoehn @hieuhoang this is just a copy from Moses.

kpu, Nov 07 '19 03:11

It seems to be a bug, but it's been in there since 2010, so I'm wary that there might be something else going on. https://github.com/moses-smt/mosesdecoder/commits/master/scripts/ems/support/split-sentences.perl I've never worked on this script; @bhaddow seems to know at least something about it.

hieuhoang, Nov 07 '19 04:11

What @lpla suggests seems reasonable, although the sentence splitter was not designed for texts with HTML markup in them. I don't fully understand the use of '<p>' tags (paragraph marking?); I just remove them.

bhaddow, Nov 07 '19 11:11