preprocess
Match regular expression with code comment
According to the code comment, this check is supposed to expect a '<p>' to separate documents, but the regex matches any tag, which causes issues in non-escaped texts.
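To make the mismatch concrete, here is a minimal sketch in Python (not the Perl from the script itself). Both patterns are assumptions for illustration: `BROAD` approximates a check that accepts any line wrapped in angle brackets, and `STRICT` is a hypothetical alternative that accepts only a bare `<p>`. The helper name `is_boundary` is also made up for this example.

```python
import re

# Assumption: approximates a check that treats any "<...>" line as a boundary.
BROAD = re.compile(r"^<.+>$")
# Hypothetical stricter check: only a bare <p> on its own line (case-insensitive).
STRICT = re.compile(r"^<p>$", re.IGNORECASE)


def is_boundary(line, pattern):
    """Return True if the line is blank or matches the boundary pattern."""
    stripped = line.rstrip("\n")
    return stripped.strip() == "" or pattern.match(stripped) is not None


if __name__ == "__main__":
    samples = ["<p>", "<b>Hello.</b>", "<a> is less than <b>", "Plain text."]
    for s in samples:
        # Show how each pattern classifies the same line.
        print(repr(s), is_boundary(s, BROAD), is_boundary(s, STRICT))
```

With the broad pattern, a line of real content such as `<b>Hello.</b>` is classified as a document separator, which is the kind of problem non-escaped texts would run into. The strict pattern only reacts to `<p>` and blank lines.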
Ping @phikoehn @hieuhoang: this is just a copy from Moses.
It seems to be a bug, but it has been in there since 2010, so I'm wary that something else might be going on. https://github.com/moses-smt/mosesdecoder/commits/master/scripts/ems/support/split-sentences.perl I've never worked on this script; @bhaddow seems to know at least something about it.
What @lpla suggests seems reasonable, although the sentence splitter was not designed for texts with HTML markup in them. I don't fully understand the use of <p> tags (paragraph marking?); I just remove them.