Opus-MT icon indicating copy to clipboard operation
Opus-MT copied to clipboard

Support multiline text.

Open andchir opened this issue 4 years ago • 4 comments

andchir avatar Mar 27 '22 18:03 andchir

I am not sure if this patch is a good thing. What happens if sentences run over one line and we don't want to split on newlines? Is that a problem?

jorgtied avatar Apr 07 '22 13:04 jorgtied

@jorgtied This is the code before my changes:

sentSource = self.sentence_splitter([self.normalizer(srctxt)])

As you can see, the string turns into a list. That is, in any case, a list is used. I tested and got just the same result as if I just cut out the newline ("\r"). I don't know how better, you can just remove the newline ("\r"). But then you will need to deal with extra spaces. I think using a list is more reliable. In my code there is a filtering of empty lines in a list.

andchir avatar Apr 07 '22 19:04 andchir

Hello, interesting feature however I was wondering if it is not safer to do something like this ,

normalized_text = '\n'.join(self.normalizer(line) for line in src_txt.split('\n'))   # normalizer do not accept '\n'
sentSource = self.sentence_splitter([normalized_text])

Since the splitter accepts \n and it might be a valuable information during sentence splitting , also it will mitigate the problem that mention @jorgtied.

With this approach "A very long line for testing\n that run over two lines." will be only one sentence.

ml-addastra-io avatar Apr 28 '22 22:04 ml-addastra-io

I pushed a solution similar to @ml-addastra-io 's suggestion. I hope it doesn't break anything and fixes the multiline problem. Could you confirm that this solves the issue? Thanks!

jorgtied avatar May 19 '22 18:05 jorgtied