DocumentTranslator-Legacy

Microsoft Word translation issues with auto-detect source document language

Open jsypkens opened this issue 6 years ago • 3 comments

I'm experiencing two somewhat interrelated issues when translating Word documents:

Issue 1: Text is currently split up roughly as follows. Using the OpenXml library, ProcessWordDocument in DocumentTranslationManager.cs gets a list of text elements from the descendants of the document body. Because of the markup in the document, this often yields a list of very short phrases, often individual words: spans and formatting split sentences into separate elements in the list. As a result, the translated words and phrases lose some of their meaning, because they have lost the context of the words and phrases surrounding them.

Issue 2: The auto-detect feature of TranslateArray appears to be applied to the array as a whole, rather than to its individual elements. Combined with issue 1, and particularly apparent with German documents, this causes entire sections of the document to be incorrectly auto-detected as English, and the translation fails. I can find 3 potential reasons for this:

  • There's a mix of English and German words.
  • German words sometimes auto-detect incorrectly (example: Auftragsbestätigungsschreiben, Abschlussprüfung; both of these words detect as English, which is also apparent when translating with Bing Translator).
  • TranslateArray applies a single detected language to the entire array.

So - a few questions:

  1. Do you have any suggestions on how to better combine document parts into full sentences or phrases to improve the quality of the translation & language detection?
  2. Can you explain how the auto-detect process works with TranslateArray? When looking at DetectArray, it's detecting each element of the array. Experimentally, it appears that TranslateArray is performing the detection on all elements, and applying the language that's the most common to all elements, rather than translating each element with its own detected language. Has Microsoft considered having TranslateArray apply the detected language on each element of the array?

We could potentially explore performing the "DetectArray" on all elements of the document, and then grouping them by language, and translating them together that way - but that won't resolve the issue with failed detection on words like Auftragsbestätigungsschreiben.
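The grouping idea can be sketched roughly as follows. This is a language-agnostic illustration in Python, not code from the project: `detect_array` and `translate_array` are hypothetical callables standing in for the DetectArray and TranslateArray calls, and the stubs at the bottom only fake their behavior so the flow can be followed end to end.

```python
from collections import defaultdict


def translate_grouped_by_language(texts, target, detect_array, translate_array):
    """Detect each element, group indices by language, translate group by group."""
    detected = detect_array(texts)  # one language code per element

    # Map each detected language to the positions of its elements.
    groups = defaultdict(list)
    for index, lang in enumerate(detected):
        groups[lang].append(index)

    # Translate each group with an explicit source language, then put the
    # results back at their original positions.
    result = list(texts)
    for lang, indices in groups.items():
        translated = translate_array([texts[i] for i in indices], lang, target)
        for i, text in zip(indices, translated):
            result[i] = text
    return result


# --- Hypothetical stand-ins for the real service calls (assumptions) ---
FAKE_LANGUAGES = {"Guten Morgen": "de", "Hello": "en", "Danke": "de"}


def fake_detect_array(texts):
    return [FAKE_LANGUAGES.get(t, "en") for t in texts]


def fake_translate_array(texts, source, target):
    return [f"[{source}->{target}] {t}" for t in texts]


if __name__ == "__main__":
    print(translate_grouped_by_language(
        ["Guten Morgen", "Hello", "Danke"], "fr",
        fake_detect_array, fake_translate_array))
```

As noted above, this only helps when per-element detection is right; a word that detects as the wrong language still ends up in the wrong group.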

jsypkens • Apr 03 '18 18:04

Issue 1 could be solved by extracting paragraphs instead of individual text elements, then merging all the text elements of each paragraph into a single string. You can split the paragraph further if you would rather detect the source language and translate by sentence.
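A minimal sketch of that merge, in Python with only the standard library rather than the project's C# and OpenXml SDK. It assumes a hand-written WordprocessingML fragment (`SAMPLE_BODY`), and both the function names and the naive regex sentence splitter are illustrative choices, not part of the tool.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p (paragraph) and w:t (text).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample: the bold word splits one sentence into three runs.
SAMPLE_BODY = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t xml:space="preserve">Das ist ein </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>wichtiger</w:t></w:r>
    <w:r><w:t xml:space="preserve"> Satz. Hier folgt noch einer.</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>Zweiter Absatz.</w:t></w:r>
  </w:p>
</w:body>
"""


def paragraph_texts(body_xml):
    """Return one merged string per non-empty paragraph."""
    body = ET.fromstring(body_xml.strip())
    paragraphs = []
    for p in body.iter(f"{{{W}}}p"):
        # Concatenate every text element below the paragraph, in document order.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


if __name__ == "__main__":
    for paragraph in paragraph_texts(SAMPLE_BODY):
        print(split_sentences(paragraph))
```

The sketch does not handle paragraphs nested inside other paragraphs (for example in text boxes), and the splitter will mis-split abbreviations; a real implementation would need something more careful on both counts.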

The downside is that all formatting is lost. Perhaps the TranslateArray2 method, which returns alignment information, could be used to restore the original formatting in the translated string.
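To illustrate how alignment information might be used to carry formatting over, here is a rough Python sketch. It assumes the alignment arrives as a string of `srcStart:srcEnd-tgtStart:tgtEnd` pairs with inclusive character indices; the sample alignment below is written by hand, not real service output, and `parse_alignment` and `project_span` are hypothetical helper names.

```python
def parse_alignment(alignment):
    """Parse 'a:b-c:d a:b-c:d ...' into [((a, b), (c, d)), ...]."""
    pairs = []
    for token in alignment.split():
        src, tgt = token.split("-")
        s0, s1 = (int(n) for n in src.split(":"))
        t0, t1 = (int(n) for n in tgt.split(":"))
        pairs.append(((s0, s1), (t0, t1)))
    return pairs


def project_span(pairs, start, end):
    """Map an inclusive source character span to the covering target span.

    Returns None when no aligned source segment overlaps the span.
    """
    hits = [tgt for (s0, s1), tgt in pairs if s0 <= end and s1 >= start]
    if not hits:
        return None
    return min(t0 for t0, _ in hits), max(t1 for _, t1 in hits)


if __name__ == "__main__":
    source = "Das ist wichtig"
    target = "This is important"
    alignment = "0:2-0:3 4:6-5:6 8:14-8:16"  # hand-written for this example

    pairs = parse_alignment(alignment)
    # Suppose "wichtig" (source characters 8..14) was bold in the original.
    t0, t1 = project_span(pairs, 8, 14)
    print(target[t0:t1 + 1])
```

Taking the minimum and maximum of the overlapping target spans is a simplification: when word order changes, a formatted run can map to non-contiguous target text, which this sketch would over-cover.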

TGuiMel • Apr 03 '18 22:04

Right now TranslateArray2 doesn't support neural network translation, which puts a slight damper on that approach.

For issue 2: yesterday I modified the detection as described above. It now processes the entire document first with "DetectArray", then filters out any array elements where the detected language matches the target language for the translation. Then it passes only the remaining elements through the translation, and updates the documents with the translations for those elements.
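In outline, the flow described above looks something like the following Python sketch. It is not the actual change: `detect_array` and `translate_array` are hypothetical callables standing in for the service calls, and the stubs exist only to make the example runnable.

```python
def translate_only_foreign(texts, target, detect_array, translate_array):
    """Skip elements already in the target language; translate the rest."""
    detected = detect_array(texts)  # one language code per element

    # Keep only the positions whose detected language differs from the target.
    pending = [i for i, lang in enumerate(detected) if lang != target]

    result = list(texts)
    if pending:
        translated = translate_array([texts[i] for i in pending], target)
        # Write each translation back to the position it came from.
        for i, text in zip(pending, translated):
            result[i] = text
    return result


# --- Hypothetical stand-ins for the real service calls (assumptions) ---
FAKE_LANGUAGES = {"Sehr geehrte Damen und Herren": "de", "Invoice": "en"}


def fake_detect_array(texts):
    return [FAKE_LANGUAGES.get(t, "de") for t in texts]


def fake_translate_array(texts, target):
    return [f"[{target}] {t}" for t in texts]


if __name__ == "__main__":
    print(translate_only_foreign(
        ["Sehr geehrte Damen und Herren", "Invoice", "Vielen Dank"], "en",
        fake_detect_array, fake_translate_array))
```

Removing the elements that already match the target language means they can no longer pull the array-wide detection toward the wrong language for the elements that remain.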

It seems like a decent approach, and anecdotally it appears to have resolved my issue with this one German document. We need to see how it performs on a wider variety of documents to determine if it's a scalable approach.

jsypkens • Apr 04 '18 14:04

@jsypkens, did your approach succeed? Do you want to make a pull request with your change?

chriswendt1 • Jul 14 '19 05:07