deepl-python

translate_text corrupts HTML

Open pbtsrc opened this issue 2 years ago • 3 comments

text=

<html>
<body>
  <div>
    <a href="01.html">Chapter I. Margaret Makes Herself at Home</a>
  </div>
  <div>
    <a href="02.html">Chapter II. Stephen's Life Goes On</a>
  </div>
</body>
</html>

Calling translate_text(text, source_lang='EN', target_lang='DE', tag_handling='html') on the text above returns:

<html>
<body>
 <div>
  <a href="01.html">Kapitel I. Margaret macht es sich gemüt</a>lich  </div>
 <div>
  <a href="02.html">Kapitel II. Stephens Leben geht</a>weiter  </div>
</body>
</html>

As you can see, the end of each link text (lich, weiter) has been pushed outside its <a> element. With tag_handling='xml', everything works as expected:

<html>
<body>
  <div>
    <a href="01.html">Kapitel I. Margaret macht es sich gemütlich</a>
  </div>
  <div>
    <a href="02.html">Kapitel II. Stephens Leben geht weiter</a>
  </div>
</body>
</html>

Replacing <div> with <p> also avoids the issue.
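
For anyone who wants to check translated output for this kind of damage automatically, here is a minimal sketch using only the standard library. It is not part of deepl-python; the names `StrayTextFinder` and `find_stray_text` are hypothetical. It assumes that in well-formed output a <div> holds its text inside child elements, so any non-whitespace text sitting directly under a <div> is treated as leaked.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag and must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class StrayTextFinder(HTMLParser):
    """Collect non-whitespace text whose direct parent is the container tag."""

    def __init__(self, container="div"):
        super().__init__()
        self.container = container
        self.stack = []  # currently open tags, innermost last
        self.stray = []  # text found directly inside the container

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore unmatched closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] == self.container:
            self.stray.append(text)


def find_stray_text(html, container="div"):
    """Return the text fragments that sit directly under `container`."""
    finder = StrayTextFinder(container)
    finder.feed(html)
    finder.close()
    return finder.stray
```

Applied to the tag_handling='html' output above, this reports the two fragments that fell out of the links; applied to the tag_handling='xml' output, it reports nothing. The heuristic only fits documents like this one, where the container is not expected to carry text of its own.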

pbtsrc, Apr 28 '23 10:04

Another example. text=

<p>1-<i>London, Paris</i></p>

translate_text returns:

<p>1-London<i>, Paris</i></p>

The result is the same with tag_handling='html' and tag_handling='xml'.
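
To make the comparison between modes reproducible, a small sketch along these lines may help. The helpers `compare_tag_handling` and `deepl_translate` are hypothetical names, not library functions; the only deepl-python calls used are `deepl.Translator(auth_key)` and `translate_text(...)` with the parameters already shown in this issue. The auth key is a placeholder you would have to supply.

```python
def compare_tag_handling(translate, text, modes=("html", "xml")):
    """Translate `text` once per tag_handling mode and compare the outputs.

    `translate` is any callable taking (text, tag_handling) and returning
    the translated string. Returns (outputs_by_mode, all_identical).
    """
    results = {mode: translate(text, mode) for mode in modes}
    identical = len(set(results.values())) == 1
    return results, identical


def deepl_translate(auth_key):
    """Build a `translate` callable backed by deepl-python (needs a valid key)."""
    import deepl  # imported lazily so compare_tag_handling works without it

    translator = deepl.Translator(auth_key)

    def translate(text, tag_handling):
        result = translator.translate_text(
            text,
            source_lang="EN",
            target_lang="DE",
            tag_handling=tag_handling,
        )
        return result.text

    return translate
```

With a real key, `compare_tag_handling(deepl_translate(auth_key), "<p>1-<i>London, Paris</i></p>")` would return both outputs plus a flag saying whether they match, which is the observation reported here.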

pbtsrc, Apr 28 '23 18:04

@pbtsrc By chance, are you using both tag_handling and preserve_formatting parameters?

seekuehe, Jun 09 '23 13:06

No, I did not use preserve_formatting. I tried to add this parameter, but it did not change anything.

pbtsrc, Jun 09 '23 16:06