XML tags with attributes change the order of the translation when tagHandling is XML but are OK when tagHandling is off

Open imanirajdoost opened this issue 2 years ago • 5 comments

Describe the bug I am using the DeepL API to translate text that contains XML tags, and some of these tags include custom attributes, for example:

That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time you spend in a module OR the time you have left to complete a challenge!

However, the tag structure is not preserved when the text is translated into Slovenian or Italian (I have not tested other languages, but they may be affected as well). The result looks like this:

Slovenian: To je časovnik <fontcolor="#007af2"></fontcolor> ! Meri čas, ki ga porabite v modulu, ALI čas, ki vam je ostal do konca izziva!

Italian: Questo è il timer <fontcolor="#007af2"></fontcolor> ! Misura il tempo trascorso in un modulo O il tempo rimasto per completare una sfida!

In other words, the word timer is moved out of the tag, leaving the tag empty. This happens when the tagHandling option is set to either XML or HTML. If I set tagHandling to off, the tags are placed correctly, but other problems occur in my text because tags are no longer handled.

To Reproduce Steps to reproduce the behavior: Can be reproduced in the DeepL API Simulator: https://www.deepl.com/en/docs-api/simulator/

  1. Go to the DeepL API Simulator
  2. Copy the text That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time you spend in a module OR the time you have left to complete a challenge! in the Text field.
  3. Set the target language to Slovenian or Italian
  4. Set tagHandling to XML
  5. Click on Send and compare the result with the one produced when tagHandling is set to off
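
For anyone reproducing this outside the simulator, the steps above roughly correspond to two JSON request bodies for the DeepL REST endpoint POST /v2/translate, one with tag handling and one without. This is only a sketch: buildRequest and TranslateRequest are hypothetical names introduced here, only the fields relevant to this report are modelled, and the request itself (including the Authorization header) is left out.

```typescript
// Subset of the JSON body for POST /v2/translate that matters for this report.
interface TranslateRequest {
  text: string[];
  target_lang: string;
  tag_handling?: "xml" | "html";
}

// Hypothetical helper: builds the body with or without tag handling.
function buildRequest(
  text: string,
  targetLang: string,
  tagHandling?: "xml" | "html"
): TranslateRequest {
  const body: TranslateRequest = { text: [text], target_lang: targetLang };
  if (tagHandling !== undefined) {
    body.tag_handling = tagHandling;
  }
  return body;
}

const reproText =
  'That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time you spend in a module OR the time you have left to complete a challenge!';

// Step 4: tag handling set to XML (the case that misplaces the tag).
const withXml = buildRequest(reproText, "SL", "xml");
// Step 5: the same text with tag handling off, for comparison.
const withoutTagHandling = buildRequest(reproText, "SL");

console.log(JSON.stringify(withXml));
console.log(JSON.stringify(withoutTagHandling));
```

Sending both bodies and comparing the two translations should show the difference described in step 5.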

Expected behavior The correct text should be:

Slovenian: To je <fontcolor="#007af2">časomer</fontcolor>! Meri čas, ki ga porabite v modulu, ALI čas, ki vam je ostal za dokončanje izziva!

Italian: È il <fontcolor="#007af2">timer</fontcolor>! Misura il tempo trascorso in un modulo O il tempo rimanente per completare una sfida!

This is what I get when tagHandling is set to off, but turning tag handling off should not be required to get it.

What has been tested

I tried combining different options together to see if I can make it work but none of them gave me the intended result. These are the parameters that I changed:

SentenceSplitting=on,off,noNewLines
preserveFormatting=on,off
nonSplittingTags=fontcolor,null

UPDATE 07/10/2023 11:47 AM The problem seems to be that the API treats the ="#007af2" part as part of the tag name, so it does not recognize the closing tag as its match. If we add a space, <fontcolor "=#007af2"> , it works as expected. I don't know whether a fix is necessary, but support for custom attributes like this would be nice.

imanirajdoost avatar Jul 10 '23 09:07 imanirajdoost

It seems that standard XML requires a space between the tag name and an attribute, so the source text may be the actual problem.

imanirajdoost avatar Jul 10 '23 11:07 imanirajdoost

I think the problem with this XML example is that you are using the tag name (fontcolor) as if it were an attribute, which is not allowed. The following is an invalid XML document (you can check this with various online validators or against the XML standard):

<?xml version = "1.0" encoding = "UTF-8"?>
<note>
That’s the <fontcolor="#007af2">timer</fontcolor>!
</note>

This is a valid XML document (I added an attribute col with your color):

<?xml version = "1.0" encoding = "UTF-8"?>
<note>
That’s the <fontcolor col="#007af2">timer</fontcolor>!
</note>
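
If the source text cannot be changed, one option is to rewrite the tags just before translation and restore them afterwards. The sketch below is only an illustration of that idea: normalizeTags and restoreTags are hypothetical helpers, the attribute name col is arbitrary, and the regular expressions assume simple tags of the form <name="value"> with double-quoted values.

```typescript
// Hypothetical helper: rewrites <name="value"> to <name col="value"> so that
// the opening tag becomes well-formed XML.
function normalizeTags(text: string, attrName: string = "col"): string {
  // Matches an opening tag whose name is immediately followed by ="...".
  const malformedOpenTag = /<([A-Za-z][\w.-]*)="([^"]*)">/g;
  return text.replace(
    malformedOpenTag,
    (_match: string, name: string, value: string) =>
      `<${name} ${attrName}="${value}">`
  );
}

// Hypothetical helper: reverses normalizeTags on the translated text.
// attrName is assumed to contain no regular expression metacharacters.
function restoreTags(text: string, attrName: string = "col"): string {
  const normalizedOpenTag = new RegExp(
    `<([A-Za-z][\\w.-]*) ${attrName}="([^"]*)">`,
    "g"
  );
  return text.replace(
    normalizedOpenTag,
    (_match: string, name: string, value: string) => `<${name}="${value}">`
  );
}

const originalText = 'That’s the <fontcolor="#007af2">timer</fontcolor>!';
const normalizedText = normalizeTags(originalText);

console.log(normalizedText);
// That’s the <fontcolor col="#007af2">timer</fontcolor>!
console.log(restoreTags(normalizedText) === originalText);
// true
```

The normalized string is what would be sent with tagHandling set to XML, and restoreTags would be applied to the translation that comes back.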

JanEbbing avatar Jul 10 '23 11:07 JanEbbing

@JanEbbing That is correct. However, there is still an issue when a tag listed in ignoreTags is nested inside another XML tag (which should be valid from the XML standard's point of view).

This example demonstrates the problem:

Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> We missed you! Are you ready for this path?

where ignore is added to the ignoreTags list. In this case the result in Slovenian is:

Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>! Pogrešali smo vas! Ste pripravljeni na to pot?</gs>

Whereas it should be:

Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> Pogrešali smo vas! Ste pripravljeni na to pot?

In other words, the closing </gs> tag ends up in the wrong place. Again, if we set tagHandling to off, the <gs> tag stays where it belongs, but [SCHOOL_NAME] gets translated because the ignoreTags setting no longer applies.

Any ideas for this problem?

imanirajdoost avatar Jul 10 '23 12:07 imanirajdoost

Putting the exclamation mark outside the gs tag fixes this for me (I assume this is because we do sentence splitting, and the exclamation mark that ends this sentence is inside the XML tag). Maybe that helps?

Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore></gs>! We missed you! Are you ready for this path?

=> Dobrodošli nazaj v <gs><ignore>[SCHOOL_NAME]</ignore></gs>! Pogrešali smo te! Ste pripravljeni na to pot?

But I agree this is not a good response from our API. I will check internally with another team.
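
The workaround of moving the sentence-ending punctuation outside the tag could also be applied automatically before the text is sent. The following is a minimal sketch under that assumption: movePunctuationOutsideTags is a hypothetical helper, it only handles ., ! and ? directly in front of one or more closing tags, and it changes where the punctuation sits relative to the tag in the output.

```typescript
// Hypothetical helper: moves sentence-ending punctuation that sits directly
// before one or more closing tags to just after those tags.
function movePunctuationOutsideTags(text: string): string {
  // Group 1: run of sentence-ending punctuation.
  // Group 2: one or more consecutive closing tags that follow it.
  const punctuationBeforeClose = /([.!?]+)((?:<\/[A-Za-z][\w.-]*>)+)/g;
  return text.replace(punctuationBeforeClose, "$2$1");
}

const nestedSource =
  "Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> We missed you! Are you ready for this path?";

console.log(movePunctuationOutsideTags(nestedSource));
// Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore></gs>! We missed you! Are you ready for this path?
```

This only sidesteps the case reported here; other tag and punctuation combinations may still be split differently by the API.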

JanEbbing avatar Jul 10 '23 13:07 JanEbbing

You're right, that resolves the problem in this case. I suspect other examples will have the same issue, so I'll add them here if I find any, to help the team resolve it.

imanirajdoost avatar Jul 10 '23 14:07 imanirajdoost