XML translation drops numbers from the translated output

Open matsbert opened this issue 3 months ago • 0 comments

When translating XML documents, I have problems with numbers being dropped in the translated result, especially in tables.

Input:

						<row>
							<entry>
								<para>22</para>
							</entry>
							<entry>
								<para>Ventilations- och defrostermunstycken SB styrhytt</para>
							</entry>
						</row>
						<row>
							<entry>
								<para>23</para>
							</entry>
							<entry>
								<para>Luft från intagsaggregatet till ventilationsmunstycken i styrhytt</para>
							</entry>
						</row>
						<row>
							<entry>
								<para>1100</para>
							</entry>
							<entry>
								<para>Luft från intagsaggregatet till ventilationsmunstycken i styrhytt</para>
							</entry>
						</row>

In the translated output below, you can see digits being dropped from the numbers: 23 becomes 2 and 1100 becomes 11. This is serious because such errors are not always easy to spot.

                       <row>
                            <entry>
                                <para>22</para>
                            </entry>
                            <entry>
                                <para>Ventilation and defroster nozzles SB control cabin</para>
                            </entry>
                        </row>
                        <row>
                            <entry>
                                <para>2</para>
                            </entry>
                            <entry>
                                <para>Air from intake unit to ventilation nozzles in control cabin</para>
                            </entry>
                        </row>
		    <row>
                            <entry>
                                <para>11</para>
                            </entry>
                            <entry>
                                <para>Air from intake unit to ventilation nozzles in wheelhouse</para>
                            </entry>
                        </row>
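Because the dropped digits are hard to spot by eye, a small check that compares the numbers in the source with those in the translation can help flag affected files. The following is only a sketch: the helper names `numbers_in` and `dropped_numbers` are hypothetical, tags are stripped with a simple regular expression rather than a real XML parser, and a translation that legitimately reformats a number (for example a different thousands separator) would be reported as a false positive.

```python
import re
from collections import Counter

# Integers and simple decimals such as 22, 1100, 3.5 or 1,5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(xml_text: str) -> Counter:
    """Count every number that appears in the text content (tags stripped)."""
    text_only = re.sub(r"<[^>]+>", " ", xml_text)
    return Counter(NUMBER.findall(text_only))


def dropped_numbers(source_xml: str, translated_xml: str) -> Counter:
    """Numbers that occur more often in the source than in the translation."""
    return numbers_in(source_xml) - numbers_in(translated_xml)


# Reduced version of the rows shown above.
source = "<row><entry><para>23</para></entry></row><row><entry><para>1100</para></entry></row>"
translated = "<row><entry><para>2</para></entry></row><row><entry><para>11</para></entry></row>"

missing = dropped_numbers(source, translated)
if missing:
    print("Numbers missing from translation:", dict(missing))
```

This only detects the problem after the fact; it does not prevent the digits from being dropped.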

The code I am using:

result = deepl_client.translate_text(
    text,
    tag_handling="xml",
    source_lang="SV",
    target_lang="EN-GB",
    model_type="prefer_quality_optimized",
    non_splitting_tags="div",
    split_sentences="nonewlines"
)

I have a cumbersome workaround for this particular issue, but it is not safe: I pre-process the input file as a string, using regular expressions to wrap particular numbers in a 'fake' element, and then pass that element to the 'ignore_tags' option.

However, this will not find all instances of the problem.
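A minimal sketch of that workaround, for illustration only. The element name `keepnum` and the helper names are hypothetical, and the pattern only covers a `<para>` whose entire content is a number, so numbers embedded in running text are not protected. The `translate_protected` function assumes a `deepl_client` like the one in the call above and passes the fake element through `ignore_tags`.

```python
import re

# Hypothetical placeholder element; any name not used in the real documents works.
FAKE_TAG = "keepnum"

# Matches a <para> whose entire content is a number, e.g. <para>1100</para>.
NUMERIC_PARA = re.compile(r"(<para>)(\s*\d+(?:[.,]\d+)*\s*)(</para>)")


def protect_numbers(xml_text: str) -> str:
    """Wrap number-only <para> content in the fake element."""
    return NUMERIC_PARA.sub(rf"\1<{FAKE_TAG}>\2</{FAKE_TAG}>\3", xml_text)


def unprotect_numbers(xml_text: str) -> str:
    """Remove the fake element again after translation."""
    return xml_text.replace(f"<{FAKE_TAG}>", "").replace(f"</{FAKE_TAG}>", "")


def translate_protected(deepl_client, text: str) -> str:
    """Translate with number-only cells excluded via ignore_tags."""
    result = deepl_client.translate_text(
        protect_numbers(text),
        tag_handling="xml",
        source_lang="SV",
        target_lang="EN-GB",
        model_type="prefer_quality_optimized",
        non_splitting_tags="div",
        split_sentences="nonewlines",
        ignore_tags=FAKE_TAG,  # content of <keepnum> is left untranslated
    )
    return unprotect_numbers(result.text)
```

For example, `protect_numbers("<para>23</para>")` yields `<para><keepnum>23</keepnum></para>`, while a paragraph containing ordinary text is left unchanged.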

Is there anything else that can be done?

Sep 01 '25 13:09 matsbert