Gettext Duplicate translations with different XLIFF unit IDs are not maintained

This is a knock-on issue from https://github.com/oscarotero/Gettext/issues/220. (I hope you won't regret adding that already!)

An XLIFF file may contain several units with the same translation content... (silly example but you get the point)

<unit id="am.permanently">
  <segment>
    <source>am</source>
    <target>soy</target>
  </segment>
</unit>
<unit id="am.temporarily">
  <segment>
    <source>am</source>
    <target>estoy</target>
  </segment>
</unit>

However, when loading the translation from the XLIFF file, only one gets loaded into $translations. This is because the two will have the same ID, based only on the source text ("am"), without considering the unit ID.

Thus trying to save the loaded XLIFF again will drop one of the two entries.

Sep 16 '19 23:09 asmecher

@oscarotero, I am happy to code up a solution, but as with https://github.com/oscarotero/Gettext/issues/220 I'd like a little guidance about the best option because I'm not familiar with how the translation IDs are used elsewhere in the library.

Sep 16 '19 23:09 asmecher

Mmm, I don't know how to solve this.

On the one hand, the unit id is not unique, on the other hand, the source is not unique either. I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

Sep 17 '19 21:09 oscarotero

I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

I don't think XLIFF has any requirements of uniqueness involving the source text. The XLIFF 2.0 spec does talk about uniqueness of IDs, but that's pretty much it. For example, by my read, each unit (=Translation class instance) can be uniquely identified by a combination of unit ID and file ID (see section 3.1).

I think it'll be necessary to inform the mechanism for ID generation (currently the static generateID function) based on the format of the file being sourced.

Sep 17 '19 21:09 asmecher

In gettext, we have the following structure:

domain
context
original

Right now, the equivalent in Xliff would be:

domain (the id of the file)
context (the <node category="context"> element)
original (the <source> element)

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

Sep 17 '19 22:09 oscarotero

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards? My understanding of .po files is that msgid should always be the source language, likewise source elements in XLIFF.

Sep 17 '19 23:09 asmecher

I was thinking in how to use these translations in the html templates. Using the unit-id as the msgid, you can get the translations by that id:

<h1><?= __('unit-id') ?></h1>

But, at the same time, if you scan this template searching for gettext entries, this translation will have "unit-id" string as the original untranslated string. So, finally, we can assume that the unit id and the source are the same thing (or at least, share the same purpose: to identify a translation).

I don't know, it's hard to fit the Xliff format into gettext without lossing data. I'm open to other possible workaround.

Sep 18 '19 08:09 oscarotero

@oscarotero, we already have large parts of an internationalization toolset in place, including a caching layer that'll go in front of Gettext and tools in our PHP code and Smarty templates for getting translations. What I'm hoping to use Gettext for is specifically its XLIFF load/save features -- i.e. just the XLIFF extractor and generator, but not the Translator class. My operations on the translations always deal with the full list of translations -- i.e. foreach($translations as $translation) {...}

We're introducing Gettext in an attempt to move from our own bespoke XML files to the XLIFF standard. We're using symbolic IDs instead of English-language text in our source code, hence the use of the unit ID in XLIFF. Our goal is to facilitate the use of commodity translation tools like POEdit and Weblate, thus we need the XLIFF content to be usable within those tools (and I think using msgid or <source> with the unit ID will break them).

I think the purpose of generateID etc. is to make it easy for the Translator class to fetch a translation quickly, correct? I can see two options:

Disentangle the loading of the translations (the fromString method in the Extractor) from the indexing/deduplication of the translations (currently in Translations::offsetSet). Currently there's no way to load or save translation files without going through the Translations class, which has built-in resolution of what it considers to be duplicates. (Maybe the indexing/deduplication tools belong in Translator rather than Translations?)
Alternatively, support a richer translation ID that doesn't just consider the segment and source text. Looking up translations by source text alone could continue to return the first match, but going through a list of all translations (foreach) would no longer be missing translations.

Sep 18 '19 15:09 asmecher

Maybe the indexing/deduplication tools belong in Translator rather than Translations?

Translation has the getId() and the static generateId() methods, so the deduplication belongs to Translation. This value is important not only to search translations but also to merge different Translations instances.

The right solution could be add a setId() method, so you could choose another different way to define the ids/deduplication and use generateId internally as a fallback. The only problem is Translations::find that currently depends on this method to find the translations. Removing or changing this method is a breaking change, and I don't want to release a major version right now. And if you define a different id, you can't find a specific translation by context + original. Using a foreach is discarded for performance reasons.

Maybe we could add the Translation::setId() method with a warning that using this method makes imposible to search the translation in a Translations instance using the context + original.

And because Xliff generator use this method to generate ids, include the warning here too.

Sep 18 '19 21:09 oscarotero

Thanks, @oscarotero, I submitted this according to the suggestions you made: https://github.com/oscarotero/Gettext/pull/225

I think there is still one issue to resolve, as noted in the PR comment.

Sep 19 '19 01:09 asmecher

From your earlier comment:

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:
# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

My reaction was...

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards?

...but it turns out many projects actually do use PO files written this way (with symbolic IDs in the msgid), and it's supported in translation tools like Weblate (known there as "Monolingual gettext").

Just documenting something I learned in the hopes it'll be useful to someone else :)

Sep 19 '19 21:09 asmecher

I'm maintaining XLIFF tools for a free site: https://xliff.tools/

Does the "ID Replacement" tool do what's required? If not, I'm interested in learning more about the issue and possibly implementing it.

Thanks!

Jan 19 '22 18:01 ShawnMilo

Gettext Gettext copied to clipboard

Duplicate translations with different XLIFF unit IDs are not maintained

Gettext
Gettext copied to clipboard