Gettext
Gettext copied to clipboard
Duplicate translations with different XLIFF unit IDs are not maintained
This is a knock-on issue from https://github.com/oscarotero/Gettext/issues/220. (I hope you won't regret adding that already!)
An XLIFF file may contain several units with the same translation content... (silly example but you get the point)
<unit id="am.permanently">
<segment>
<source>am</source>
<target>soy</target>
</segment>
</unit>
<unit id="am.temporarily">
<segment>
<source>am</source>
<target>estoy</target>
</segment>
</unit>
However, when loading the translation from the XLIFF file, only one gets loaded into $translations
. This is because the two will have the same ID, based only on the source text ("am"), without considering the unit ID.
Thus trying to save the loaded XLIFF again will drop one of the two entries.
@oscarotero, I am happy to code up a solution, but as with https://github.com/oscarotero/Gettext/issues/220 I'd like a little guidance about the best option because I'm not familiar with how the translation IDs are used elsewhere in the library.
Mmm, I don't know how to solve this.
On the one hand, the unit id is not unique, on the other hand, the source is not unique either. I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?
I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?
I don't think XLIFF has any requirements of uniqueness involving the source text. The XLIFF 2.0 spec does talk about uniqueness of IDs, but that's pretty much it. For example, by my read, each unit (=Translation
class instance) can be uniquely identified by a combination of unit ID and file ID (see section 3.1).
I think it'll be necessary to inform the mechanism for ID generation (currently the static generateID
function) based on the format of the file being sourced.
In gettext, we have the following structure:
- domain
- context
- original
Right now, the equivalent in Xliff would be:
- domain (the
id
of the file) - context (the
<node category="context">
element) - original (the
<source>
element)
A possible solution could be to get the id
of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:
# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"
@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards? My understanding of .po files is that msgid
should always be the source language, likewise source
elements in XLIFF.
I was thinking in how to use these translations in the html templates. Using the unit-id as the msgid, you can get the translations by that id:
<h1><?= __('unit-id') ?></h1>
But, at the same time, if you scan this template searching for gettext entries, this translation will have "unit-id" string as the original untranslated string. So, finally, we can assume that the unit id and the source are the same thing (or at least, share the same purpose: to identify a translation).
I don't know, it's hard to fit the Xliff format into gettext without lossing data. I'm open to other possible workaround.
@oscarotero, we already have large parts of an internationalization toolset in place, including a caching layer that'll go in front of Gettext and tools in our PHP code and Smarty templates for getting translations. What I'm hoping to use Gettext for is specifically its XLIFF load/save features -- i.e. just the XLIFF extractor and generator, but not the Translator
class. My operations on the translations always deal with the full list of translations -- i.e. foreach($translations as $translation) {...}
We're introducing Gettext in an attempt to move from our own bespoke XML files to the XLIFF standard. We're using symbolic IDs instead of English-language text in our source code, hence the use of the unit ID in XLIFF. Our goal is to facilitate the use of commodity translation tools like POEdit and Weblate, thus we need the XLIFF content to be usable within those tools (and I think using msgid
or <source>
with the unit ID will break them).
I think the purpose of generateID
etc. is to make it easy for the Translator
class to fetch a translation quickly, correct? I can see two options:
- Disentangle the loading of the translations (the
fromString
method in the Extractor) from the indexing/deduplication of the translations (currently inTranslations::offsetSet
). Currently there's no way to load or save translation files without going through the Translations class, which has built-in resolution of what it considers to be duplicates. (Maybe the indexing/deduplication tools belong in Translator rather than Translations?) - Alternatively, support a richer translation ID that doesn't just consider the segment and source text. Looking up translations by source text alone could continue to return the first match, but going through a list of all translations (
foreach
) would no longer be missing translations.
Maybe the indexing/deduplication tools belong in Translator rather than Translations?
Translation has the getId()
and the static generateId()
methods, so the deduplication belongs to Translation
. This value is important not only to search translations but also to merge different Translations
instances.
The right solution could be add a setId()
method, so you could choose another different way to define the ids/deduplication and use generateId
internally as a fallback. The only problem is Translations::find
that currently depends on this method to find the translations. Removing or changing this method is a breaking change, and I don't want to release a major version right now. And if you define a different id, you can't find a specific translation by context + original. Using a foreach
is discarded for performance reasons.
Maybe we could add the Translation::setId()
method with a warning that using this method makes imposible to search the translation in a Translations
instance using the context + original.
And because Xliff generator use this method to generate ids, include the warning here too.
Thanks, @oscarotero, I submitted this according to the suggestions you made: https://github.com/oscarotero/Gettext/pull/225
I think there is still one issue to resolve, as noted in the PR comment.
From your earlier comment:
A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:
# XLIFF_SOURCE: source-element msgid "unit-id" msgstr "translation"
My reaction was...
@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards?
...but it turns out many projects actually do use PO files written this way (with symbolic IDs in the msgid
), and it's supported in translation tools like Weblate (known there as "Monolingual gettext").
Just documenting something I learned in the hopes it'll be useful to someone else :)
I'm maintaining XLIFF tools for a free site: https://xliff.tools/
Does the "ID Replacement" tool do what's required? If not, I'm interested in learning more about the issue and possibly implementing it.
Thanks!