pontoon icon indicating copy to clipboard operation
pontoon copied to clipboard

DTD serializer can't serialize message with both ' and "

Open bugzilla-to-github opened this issue 8 years ago • 14 comments

This issue was created automatically by a script.

Bug 1402560

Bug Reporter: @Pike CC: @flodolo, @Delphine, @mathjazz, [email protected]

Pontoon introduced an XML parsing error in https://hg.mozilla.org/l10n-central/sr/rev/9c658940c458#l3.12, where it doesn't escape ' or " when both show up in the message.

bugzilla-to-github avatar Sep 22 '17 15:09 bugzilla-to-github

Comment Author: @Pike

We recently changed stuff about escaping in bug #1193860, I hope we didn't regress things.

bugzilla-to-github avatar Sep 22 '17 16:09 bugzilla-to-github

Comment Author: @flodolo

The history shows me trying to fix the error and failing https://pontoon.mozilla.org/sr/firefox/browser/chrome/browser/safebrowsing/phishing-afterload-warning-message.dtd/?search=safeb.blocked.malwarePage.learnMore&string=171995

Apparently, escaping the " in the string doesn't help either, for now I've used “” to get them builds. CCing Marko too, so he knows what's going on with flod messing with his strings.

bugzilla-to-github avatar Sep 22 '17 16:09 bugzilla-to-github

Comment Author: Marko Andrejić <[email protected]>

Hey Axel and Flod, I apologize for this and thanks for help. I was wondering why some strings I translated was set as suggested and not as accepted. I will change the rest of the strings with “”.

Should I use them from now on instead regular ' and "?

bugzilla-to-github avatar Sep 23 '17 02:09 bugzilla-to-github

Comment Author: @flodolo

(In reply to Marko Andrejić from comment #3)

Hey Axel and Flod, I apologize for this and thanks for help. I was wondering why some strings I translated was set as suggested and not as accepted. I will change the rest of the strings with “”.

Should I use them from now on instead regular ' and "?

No need to apologize, it's not your fault ;-)

Pontoon should escape those "" when exporting, but it's not doing its job.

bugzilla-to-github avatar Sep 23 '17 02:09 bugzilla-to-github

Comment Author: @flodolo

One more case in Breton (this is the fix) https://hg.mozilla.org/l10n-central/br/rev/e553509efb40

bugzilla-to-github avatar Sep 28 '17 22:09 bugzilla-to-github

Comment Author: @mathjazz

It turns out this is a bug in silme. It's been around ever since Pontoon started using it around 2 years ago.

It's reproducible on both, Pontoon's fork and silme upstream:

--

  1. Create a file /path/to/file.dtd with the following content:
  1. Execute this snippet:

import codecs from silme.format.dtd import FormatParser

path = '/path/to/file.dtd'

with codecs.open(path, 'r', 'utf-8') as f: structure = FormatParser.get_structure(f.read()) structure.modify_entity('key', 'single ' and double " quote')

with codecs.open(path, 'w', 'utf-8') as f: f.write(FormatParser.dump_structure(structure))

  1. Output file is corrupted:

--

Is it OK if we make the silme DTD serializer simly replace " with “ for strings that contain both, ' and "?

That shouldn't have any impact on the android strings with quotes, which are taken care of before they reach the serializer: https://github.com/mozilla/pontoon/blob/master/pontoon/sync/formats/silme.py#L130

Properly escaping " (") and ' (') is more complex, because it also requires changes on the parser.

bugzilla-to-github avatar Sep 29 '17 05:09 bugzilla-to-github

Comment Author: @flodolo

(In reply to Matjaz Horvat [:mathjazz] from comment #6)

It turns out this is a bug in silme. It's been around ever since Pontoon started using it around 2 years ago.

How come we haven't seen it in a long time? No strings with both quotes?

Is it OK if we make the silme DTD serializer simly replace " with “ for strings that contain both, ' and "?

Note that you would need to replace "" with “” (opening and closing).

Properly escaping " (") and ' (') is more complex, because it also requires changes on the parser.

Are you sure the parser requires work? I see plenty of escaped quotes in source files, e.g. https://hg.mozilla.org/mozilla-central/file/default/mobile/android/base/locales/en-US/android_strings.dtd#l9

bugzilla-to-github avatar Sep 29 '17 10:09 bugzilla-to-github

Comment Author: @Pike

I don't think it's OK, nor that it's required. We should use " or '. This should be independent of the \ magic we need for Android.

bugzilla-to-github avatar Sep 29 '17 16:09 bugzilla-to-github

Comment Author: @mathjazz

Flod: I guess we were just lucky to not run into those cases before. We handle escaped quotes only for the "mobile/android/base" case.

Axel: What if we apply the Android logic to all other strings and always serialize " as " and ' as ' (and also unescape all variants on read)?

bugzilla-to-github avatar Oct 02 '17 03:10 bugzilla-to-github

Comment Author: @Pike

First up, we need to straighten up what we call escaping in this bug.

In the context of this bug, escaping means " -> ", ' -> ', & -> &. Unescaping means " -> ", and \ -> .

Just came across https://github.com/mathjazz/silme/commit/f4dc4540334a2edfff885ffc1c35d123ab92dfbb which seems to have messed with related concepts?

We shouldn't always escape, I think that's gonna be hell in diffs.

Looking at https://pontoon.mozilla.org/ach/firefox/browser/chrome/overrides/netError.dtd/?search=longDesc2&string=78885, we also don't unescape at all?

Handling this bug would be easier if we actually reported on errors (sic). We need to either do that, or redefine on what a string is in the context of DTD files. Both of which is a healthy amount of code.

I wonder if we can get data on which behavior inside pontoon would be better:

  • mostly raw text as visible in the DTD
  • mostly rendered text as visible in the UI

bugzilla-to-github avatar Oct 02 '17 06:10 bugzilla-to-github

Comment Author: @flodolo

(In reply to Axel Hecht [:Pike] from comment #10)

We shouldn't always escape, I think that's gonna be hell in diffs.

+1 on this. The first landing in Fennec in the last cycle makes diffs unusable.

Handling this bug would be easier if we actually reported on errors (sic). We need to either do that, or redefine on what a string is in the context of DTD files. Both of which is a healthy amount of code.

Errors would help as long as we provide a solution in the error message (like we do for mobile DTDs).

bugzilla-to-github avatar Oct 02 '17 13:10 bugzilla-to-github

Comment Author: @mathjazz

How 'bout we limit the changes only to strings that contain both, " and ', and:

  • escape " as " and ' as ' on serialization
  • unescape " as " and ' as ' on parsing

We'd also need to run a data migration to unescape existing Entity (0) and Translation (127) objects that contain both of these two characters to avoid re-importing them.

That's a pretty simple fix, which avoids messing with the vast majority of strings and changing the general logic of handling the format we're intending to drop anyways.

bugzilla-to-github avatar Oct 04 '17 23:10 bugzilla-to-github

Comment Author: @mathjazz

Lowering priority further, because the bug hits us rarely and the probability decreases as we migrate more strings to FTL.

bugzilla-to-github avatar Jan 04 '18 22:01 bugzilla-to-github

Lowering priority further, because we no longer use DTD files in our priority projects.

mathjazz avatar Apr 04 '24 08:04 mathjazz

This was fixed by #3587.

eemeli avatar Apr 24 '25 12:04 eemeli