
XSL Transformation results in Mojibake and does not write out "encoding" to resulting XML file

Open clang88 opened this issue 2 years ago • 3 comments

I'm using the latest Notepad++ (8.4) with XML Tools 3.1.1.13.

My XSL starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:utils="urn:utils" exclude-result-prefixes="utils">
  <xsl:output method="xml" encoding="UTF-8" version="1.0" indent="yes" />
  <xsl:strip-space elements="*"/>

What I would expect is that my originally UTF-8 encoded XML is transformed with all special characters preserved and encoding="UTF-8" added to the declaration. What I get, however, is this:

<?xml version="1.0"?>
<mtf>
  <conceptGrp>
    <concept>9</concept>
    <system type="entryClass">Unspecified</system>
    <languageGrp>
      <language type="English" lang="EN" />
      <descripGrp>
        <info>Definition--9:</info>gekrümmtes Trassierungselement</descripGrp>

Note that the "encoding" attribute is missing from the declaration. Additionally, the "ü" is displayed as the symbol "xFC" in Notepad++ and is converted back to "ü" when I copy-paste it into this window.

Running the same XSL with Notepad++ 8.1.5 and XML Tools 3.1.1.6 results in the following file:

<?xml version="1.0" encoding="UTF-8"?>
<mtf>
  <conceptGrp>
    <concept>9</concept>
    <system type="entryClass">Unspecified</system>
    <languageGrp>
      <language type="English" lang="EN" />
      <descripGrp>
        <info>Definition--9:</info>gekrümmtes Trassierungselement</descripGrp>

The "Umlaut" is, in this version, irrevocably butchered, but the "encoding" attribute is written to the declaration. I believe this might be a bug, because a different processing engine produces the expected results. Any ideas?
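One possible explanation for the "ü" form, not confirmed anywhere in this thread, is that the UTF-8 bytes of "ü" were read as Windows-1252, one character per byte. The Python sketch below only reproduces that pattern with the standard codecs; it says nothing about what XML Tools does internally.

```python
# Sketch: how "ü" can turn into "ü" when UTF-8 bytes are misread as Windows-1252.
# This illustrates the general mojibake pattern, not the XML Tools code path.
original = "gekrümmtes"

utf8_bytes = original.encode("utf-8")      # "ü" becomes the two bytes C3 BC
misread = utf8_bytes.decode("cp1252")      # each byte is read as one character
print(misread)

# If the misread text has not been modified further, the step can be undone
# by reversing it: encode back to Windows-1252, then decode as UTF-8.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired)
```

If this is what happened, the damage in the 3.1.1.6 output may be reversible in the same way, provided the file was not edited or re-saved in between.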

clang88 avatar Apr 29 '22 06:04 clang88

Is no one else experiencing this issue? Unfortunately, this makes the XSL Transformation function almost unusable, because you never know in advance what will go wrong.

I'm no seasoned developer and have no experience with C or C++, but if someone could point me to the XSLT code in the repo I can try and find what potentially is causing this issue.

clang88 avatar Sep 17 '22 18:09 clang88

In the past I haven't used the XSL feature much, but today I happened to need it and I ran head-on into the same issue, @clang88! For my specific use case, the omission of the encoding in the XML header is not too bad... BUT the character encoding issues (in your example, "ü" displaying as "xFC") are show-stoppers for me.

What appears to be happening is that, even though the current document (the source for the transformation) is UTF-8 encoded, something somewhere converts the text to ANSI (Windows-1252). This is evidenced by your "ü" becoming xFC, which is its Windows-1252/ANSI byte value. In my use case, the XML contains other punctuation characters (en-dashes, curly quotes, etc.), and these all come out of the XSL transformation as their Windows-1252 single-byte representations too: my en-dashes show up as x96 and curly apostrophes as x92. The output file CLAIMS to be UTF-8, which is why we see x96, x92, xFC, etc.: on their own, these bytes are not valid UTF-8 sequences, so Notepad++ displays them as raw byte values.
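The byte values above can be checked with a short Python sketch. It uses only the standard `cp1252` and `utf-8` codecs; the helper name `is_valid_utf8` is made up for this illustration.

```python
# Characters mentioned in this thread, written as code points.
samples = {
    "u with umlaut": "\u00fc",
    "en dash": "\u2013",
    "curly apostrophe": "\u2019",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, ch in samples.items():
    ansi = ch.encode("cp1252")   # a single byte in Windows-1252
    utf8 = ch.encode("utf-8")    # two or three bytes in UTF-8
    print(f"{name}: cp1252={ansi.hex()} utf-8={utf8.hex()} "
          f"cp1252 byte valid as UTF-8: {is_valid_utf8(ansi)}")
```

Each character is one byte in Windows-1252 (FC, 96, 92) but a multi-byte sequence in UTF-8, and none of the single bytes is valid UTF-8 by itself.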

Any chance someone would be willing to look into this? I will see if I can put together a simple test case.

lowellstewart avatar Feb 13 '23 20:02 lowellstewart

By the way... if anybody else runs into this... my "workaround" is to

  1. run the XSL transformation from XMLTools
  2. on the output file, choose Encoding > ANSI to re-interpret the current file as ANSI instead of UTF-8. (This makes x96 show up as an en-dash, etc., correctly.)
  3. THEN choose Encoding > Convert to UTF-8-BOM to ACTUALLY make the file have the desired encoding.
  4. If necessary, add or fix the encoding in the XML file header too.

The above works for me only because every non-ASCII character in my example XML also exists in Windows-1252/ANSI; it is just encoded differently in UTF-8. If my source file contained Unicode characters outside of Windows-1252, I don't know what would happen, but the workaround would presumably not recover them.
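For anyone who would rather script steps 2 and 3, here is a minimal Python sketch of the same idea, assuming the output file really contains Windows-1252 bytes. The function names `reinterpret_cp1252_as_utf8` and `fits_cp1252` are made up for this example, and step 4 (fixing the XML declaration) is not covered.

```python
def reinterpret_cp1252_as_utf8(raw: bytes, with_bom: bool = True) -> bytes:
    """Read bytes as Windows-1252 and re-encode them as UTF-8 (hypothetical helper).

    Python's cp1252 codec rejects the five bytes that Windows-1252 leaves
    undefined (81, 8D, 8F, 90, 9D), so decode() may raise on such input.
    """
    text = raw.decode("cp1252")                        # step 2: interpret as ANSI
    codec = "utf-8-sig" if with_bom else "utf-8"       # utf-8-sig prepends a BOM
    return text.encode(codec)                          # step 3: convert to UTF-8


def fits_cp1252(text: str) -> bool:
    """Return True if every character exists in Windows-1252 (hypothetical helper)."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# Bytes like the ones described above: xFC, x96 and x92 inside otherwise ASCII XML.
broken = b"<info>gekr\xfcmmtes \x96 it\x92s</info>"
fixed = reinterpret_cp1252_as_utf8(broken, with_bom=False)
print(fixed.decode("utf-8"))

# A character outside Windows-1252, such as the Greek capital omega, has no
# single-byte form at all, so it cannot survive a trip through that code page.
print(fits_cp1252("\u03a9"))
```

The last check only shows that such characters cannot be represented in Windows-1252. What XML Tools actually emits for them is not known from this thread.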

lowellstewart avatar Feb 13 '23 20:02 lowellstewart