LO fails to load document after saving with odftoolkit due to invalid UTF-16 entities

Open FlorianBruckner opened this issue 4 years ago • 12 comments

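Xalan contains a nasty bug: its serializer writes a character outside the Basic Multilingual Plane as two separate character references, one per UTF-16 surrogate, which is invalid XML and leads to a corrupt document. E.g. this input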

<text:span text:style-name="T19">𝜈</text:span>

is changed to this when the document is saved with odftoolkit:

<text:span text:style-name="T19">&#55349;&#57096;</text:span>

More information about the root cause can be found here: https://issues.apache.org/jira/browse/XALANJ-2419
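
For context, the two numbers in the broken output are the UTF-16 surrogate code units of a single character (U+1D708). A minimal sketch of the difference between writing one reference per code unit and one per code point; the class and method names are made up for illustration and are not Xalan code:

```java
// Illustrative only: mimics the two serialization strategies, not Xalan's implementation.
final class SurrogateReferenceDemo {

    /** Faulty strategy: one numeric character reference per UTF-16 code unit. */
    static String perCodeUnit(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append("&#").append((int) text.charAt(i)).append(';');
        }
        return sb.toString();
    }

    /** Correct strategy: one numeric character reference per Unicode code point. */
    static String perCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D708 is the character from the example above.
        String nu = new String(Character.toChars(0x1D708));
        System.out.println(perCodeUnit(nu));  // &#55349;&#57096;
        System.out.println(perCodePoint(nu)); // &#120584;
    }
}
```

Surrogate code units (U+D800 to U+DFFF) are not legal XML characters, so a reference to one is rejected by a conforming parser even though the pair would form a valid character in a Java string.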

As it seems unlikely that there will ever be a new Xalan release including a fix for this, one option (and that is what I am doing now) is to replace the Xalan serializer dependency with a known good version, e.g.

        <dependency>
            <groupId>org.docx4j.org.apache</groupId>
            <artifactId>xalan-serializer</artifactId>
            <version>11.0.0</version>
        </dependency>

I cannot vouch for the integrity of this package but I have verified that it actually fixes the invalid encoding.
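
One rough way to check a saved file for this problem is to scan the serialized XML for character references that point into the surrogate range. This is a sketch under assumptions: the class and method names are hypothetical, and the scan works on raw text, so it would also flag a matching literal inside a comment or CDATA section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ODF Toolkit or Xalan.
final class SurrogateReferenceCheck {

    // Matches decimal (&#55349;) and hexadecimal (&#xD835;) character references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    /** Returns true if the text contains a reference to U+D800..U+DFFF. */
    static boolean hasSurrogateReference(String xml) {
        Matcher m = NCR.matcher(xml);
        while (m.find()) {
            try {
                long value = m.group(1) != null
                        ? Long.parseLong(m.group(1), 16)
                        : Long.parseLong(m.group(2));
                if (value >= 0xD800 && value <= 0xDFFF) {
                    return true;
                }
            } catch (NumberFormatException e) {
                // Number too large for a long: invalid, but not a surrogate reference.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSurrogateReference("<text:span>&#55349;&#57096;</text:span>"));
        System.out.println(hasSurrogateReference("<text:span>&#120584;</text:span>"));
    }
}
```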

FlorianBruckner avatar Nov 22 '21 16:11 FlorianBruckner

How is this library actually used? I can only find the file odfdom/src/main/java/org/odftoolkit/odfdom/IElementWriter.java, which defines an interface, but this interface appears to be unused. I'm probably missing something.

mistmist avatar Nov 23 '21 10:11 mistmist

This library is a replacement for xalan:serializer. The xalan serializer is used to serialize back to XML, and this is what causes my problem.

FlorianBruckner avatar Nov 23 '21 11:11 FlorianBruckner

I also ran into this issue when trying to use the library to export user generated content. User generated content often contains Unicode emojis ("🙂") which trigger this incorrect behavior leading to broken docs.

dgerhardt avatar Jul 13 '23 14:07 dgerhardt

Apache Xalan-Java did a 2.7.3 release in April: https://xalan.apache.org/xalan-j/readme.html#notes_latest Seven issues are listed as fixed, but none is especially close to what you describe. Still, it is worth a try! ODF Toolkit already refers to the latest Xalan release on master; the 0.11.0 release still uses 2.7.2, but I have just published a new 0.12.0-SNAPSHOT release, so you can test it in your environments.

If this problem still exists, I would suggest you raise it with the Apache Xalan developers: https://xalan.apache.org/xalan-j/contact_us.html It might help to check their issue tracker first, then file an issue and ask on the mailing list to get a quick response.

Please note that they still seem to use SVN and have only a read-only GitHub mirror. Nevertheless, some people have opened pull requests, and some of them look like solutions close to the problem you mentioned: https://github.com/apache/xalan-j/pulls

Good luck! Svante

svanteschubert avatar Jul 13 '23 17:07 svanteschubert

Thanks for the reply, @svanteschubert!

I've tried overriding the Xalan dependency with 2.7.3 but unfortunately the latest version doesn't fix this issue. For now, I've replaced the dependency with the fork by docx4j which fixes it.

Three related issues already exist in their tracker and are marked as major bugs; the oldest one was reported 15 years ago. Looking at the SVN/Git history, it seems the project was completely unmaintained for nearly a decade. But since last year, there has been some activity, so I'm slightly hopeful that they will pick up the existing fixes in the near future.

dgerhardt avatar Jul 18 '23 15:07 dgerhardt

@dgerhardt Hi Daniel,

I suggest writing to the Apache Xalan dev list and telling them about the problem and the solution. The more you are able to lower the bar for a release (their work), the likelier it is that they will fix it. For instance, the docx4j fork has a solution, so you might point to it! Or try to motivate them to take over that task! :-)

Godspeed, Daniel! Svante

svanteschubert avatar Jul 19 '23 17:07 svanteschubert

@dgerhardt @FlorianBruckner

Could you please do me two favours, perhaps one each? ;-)

  1. Provide a draft PR with a test case reproducing this issue.
  2. Please paste the Apache Xalan issue for this issue, so we can keep track of it.

Thanks in advance. Svante

svanteschubert avatar May 13 '25 14:05 svanteschubert

@svanteschubert I've found three issues on the Xalan tracker which are related to UTF character encoding problems and are probably related to this issue:

  • https://issues.apache.org/jira/browse/XALANJ-2419
  • https://issues.apache.org/jira/browse/XALANJ-2725
  • https://issues.apache.org/jira/browse/XALANJ-2730

dgerhardt avatar May 13 '25 15:05 dgerhardt

@dgerhardt Are you certain that one of them covers your problem? If not, you may need to extend one of the above or file a new one! :-)

svanteschubert avatar May 13 '25 16:05 svanteschubert

@svanteschubert Since the issues focus on the technical side rather than the affected use cases, I'm not 100% sure that all edge cases are covered, but they are definitely related.

I was able to verify that this issue has been fixed (at least for emoji) on Xalan's master branch. I've opened PR #375 with a unit test which succeeds when building against the latest Xalan snapshot. Unfortunately, there is no release including the fixes yet.
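
For anyone who wants to try a serializer outside of ODF Toolkit, a check along these lines may help. It is only a sketch and not the test from the PR: it uses the standard JAXP identity transform, the class name and the unprefixed `span` element are placeholders, and which serializer JAXP actually picks depends on the classpath and configuration, so the result can differ between environments.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical stand-alone check; it does not go through the ODF Toolkit API.
final class SerializerRoundTrip {

    /** Serializes <span>text</span> with whatever serializer JAXP resolves. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element span = doc.createElement("span");
            span.setTextContent(text);
            doc.appendChild(span);

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // U+1D708 from the first post, plus U+1F642 as an emoji example.
        String sample = new String(Character.toChars(0x1D708))
                + new String(Character.toChars(0x1F642));
        String xml = serialize(sample);
        System.out.println(xml);
        // 55349/57096 and 55357/56898 are the surrogate code units of the two characters.
        boolean broken = xml.contains("&#55349;") || xml.contains("&#57096;")
                || xml.contains("&#55357;") || xml.contains("&#56898;");
        System.out.println(broken ? "surrogate references found" : "no surrogate references");
    }
}
```

Running it once with the affected Xalan serializer on the classpath and once with a replacement should show whether the swap changes the output for these two characters.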

dgerhardt avatar May 13 '25 17:05 dgerhardt

@dgerhardt Thanks a lot for your patch (it works by failing as expected)! One last wish: could you please also test the <text:span text:style-name="T19">&#55349;&#57096;</text:span> case from the first post of this issue against Xalan's master branch? Just to make sure that the complete issue will be covered by the next Xalan release (very likely) :-)

svanteschubert avatar May 13 '25 18:05 svanteschubert

@svanteschubert Sure. I've updated the PR. The test for 𝜈 also succeeds with the Xalan snapshot.

dgerhardt avatar May 14 '25 07:05 dgerhardt