omegat
omegat copied to clipboard
Use Jackson XML dataformat for JAXB data parser
Pull request type
Please check the type of change your PR introduces:
- [ ] Bug fix
- [x] Feature
- [ ] Documentation
- [ ] Build and release changes
- [ ] Other (describe below)
Which ticket is resolved?
-
Link: https://sourceforge.net/p/omegat/feature-requests/1582/
-
Title: Use Jackson library as XML parser
-
Link: https://sourceforge.net/p/omegat/feature-requests/1268/
-
Title: Improve TBX parsing
What does this PR change?
- Replace JAXB parser with Jackson XML parser
- Patch generated definition of gen.core.project.Masks to add annotation of @JacksonXmlElementWrapper(useWrapping=false) because of known issue.
- Update TMXWriterTest to check with system line separator
It is because XML parser can automatically change /r/n into
XML entity

/n
whenSystem.lineSeparator()
is /n - Update
TMXReader2
to detect both "lang" and "xml:lang" some implementation return "lang" from "xml:lang" but other can return "xml:lang" as is. - Drop JAXB parser from GlossaryReaderTBX class.
- Use jackson for TaasClient class
Other information
TODO
- [x] TBX: there are difference in comment content
- [ ] ~~TAAS: test~~
- [x] SRX: implement
I am 100% in favour of removing any dependency to JAXB: it is slow and highly memory consuming, and not anymore present in standard JRE after Java >= 9 But replacing it by another object mapping library solves only the last problem. In DGT-OmegaT I progressively reduced dependency to JAXB but replaced it by complete StaX analysis: code is longer but all is in the same place. If you want to see the impact in memory, you can compare DGT-OmegaT 3.4 with OmegaT 5.8 while loading a big TBX (glossary) file. Another advantage is that it removes also dependency to XML Schema, which makes possible to support formats with variants. For example, when I rewrote support of TBX in StaX, I also added basic support for TBX 3.0, whose markups are significantly different from TBX 2.0: that would have been more difficult to support two different schemas in JAXB.
Last point. After I removed all dependencies in the core of DGT-OmegaT, I later discovered a very last indirect dependency: LanguageTool uses segment-2.0.jar which uses JAXB. If we can force segmenter to use StaX (it can, but I did not understand how to force it) that would enable to definitively remove JAXB jars from OmegaT.
We already have jackson library for JSON parsing to use for MT API access. It is why I try to use Jackson XML binding to replace JAXB dependency. Jackson XML bind use StaX in it.
It works perfect to replace to parsing XML based on JAXB annotation, ~~but it makes sub-effect to TMX parser that use XMLInputStream class.~~
@t-cordonnier Thank you for head me up to GlossaryReaderTBX class.
There is no TAAS related test case in the project. Does anyone help testing TAAS?
@t-cordonnier Thank you for head me up to GlossaryReaderTBX class.
Good that you finally found my code using StaX. This should be a significant improvement when loading very big TBX files. And now it has basic support for TBX 3.0 (for example at line 85, "termEntry" is for TBX 2.0 while "conceptEntry" is for TBX 3.0). There is still work for a complete support of TBX 3.0 (actually only "DCA" style is supported) but we should add the info in ticket sourceforge 1599 once this code will be in master branch)
@t-cordonnier Thank you for head me up to GlossaryReaderTBX class.
You should have a look to org.omegat.core.segmentation.SRX class : this one also uses JAXB and should be migrated to Jackson (personally I prefer with StaX but here files are not so big, so if you prefer with Jackson, no problem)
There is no TAAS related test case in the project. Does anyone help testing TAAS?
TAAS is a server, that is not so easy to test code related to a server: it implies new queries each time you launch the test, and this may temporary break the full build chain if the server is temporarily unavailable.
@t-cordonnier Stax filter become broken when jackson XML bind enabled. Could you explain why stax filter behavior is changed and OpenXML filter and XLIFF filter become broken?
org.junit.ComparisonFailure: expected:<[This is first line.]> but was:<[[Stax Event #4]]>
Remaining 7 errors are all in stax filter test.
I'd like to move a commit that remove legacy generated jaxb file as another pull-request for helping review easier.
Could you explain why stax filter behavior is changed and OpenXML filter and XLIFF filter become broken?
My first impression is that there is a moment where the filter receives an event rather than a string.
StaX is a generic library, like JCache (to take an example we recently spoke about): it provides a set of interfaces which can potentially have multiple implementations. My filters did not specify which implementation to use. Actually they use the implementation provided in the JDK. It is probable that Jackson also contains an implementation of StaX which becomes the default as soon as it is present. You can check how to force a dedicated implementation of StaX, or I will do it later
@t-cordonnier Jackson provide woodstox
A change here depends on jackson XML parser that use woodstox StaX parser to read omegat.project
file.
Test failed; org.omegat.filters4.Xliff2FilterTest.testKey(Xliff2FilterTest.java:65)
expected: //groups/N65541xdocument/N65541bxmarksection-1/title-2
actual: //groups/N65541xdocument/N65541xdocument/N65541bxmarksection-1/title-2
@t-cordonnier Stax filter become broken when jackson XML bind enabled. Could you explain why stax filter behavior is changed and OpenXML filter and XLIFF filter become broken?
Hi Hiroshi
Sorry that I did not come back earlier. I had given to you a first answer where I explained that the change was due to differences between Woodstox and native java StaX implementation (Xerces). Later I did my own corrections but I did not want to send before it was finished.
Here is what I did: https://github.com/omegat-org/omegat/commit/15f858b97400710055c11e35c80eb296c12dcc8f For OpenXML we are quite in the same direction. I would prefer to decide about the "else" / default behaviour later, since OpenXML can contain lot of different things for which the required behavior could differ. For AbstractXmlFilter I would like to avoid using Woodstox-specific properties, in case we decide to use another later. If you look in my commit you will see that I have found a implementation-independent solution.
Test failed; org.omegat.filters4.Xliff2FilterTest.testKey(Xliff2FilterTest.java:65)
Yes, the patch I propose does also correct this.
@t-cordonnier I believe I can integrate your #24 change into 980bc62 and 3ebf2e3 . Could you check it works?
@t-cordonnier I believe I can integrate your #24 change into 980bc62 and 3ebf2e3 . Could you check it works?
Hi Hiroshi
I wanted to alert you about the fact that these two features had impact on same classes and ask you to first have a look to #24, probably my fault. I would prefer to see my original commits (today I rebased them from omegat's master) in the history, rather than glued in another feature. It will make easier to refer to them if somebody asks about SRX feature. I am a little bit surprised that it is not the case in your branch since you said that you switched to a "merge commit" strategy. If you don't mind, I would like to push my branch in the master and it should be possible to rebase your branch from it, then we see better which change is about SRX and which one is about Jackson. Then I will do a test with your new branch.
Is it ok for you?
Sorry that I did not tell before.
Regards Thomas
@t-cordonnier OK, I'd like to remove changes to SRX from here. Could you please raise individual PR ?
Ok, this is #538 Once you can push it to master, you can do the chaanges about SRX in #239, but only the part about migration from JAXB to Jackson, thanks.