omegat
omegat copied to clipboard
[BUGS#1038] MsOfficeFileFilter does not pass the regression test case posted in BUGS#1038
Add reproducible test case for BUGS#1038
Pull request type
Please mark github LABEL of the type of change your PR introduces:
- Bug fix -> [bug]
Which ticket is resolved?
- Bugs: https://sourceforge.net/p/omegat/bugs/1038/
- Option "Remove leading and trailing tags" removes tags that are not a tag pair
What does this PR change?
- Add TWO test cases with the same source file
- ~~org.omegat.filters.OpenXMLFilterTest#testParseTags~~ -> unconditionally ignored/skipped
- org.omegat.filters4.OpenXMLFilterTest#testParseTags
Other information
The test data which posted in bug ticket contains document.xml
like;
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="w14 wp14">
<w:body>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr> <w:b/> <w:b/> <w:bCs/> </w:rPr> </w:pPr>
<w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow has a black cat.</w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
<w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow</w:t> </w:r>
<w:r> <w:rPr></w:rPr> <w:t xml:space="preserve"> has a </w:t>
</w:r> <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>black cat.</w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
<w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow</w:t> </w:r>
<w:r> <w:rPr></w:rPr> <w:t xml:space="preserve"> has a black cat.</w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
</w:p>
<w:p> <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
<w:r> <w:rPr></w:rPr> <w:t xml:space="preserve">The black widow has a </w:t> </w:r>
<w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>black cat.</w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
<w:r> <w:rPr></w:rPr> </w:r> </w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr> <w:b/> <w:b/> <w:bCs/> </w:rPr> </w:pPr>
<w:r>
<w:rPr> <w:b/> <w:bCs/> </w:rPr>
<w:t xml:space="preserve">The black widow has a </w:t> </w:r>
<w:r>
<w:rPr> <w:b/> <w:bCs/> <w:i/> <w:iCs/> </w:rPr> <w:t>black</w:t>
</w:r>
<w:r>
<w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t xml:space="preserve"> cat.</w:t>
</w:r>
</w:p>
<w:sectPr>
<w:type w:val="nextPage"/>
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:left="1134" w:right="1134" w:header="0" w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0"/>
<w:pgNumType w:fmt="decimal"/>
<w:formProt w:val="false"/>
<w:textDirection w:val="lrTb"/>
</w:sectPr>
</w:body>
</w:document>
(indent by @miurahr)
@t-cordonnier You can find an assertion error such as
Expect: [file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<p0>The black widow</p0> has a <p1>black cat.</p1>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a black cat.']
Actual:[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<e0/>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a <p1>black cat.</p1>']
There may be an unexpected source='<e0/>'
produced by MsOfficeFileFilter.
@didierbr Could you check expectations as a bug ticket follower?
See Thomas comment in the bug report.
See Thomas comment in the bug report.
now ignore the test on old filter.
@t-cordonnier it still be failed with new MsOfficeFileFilter
CI error summary is here https://dev.azure.com/omegat-org/OmegaT/_build/results?buildId=6815&view=ms.vss-test-web.build-test-results-tab&runId=1004201&resultId=100327&paneView=debug
For convenience, copied to here
java.lang.AssertionError:
expected:<[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<p0>The black widow</p0> has a <p1>black cat.</p1>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a black cat.']>
but was:<[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<e0/>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a <p1>black cat.</p1>']>
expected:
<[...] source='<p0>The black widow</p0> has a <p1>black cat.</p1>',
but was:<[...] source='<e0/>',
Sounds logical to me. In line 113 you expect the next segment to be <e0/>
but in line 114 you expect that this segment with <e0/>
has been bypassed
expected:<[...] source='The black widow has a black cat.', but was:<[...] source='',
Sounds logical to me. In line 113 you expect the next segment to be
<e0/>
but in line 114 you expect that this segment with<e0/>
has been bypassed
@t-cordonnier
It is good things that sound logical to you.
My expectation is <p0>The black widow</p0> has a <p1>black cat.</p1>
You think it is logical that MsOfficeFileFilter produce <e0/>
. So what?
Is an expectation come from bug report wrong? Or any potential defeat in the filter?
The 3 first segments in your document look like this:
-
The black widow has a black cat.
-
<e0/>
-
<p0>The black widow</p0> has a <p1>black cat.</p1>
At line 113 you check segment 1, preceeded by null and followed by <e0/>
which is correct.
But at line 114 you expect that the next segment contains <p0>The black widow</p0> has a <p1>black cat.</p1>
Between line 113 and 114 you forgot to test segment 2! So, your test is comparing segment 2 with 3, and of course they are not equal!
In case you ask yourself where segment 2 comes from, look at your XML : each <w:p>
will generate one segment, even if the second one contains no text.
On the other hand, you should have noticed that in line 113 next = <e0/>
so the following segment to be tested could not be anything else than <e0/>
Now to be complete I suggest that your test also checks the following cases:
- A paragraph
<w:p>
containing multiple<w:r>
with text but no<w:rPr>
- A paragraph
<w:p>
containing multiple<w:r>
with text but empty<w:rPr>
- A paragraph
<w:p>
containing multiple<w:r>
whose<w:rPr>
are identical - A paragraph
<w:p>
containing multiple<w:r>
whose first and last<w:rPr>
are identical, but a different one in the middle
I don't know in which circonstances Word can produce these cases, but they are not forbidden by the specification. Normally the filter does some tag compression in the 3 first cases, good occasion to test.
Now to be complete I suggest that your test also checks the following cases:
- A paragraph
<w:p>
containing multiple<w:r>
with text but no<w:rPr>
- A paragraph
<w:p>
containing multiple<w:r>
with text but empty<w:rPr>
- A paragraph
<w:p>
containing multiple<w:r>
whose<w:rPr>
are identical- A paragraph
<w:p>
containing multiple<w:r>
whose first and last<w:rPr>
are identical, but a different one in the middleI don't know in which circonstances Word can produce these cases, but they are not forbidden by the specification. Normally the filter does some tag compression in the 3 first cases, good occasion to test.
The test file is contributed from the bug reporter. I don't know how to produce these cases to be tested. @t-cordonnier could you add the test case you suggested? And also please fix the filter.