omegat [BUGS#1038] MsOfficeFileFilter does not pass the regression test case posted in BUGS#1038

Add reproducible test case for BUGS#1038

Pull request type

Please mark github LABEL of the type of change your PR introduces:

Bug fix -> [bug]

Which ticket is resolved?

Bugs: https://sourceforge.net/p/omegat/bugs/1038/
Option "Remove leading and trailing tags" removes tags that are not a tag pair

What does this PR change?

Add TWO test cases with the same source file
- ~~org.omegat.filters.OpenXMLFilterTest#testParseTags~~ -> unconditionally ignored/skipped
- org.omegat.filters4.OpenXMLFilterTest#testParseTags

Other information

Nov 07 '23 23:11 miurahr

The test data which posted in bug ticket contains document.xml like;

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        mc:Ignorable="w14 wp14">
    <w:body>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr> <w:b/> <w:b/> <w:bCs/> </w:rPr> </w:pPr>
            <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow has a black cat.</w:t> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
            <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow</w:t> </w:r>
            <w:r> <w:rPr></w:rPr> <w:t xml:space="preserve"> has a </w:t>
            </w:r> <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>black cat.</w:t> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
            <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>The black widow</w:t> </w:r>
            <w:r> <w:rPr></w:rPr> <w:t xml:space="preserve"> has a black cat.</w:t> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr> <w:r> <w:rPr></w:rPr> </w:r>
        </w:p>
        <w:p> <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
            <w:r> <w:rPr></w:rPr> <w:t xml:space="preserve">The black widow has a </w:t> </w:r>
            <w:r> <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t>black cat.</w:t> </w:r>
        </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr></w:rPr> </w:pPr>
            <w:r> <w:rPr></w:rPr> </w:r> </w:p>
        <w:p>
            <w:pPr> <w:pStyle w:val="Normal"/> <w:bidi w:val="0"/> <w:jc w:val="start"/> <w:rPr> <w:b/> <w:b/> <w:bCs/> </w:rPr> </w:pPr>
            <w:r>
                <w:rPr> <w:b/> <w:bCs/> </w:rPr>
                <w:t xml:space="preserve">The black widow has a </w:t> </w:r>
            <w:r>
                <w:rPr> <w:b/> <w:bCs/> <w:i/> <w:iCs/> </w:rPr> <w:t>black</w:t>
            </w:r>
            <w:r>
                <w:rPr> <w:b/> <w:bCs/> </w:rPr> <w:t xml:space="preserve"> cat.</w:t>
            </w:r>
        </w:p>
        <w:sectPr>
            <w:type w:val="nextPage"/>
            <w:pgSz w:w="11906" w:h="16838"/>
            <w:pgMar w:left="1134" w:right="1134" w:header="0" w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0"/>
            <w:pgNumType w:fmt="decimal"/>
            <w:formProt w:val="false"/>
            <w:textDirection w:val="lrTb"/>
        </w:sectPr>
    </w:body>
</w:document>

(indent by @miurahr)

Nov 08 '23 00:11 miurahr

@t-cordonnier You can find an assertion error such as

Expect: [file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<p0>The black widow</p0> has a <p1>black cat.</p1>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a black cat.']
Actual:[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<e0/>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a <p1>black cat.</p1>']

There may be an unexpected source='<e0/>' produced by MsOfficeFileFilter.

Nov 08 '23 00:11 miurahr

@didierbr Could you check expectations as a bug ticket follower?

Nov 08 '23 00:11 miurahr

See Thomas comment in the bug report.

Nov 08 '23 14:11 didierbr

See Thomas comment in the bug report.

now ignore the test on old filter.

@t-cordonnier it still be failed with new MsOfficeFileFilter

CI error summary is here https://dev.azure.com/omegat-org/OmegaT/_build/results?buildId=6815&view=ms.vss-test-web.build-test-results-tab&runId=1004201&resultId=100327&paneView=debug

Nov 09 '23 03:11 miurahr

For convenience, copied to here

java.lang.AssertionError: 
expected:<[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<p0>The black widow</p0> has a <p1>black cat.</p1>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a black cat.']> 
but was:<[file:test/data/filters/openXML/file-OpenXMLFilter-tags.docx, id=null, path=null, source='<e0/>', prev='The black widow has a black cat.', next='<p0>The black widow</p0> has a <p1>black cat.</p1>']>

Nov 10 '23 01:11 miurahr

expected:<[...] source='<p0>The black widow</p0> has a <p1>black cat.</p1>', but was:<[...] source='<e0/>',

Sounds logical to me. In line 113 you expect the next segment to be <e0/> but in line 114 you expect that this segment with <e0/> has been bypassed

Nov 10 '23 06:11 t-cordonnier

expected:<[...] source='The black widow has a black cat.', but was:<[...] source='',

Sounds logical to me. In line 113 you expect the next segment to be <e0/> but in line 114 you expect that this segment with <e0/> has been bypassed

@t-cordonnier It is good things that sound logical to you. My expectation is <p0>The black widow</p0> has a <p1>black cat.</p1> You think it is logical that MsOfficeFileFilter produce <e0/> . So what?

Nov 10 '23 10:11 miurahr

Is an expectation come from bug report wrong? Or any potential defeat in the filter?

Nov 11 '23 02:11 miurahr

The 3 first segments in your document look like this:

The black widow has a black cat.
<e0/>
<p0>The black widow</p0> has a <p1>black cat.</p1>

At line 113 you check segment 1, preceeded by null and followed by <e0/> which is correct. But at line 114 you expect that the next segment contains <p0>The black widow</p0> has a <p1>black cat.</p1> Between line 113 and 114 you forgot to test segment 2! So, your test is comparing segment 2 with 3, and of course they are not equal!

In case you ask yourself where segment 2 comes from, look at your XML : each <w:p> will generate one segment, even if the second one contains no text. On the other hand, you should have noticed that in line 113 next = <e0/> so the following segment to be tested could not be anything else than <e0/>

Nov 11 '23 07:11 t-cordonnier

Now to be complete I suggest that your test also checks the following cases:

A paragraph <w:p> containing multiple <w:r> with text but no <w:rPr>
A paragraph <w:p> containing multiple <w:r> with text but empty <w:rPr>
A paragraph <w:p> containing multiple <w:r> whose <w:rPr> are identical
A paragraph <w:p> containing multiple <w:r> whose first and last <w:rPr> are identical, but a different one in the middle

I don't know in which circonstances Word can produce these cases, but they are not forbidden by the specification. Normally the filter does some tag compression in the 3 first cases, good occasion to test.

Nov 11 '23 07:11 t-cordonnier

Now to be complete I suggest that your test also checks the following cases:

A paragraph <w:p> containing multiple <w:r> with text but no <w:rPr>

A paragraph <w:p> containing multiple <w:r> with text but empty <w:rPr>

A paragraph <w:p> containing multiple <w:r> whose <w:rPr> are identical

A paragraph <w:p> containing multiple <w:r> whose first and last <w:rPr> are identical, but a different one in the middle

I don't know in which circonstances Word can produce these cases, but they are not forbidden by the specification. Normally the filter does some tag compression in the 3 first cases, good occasion to test.

The test file is contributed from the bug reporter. I don't know how to produce these cases to be tested. @t-cordonnier could you add the test case you suggested? And also please fix the filter.

Nov 15 '23 15:11 miurahr

omegat omegat copied to clipboard

[BUGS#1038] MsOfficeFileFilter does not pass the regression test case posted in BUGS#1038

Pull request type

Which ticket is resolved?

What does this PR change?

Other information

omegat
omegat copied to clipboard