DOCX Validation reporting 0 errors on invalid document

Open rysavyjan opened this issue 6 years ago • 9 comments

[x] Issue with the OpenXml library

Description

An invalid DOCX document that cannot be opened with Microsoft Word 2019 is reported as valid by the Open XML SDK validator.

Information

  • .NET Target: .NET Framework 4.7.2
  • DocumentFormat.OpenXml Version: 2.9.1

Repro

Create a new valid DOCX document containing a table, then remove the <w:p /> child element from its <w:tc> parent.

[image: wp]

Validation reports 0 errors, while Microsoft Word reports an error when opening the file: [image]

I'm attaching such a corrupted document: word.docx
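
A minimal sketch of how the validation result can be reproduced, assuming the attached file is saved locally as word.docx (the path and the class name are placeholders, not part of the original report). It opens the package read-only and runs the SDK's OpenXmlValidator over it:

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

static class ValidateRepro
{
    static void Main()
    {
        // Placeholder path: the corrupted document attached to this issue.
        const string path = "word.docx";

        // Open read-only; validation does not need write access.
        using (var document = WordprocessingDocument.Open(path, false))
        {
            var validator = new OpenXmlValidator();
            var errors = validator.Validate(document).ToList();

            Console.WriteLine($"Validation errors: {errors.Count}");
            foreach (var error in errors)
            {
                // XPath of the offending element plus the validator's message.
                Console.WriteLine($"{error.Path.XPath}: {error.Description}");
            }
        }
    }
}
```

With version 2.9.1, the count printed for the attached file is the 0 described above.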

rysavyjan avatar May 09 '19 04:05 rysavyjan

As of #603, this is much easier to debug.

It appears that the constraint information is this:

https://raw.githubusercontent.com/OfficeDev/Open-XML-SDK/master/src/DocumentFormat.OpenXml/GeneratedCode/schemas_openxmlformats_org_wordprocessingml_2006_main.g.cs on line 16152 (the file is too big for GitHub to show a link). Here's the code:

private static readonly ParticleConstraint _constraint = new CompositeParticle(ParticleType.Sequence, 1, 1)
{
    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.TableCellProperties), 0, 1),
    new CompositeParticle(ParticleType.Group, 1, 0)
    {
        new CompositeParticle(ParticleType.Choice, 1, 1)
        {
            new CompositeParticle(ParticleType.Group, 0, 0)
            {
                new CompositeParticle(ParticleType.Choice, 1, 1)
                {
                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.AltChunk), 0, 0)
                }
            },
            new CompositeParticle(ParticleType.Group, 0, 0)
            {
                new CompositeParticle(ParticleType.Choice, 1, 1)
                {
                    new CompositeParticle(ParticleType.Group, 0, 0)
                    {
                        new CompositeParticle(ParticleType.Choice, 1, 1)
                        {
                            new CompositeParticle(ParticleType.Group, 0, 0)
                            {
                                new CompositeParticle(ParticleType.Choice, 1, 1)
                                {
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlBlock), 1, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.SdtBlock), 1, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.Paragraph), 0, 0),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.Table), 0, 0)
                                }
                            },
                            new CompositeParticle(ParticleType.Group, 0, 0)
                            {
                                new CompositeParticle(ParticleType.Choice, 1, 1)
                                {
                                    new CompositeParticle(ParticleType.Group, 0, 0)
                                    {
                                        new CompositeParticle(ParticleType.Choice, 1, 1)
                                        {
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.ProofError), 0, 1),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.PermStart), 0, 1),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.PermEnd), 0, 1)
                                        }
                                    },
                                    new CompositeParticle(ParticleType.Group, 0, 0)
                                    {
                                        new CompositeParticle(ParticleType.Choice, 1, 1)
                                        {
                                            new CompositeParticle(ParticleType.Group, 0, 0)
                                            {
                                                new CompositeParticle(ParticleType.Choice, 1, 1)
                                                {
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.BookmarkStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.BookmarkEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CommentRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CommentRangeEnd), 1, 1)
                                                }
                                            },
                                            new CompositeParticle(ParticleType.Group, 0, 0)
                                            {
                                                new CompositeParticle(ParticleType.Choice, 1, 1)
                                                {
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveFromRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveFromRangeEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveToRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveToRangeEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlInsRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlInsRangeEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlDelRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlDelRangeEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlMoveFromRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlMoveFromRangeEnd), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlMoveToRangeStart), 1, 1),
                                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.CustomXmlMoveToRangeEnd), 1, 1)
                                                }
                                            },
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.CustomXmlConflictInsertionRangeStart), 0, 1, version: FileFormatVersions.Office2010),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.CustomXmlConflictInsertionRangeEnd), 0, 1, version: FileFormatVersions.Office2010),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.CustomXmlConflictDeletionRangeStart), 0, 1, version: FileFormatVersions.Office2010),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.CustomXmlConflictDeletionRangeEnd), 0, 1, version: FileFormatVersions.Office2010)
                                        }
                                    },
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.InsertedRun), 0, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.DeletedRun), 0, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveFromRun), 1, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.MoveToRun), 1, 1),
                                    new ElementParticle(typeof(DocumentFormat.OpenXml.Wordprocessing.ContentPart), 0, 0, version: FileFormatVersions.Office2010),
                                    new CompositeParticle(ParticleType.Group, 0, 1, version: FileFormatVersions.Office2010)
                                    {
                                        new CompositeParticle(ParticleType.Sequence, 1, 1)
                                        {
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.RunConflictInsertion), 0, 1, version: FileFormatVersions.Office2010),
                                            new ElementParticle(typeof(DocumentFormat.OpenXml.Office2010.Word.RunConflictDeletion), 0, 1, version: FileFormatVersions.Office2010)
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
};

It appears that the constraint for a Paragraph is from 0 to infinity (0 in the max field means infinity - not sure why). @tomjebo Should this require 1?
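
To illustrate that reading of the (min, max) pair, here is a small sketch of an occurrence check in which a max of 0 is treated as "no upper bound". IsCountAllowed is a hypothetical helper written for this comment, not the SDK's actual implementation:

```csharp
using System;

static class OccurrenceSketch
{
    // Hypothetical helper, not SDK code: max == 0 is read as "unbounded".
    public static bool IsCountAllowed(int count, int min, int max)
        => count >= min && (max == 0 || count <= max);

    static void Main()
    {
        // Paragraph as generated today (min 0, max 0): zero paragraphs pass.
        Console.WriteLine(IsCountAllowed(0, 0, 0)); // True

        // With min raised to 1, zero paragraphs would fail...
        Console.WriteLine(IsCountAllowed(0, 1, 0)); // False

        // ...while any positive number would still pass.
        Console.WriteLine(IsCountAllowed(3, 1, 0)); // True
    }
}
```

Under this interpretation, a particle declared with (0, 0) can never be the source of a "missing element" error.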

twsouthwick avatar Jul 11 '19 18:07 twsouthwick

Stale issue message

github-actions[bot] avatar May 14 '20 00:05 github-actions[bot]

I ran into this same issue recently and eventually traced it back to here. @twsouthwick, are you aware of anything committed to address this? I'm concerned about your suggested fix (change the minimum constraint on Paragraph from 0 to 1) because, according to my reading of the spec, a table cell "must contain at least one block-level element" -- which could be fulfilled by either a paragraph OR a nested table. Not sure how best to represent that in the validation code.

lowellstewart avatar Dec 16 '20 17:12 lowellstewart

I do not know of anything specifically to address this. Reopening for discussion.

twsouthwick avatar Dec 16 '20 17:12 twsouthwick

The first step in identifying what the issue is would be to have a bunch of examples of allowed and disallowed setups. Ideally, these would be in the form of XML snippets. We have the example above, which would give the following:

<w:tc>
    <w:tcPr>
        <w:tcBorders>
            <w:top w:val="double"
                    w:sz="12"
                    w:space="0"
                    w:color="auto" />
            <w:left w:val="double"
                    w:sz="12"
                    w:space="0"
                    w:color="auto" />
        </w:tcBorders>
    </w:tcPr>
    <w:p />
</w:tc>

Where there's an expected error of the w:p element.

More of these will help identify whether the issue is with the validation system or the schema layout (as this schema element is by no means straightforward).

twsouthwick avatar Dec 16 '20 17:12 twsouthwick

Interesting. I tried some experiments, and it appears that Word DOES require a paragraph in every table cell. I constructed the nested table scenario, where in the Word UI the outer table cell appeared to contain ONLY the inner table and no paragraphs. Looking at the XML, there was still an empty paragraph in there. When I removed the empty paragraph, leaving only the nested table, Word gave the same "unreadable content" error on opening that file. It would appear the documentation I was reading was not the most reliable.

So maybe enforcing a minimum count of 1 on paragraphs inside table cells would indeed be a safe fix.

lowellstewart avatar Dec 16 '20 19:12 lowellstewart

In other words, Word opened this without complaint:

<w:tbl>
  <w:tr>
    <w:tc>
      <w:tbl>
        <w:tr>
          <w:tc>
            <w:p/>
          </w:tc>
        </w:tr>
      </w:tbl>
      <w:p/>
    </w:tc>
  </w:tr>
</w:tbl>

... but gave an error on this:

<w:tbl>
  <w:tr>
    <w:tc>
      <w:tbl>
        <w:tr>
          <w:tc>
            <w:p/>
          </w:tc>
        </w:tr>
      </w:tbl>
    </w:tc>
  </w:tr>
</w:tbl>
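
The difference between the two snippets can be expressed as a small standalone check. This is only a sketch built on the two observations above (it flags any <w:tc> with no direct <w:p> child and ignores other cases such as content controls); CellParagraphCheck and CountCellsWithoutParagraph are hypothetical names, and the code uses System.Xml.Linq rather than the SDK's validator:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class CellParagraphCheck
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts <w:tc> elements (at any depth) that have no direct <w:p> child.
    public static int CountCellsWithoutParagraph(XElement root)
        => root.DescendantsAndSelf(W + "tc")
               .Count(tc => !tc.Elements(W + "p").Any());

    static void Main()
    {
        const string ns =
            "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"";

        // First snippet: nested table followed by an empty paragraph.
        var accepted = XElement.Parse(
            $"<w:tbl {ns}><w:tr><w:tc><w:tbl><w:tr><w:tc><w:p/></w:tc></w:tr></w:tbl><w:p/></w:tc></w:tr></w:tbl>");

        // Second snippet: the outer cell contains only the nested table.
        var rejected = XElement.Parse(
            $"<w:tbl {ns}><w:tr><w:tc><w:tbl><w:tr><w:tc><w:p/></w:tc></w:tr></w:tbl></w:tc></w:tr></w:tbl>");

        Console.WriteLine(CountCellsWithoutParagraph(accepted)); // 0
        Console.WriteLine(CountCellsWithoutParagraph(rejected)); // 1
    }
}
```

The outer cell of the second snippet is the only one flagged, which matches the file Word refused to open.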

lowellstewart avatar Dec 16 '20 19:12 lowellstewart

Another interesting case: block-level content controls (<w:sdt>). Word appears to accept, without complaint, a table cell that contains a block-level content control, regardless of whether that content control itself contains a paragraph. And Word accepts a <w:tc> that has a content control "in" it, whether the XML has been structured (as Word does on Save) such that the <w:sdt> is OUTSIDE or INSIDE the <w:tc>.

BUT the correlation between what the existing SDK considers "valid", and what Word will open without complaining, really starts to diverge here. (I don't know how close those two things were to begin with... I assume the validator is GENERALLY more strict than Word is, but I really have no idea.) I have found a couple cases where Word will open things, but the validator code (at least the version of it that I'm running) gives me an error. (LMK if you're interested in these cases, and I can post them--but otherwise they're beyond what I really care about at this point!)

lowellstewart avatar Dec 16 '20 20:12 lowellstewart