Open-XML-SDK icon indicating copy to clipboard operation
Open-XML-SDK copied to clipboard

Misc element properties are not lining up with their ECMA definitions

Open rmboggs opened this issue 5 years ago • 14 comments

Before submitting an issue, please fill this out

Is this a:

  • [X] Issue with the OpenXml library
  • [ ] Question on library usage

Description

The DocumentFormat.OpenXml.WordProcessing.OnOffOnlyValues enum only contains values On and Off but if I am reading the ECMA-376 standard correctly, properties with this value type may contain values On, Off, 0, 1, true, and false. When values, such as true or false, are present for an xml attribute for this type, the Value property of the EnumValue<OnOffOnlyType> object throws a System.FormatException exception.

The attached document contains an element /document[0]/tbl[0]/tr[0]/trPr[0]/cantSplit which contains the property in question. file-sample_500kB.docx

I'm open to discuss this further but it doesn't seem like this should be the correct behavior.

Information

  • .NET Target: .NET Framework 4.7.2
  • DocumentFormat.OpenXml Version: 2.11.3

Repro

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System;
using System.IO;
using System.Linq;

namespace ConsoleApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            // Change  to path of test file
            const string pathToFile = @"C:\path\to\file-sample_500kB.docx";

            using (var f = new FileStream(pathToFile, FileMode.Open))
            {
                using (var docx = WordprocessingDocument.Open(f, false))
                {
                    var sample = docx.MainDocumentPart.Document.Descendants<CantSplit>().FirstOrDefault();

                    if (sample != null)
                    {
                        var val = sample.Val;
                        Console.WriteLine($"Sample has value: {val.HasValue}");
                        Console.WriteLine($"Sample inner text: {val.InnerText}");

                        try
                        {
                            Console.WriteLine($"Sample value: {val.Value}");
                        }
                        catch (Exception x)
                        {
                            Console.WriteLine($"Sample value threw {x.GetType().Name}");
                            Console.WriteLine($"Sample value exception message: {x.Message}");
                        }
                    }
                }
            }
            Console.WriteLine("Press any key to quit");
            Console.ReadKey(true);
        }
    }
}

Observed

Sample has value: False
Sample inner text: false
Sample value threw FormatException
Sample value exception message: The text value is not a valid enumeration value.
Press any key to quit

Expected

Sample has value: True
Sample inner text: false
Sample value: DocumentFormat.OpenXml.Wordprocessing.OnOffOnlyValues.False or DocumentFormat.OpenXml.Wordprocessing.OnOffOnlyValues.Off
Press any key to quit

rmboggs avatar Jul 21 '20 13:07 rmboggs

FYI - I caught this while testing out the Serialize.OpenXml.CodeGen library.

rmboggs avatar Jul 21 '20 14:07 rmboggs

Interesting. @tomjebo can you take a look at the spec and weigh in here?

twsouthwick avatar Jul 22 '20 02:07 twsouthwick

@rmboggs Can you share the section of the spec you were looking at?

twsouthwick avatar Jul 22 '20 02:07 twsouthwick

I was referencing the first revision of the open xml ecma standard located here. Pdf Page 308 - cantSplit definition which references the possible values defined by the ST_OnOff simple type Pdf Page 1786 - ST_OnOff possible values which include the values I listed in the initial issue report above.

Please let me know if there are more questions.

rmboggs avatar Jul 22 '20 04:07 rmboggs

Actually, I dug a bit deeper here as there are other elements that also have the same possible values. DocumentFormat.OpenXml.WordProcessing.Caps for example. It has the same set of possible values as the DocumentFormat.OpenXml.WordProcessing.CantSplit but they inherit from different base classes. Caps inherits from DocumentFormat.OpenXml.Wordprocessing.OnOffType while CantSplit inherits from DocumentFormat.OpenXml.Wordprocessing.OnOffOnlyType. Since it appears that OnOffType offers the appropriate values for this kind of element, I believe that changing CantSplit to inherit from OnOffType would solve this issue rather than rewrite the OnOffOnlyType class. I'm hoping this won't be a shocking/breaking change.

rmboggs avatar Jul 22 '20 13:07 rmboggs

My gut says that's a breaking change. I'll dig into it too see. We'll probably do a v3 soon and can handle any breaking changes then

twsouthwick avatar Jul 23 '20 03:07 twsouthwick

It makes sense to be cautious. I don't think this is urgent since the cantSplit class has been set this way for quite some time and the difference is being noticed now, in a non production environment. I have an idea how to get around this in my project so I should be ok. In the meantime, would it be ok to label this for the 3.0 release so it doesn't get lost.

rmboggs avatar Jul 23 '20 04:07 rmboggs

@twsouthwick @rmboggs I think we have a mistake in the schema processing (backend). cantSplit looks like it was defined in our backend processing as CT_OnOffOnly which is not correct based on the ISO definition or what I see in the Office source schemas. I see it actually associated with both CT_OnOffOnly and CT_OnOff in our backend code and the former looks like it's overriding the latter. I think this is wrong. I'm not sure if correcting this to CT_OnOff would be breaking as it would be expanding the possible values, not taking any away. I'll need to check the other elements that are set to this type as well.

tomjebo avatar Jul 24 '20 05:07 tomjebo

Thanks @tomjebo. Please let me know if you need more information on this.

rmboggs avatar Jul 24 '20 13:07 rmboggs

This is definitely a v3.0 change as it will be a breaking change. We'll track this and get that in.

twsouthwick avatar Jul 24 '20 17:07 twsouthwick

Thanks guys, I have a work around for my stuff in the meantime.

It's not the prettiest work around but it will do.

rmboggs avatar Jul 25 '20 15:07 rmboggs

Hi,

I'm finding other non-enumvalue type properties that do not seem to be matching up to their values in some of the sample documents that I am testing against. Since they seem to be related to schema mismatches at first glance similar to this, should I append what I find to this issue or create a new one to keep things clean? Please let me know.

rmboggs avatar Jul 27 '20 16:07 rmboggs

Yes, let's make this an uber-issue for those kind of issues. Then we can track it for v3.0

twsouthwick avatar Jul 27 '20 19:07 twsouthwick

Ok, let me see if I can modify the title to be more broad in a bit.

In the meantime, I'm seeing more inconsistencies in the sample docx file mentioned at the beginning of this issue when compared to the ecma standard. First one is for the DocGrid.CharacterSpace property. The property type is set to Int32Value but the sample document attached has 4294961151 as its value in the SectionProperties element, which is out of range for that type.. When I check the ecma standard, it says the values for this type should be a positive/negative decimal number (defined in 2.18.16 ST_DecimalNumber (Decimal Number Value)) which is pretty vague in terms of what the minimum/maximum values should be. On the flip side, DivId.Val, which has the same definition as the DocGrid.CharacterSpace property in the ecma standard (2.18.16 ST_DecimalNumber (Decimal Number Value)), is setup as a StringValue type in the SDK. I have yet to check out the actual code for these two properties but if they are generated the same way as the CantSplit type, then there is another issue with the generation process. Chances are that the disconnect between the two properties is due to the ambiguity of the ecma schema definition but it should be looked at, imho.

Please let me know if more details are needed for this.

rmboggs avatar Jul 27 '20 20:07 rmboggs