go-xml

Unable to unmarshal embedded xhtml

Open simar7 opened this issue 3 years ago • 3 comments

hi! – first off, thanks for the amazing software. It really does help a lot!

I just had a quick question about an XML I was trying to unmarshal using the generated structs.

Most of the document gets parsed just fine except for the embedded XHTML elements. One example would be something like the following:

<Extended_Description>
    <xhtml:p>Such a scenario is commonly observed when:</xhtml:p>
    <xhtml:ol>
        <xhtml:li>A web application authenticates a user without first invalidating the existing session, thereby continuing to use the session already associated with the user.</xhtml:li>
        <xhtml:li>An attacker is able to force a known session identifier on a user so that, once the user authenticates, the attacker has access to the authenticated session.</xhtml:li>
        <xhtml:li>The application or container uses predictable session identifiers. In the generic exploit of session fixation vulnerabilities, an attacker creates a new session on a web application and records the associated session identifier. The attacker then causes the victim to associate, and possibly authenticate, against the server using that session identifier, giving the attacker access to the user's account through the active session.</xhtml:li>
    </xhtml:ol>
</Extended_Description>

In this case, the output after unmarshaling contains only the text of the <p> elements. The contents of the <ol> and <li> elements don't show up.

Here's the schema if it would help: https://cwe.mitre.org/data/xsd/cwe_schema_latest.xsd

Let me know if you need any more information or if you'd like me to try something. If you could point out the code involved in generating this, I'd be happy to look through it and submit a PR if I manage to solve it.

simar7 avatar Jul 30 '20 05:07 simar7

Some more info: These are the generated structs from the schema.

// The StructuredTextType complex type is used to allow XHTML content embedded within standard string data. Some common elements are: <BR/> to insert a line break, <UL><LI/></UL> to create a bulleted list, <OL><LI/></OL> to create a numbered list, and <DIV style="margin-left: 40px"></DIV> to create a new indented section.
type StructuredTextType []string

func (a StructuredTextType) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	var output struct {
		ArrayType string   `xml:"http://schemas.xmlsoap.org/wsdl/ arrayType,attr"`
		Items     []string `xml:" item"`
	}
	output.Items = []string(a)
	start.Attr = append(start.Attr, xml.Attr{Name: xml.Name{Space: " ", Local: "xmlns:ns1"}, Value: "http://www.w3.org/2001/XMLSchema"})
	output.ArrayType = "ns1:anyType[]"
	return e.EncodeElement(&output, start)
}
func (a *StructuredTextType) UnmarshalXML(d *xml.Decoder, start xml.StartElement) (err error) {
	var tok xml.Token
	for tok, err = d.Token(); err == nil; tok, err = d.Token() {
		if tok, ok := tok.(xml.StartElement); ok {
			var item string
			if err = d.DecodeElement(&item, &tok); err == nil {
				*a = append(*a, item)
			}
		}
		if _, ok := tok.(xml.EndElement); ok {
			break
		}
	}
	return err
}
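
For anyone wanting to reproduce this, here is a minimal sketch that runs the generated UnmarshalXML above over a shortened sample. The sample text and the helpers decodeSample and containsListText are made up for illustration. My understanding is that decoding an element into a plain string keeps only the character data directly inside that element and skips nested elements, which would explain why the <ol> entry ends up holding only whitespace.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// StructuredTextType and its UnmarshalXML are copied from the generated code above.
type StructuredTextType []string

func (a *StructuredTextType) UnmarshalXML(d *xml.Decoder, start xml.StartElement) (err error) {
	var tok xml.Token
	for tok, err = d.Token(); err == nil; tok, err = d.Token() {
		if tok, ok := tok.(xml.StartElement); ok {
			var item string
			if err = d.DecodeElement(&item, &tok); err == nil {
				*a = append(*a, item)
			}
		}
		if _, ok := tok.(xml.EndElement); ok {
			break
		}
	}
	return err
}

// sample is a shortened, hypothetical stand-in for the CWE entry in the report.
const sample = `<Extended_Description xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:p>Such a scenario is commonly observed when:</xhtml:p>
    <xhtml:ol>
        <xhtml:li>first item</xhtml:li>
        <xhtml:li>second item</xhtml:li>
    </xhtml:ol>
</Extended_Description>`

// decodeSample runs the generated decoder over the sample document.
func decodeSample() (StructuredTextType, error) {
	var text StructuredTextType
	err := xml.Unmarshal([]byte(sample), &text)
	return text, err
}

// containsListText reports whether any decoded item kept text from an <li>.
func containsListText(items StructuredTextType) bool {
	for _, s := range items {
		if strings.Contains(s, "item") {
			return true
		}
	}
	return false
}

func main() {
	text, err := decodeSample()
	if err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}
	// Print every decoded item quoted, so whitespace-only entries are visible.
	for i, item := range text {
		fmt.Printf("item %d: %q\n", i, item)
	}
	fmt.Println("list text preserved:", containsListText(text))
}
```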

simar7 avatar Jul 30 '20 05:07 simar7

Thanks for the report.

The StructuredTextType type from your schema has the mixed=true attribute, so xsdgen should be generating a struct with a ,chardata field instead of what you got. Something like

type StructuredTextType struct {
    Value string `xml:",chardata"`
}

I'm not sure why that's not happening. My guess is that the SOAPArrayAsSlice pass is accidentally removing the mixed content model. What happens if you don't include this optimization? You can see the "Customizing the behavior of xsdgen" section of https://blog.aqwari.net/xml-schema-go/ for instructions on how to choose what optimizations are included.

If that's the issue, we could add a check to this optimization for a mixed content model.
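
As a side note, a ,chardata field on its own would probably still not surface the list items, since encoding/xml only puts the character data directly inside the element into such a field. Below is a rough, hand-written sketch (not xsdgen output) that pairs it with an ,innerxml field to keep the raw markup. The type name MixedText, the helper functions, and the shortened sample are assumptions for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// MixedText is a hypothetical hand-written type; xsdgen does not generate it.
type MixedText struct {
	Value    string `xml:",chardata"` // character data directly inside the element
	InnerXML string `xml:",innerxml"` // raw nested markup, tags included
}

// sample is a shortened, hypothetical stand-in for the CWE entry in the report.
const sample = `<Extended_Description xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:p>Such a scenario is commonly observed when:</xhtml:p>
    <xhtml:ol>
        <xhtml:li>first item</xhtml:li>
        <xhtml:li>second item</xhtml:li>
    </xhtml:ol>
</Extended_Description>`

// decodeMixed unmarshals a document into MixedText.
func decodeMixed(data string) (MixedText, error) {
	var m MixedText
	err := xml.Unmarshal([]byte(data), &m)
	return m, err
}

// keepsListMarkup reports whether the raw inner XML still holds the <li> text.
func keepsListMarkup(m MixedText) bool {
	return strings.Contains(m.InnerXML, "first item") &&
		strings.Contains(m.InnerXML, "second item")
}

// directTextIsBlank reports whether the chardata field holds only whitespace.
func directTextIsBlank(m MixedText) bool {
	return strings.TrimSpace(m.Value) == ""
}

func main() {
	m, err := decodeMixed(sample)
	if err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}
	fmt.Printf("chardata: %q\n", m.Value)
	fmt.Printf("innerxml: %s\n", m.InnerXML)
}
```

The trade-off is that InnerXML is an unparsed string, so any consumer that wants the individual list items would have to parse it again.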

droyo avatar Jul 31 '20 02:07 droyo

hi @droyo – So I gave that a try. This was my generator program:

package main

import (
	"log"
	"os"

	"aqwari.net/xml/xsdgen"
)

func main() {
	var cfg xsdgen.Config
	cfg.Option(
		xsdgen.LogOutput(log.New(os.Stderr, "", 0)),
		xsdgen.IgnoreElements("virtual", "sequence"))

	cfg.Option([]xsdgen.Option{
		xsdgen.IgnoreAttributes("id", "href", "ref", "offset"),
		xsdgen.Replace(`[._ \s-]`, ""),
		xsdgen.PackageName("ws"),
		xsdgen.HandleSOAPArrayType(),
		xsdgen.UseFieldNames(),
	}...)

	if err := cfg.GenCLI(os.Args[1:]...); err != nil {
		log.Fatal(err)
	}
}

But the result I got for StructuredTextType was the following:

type StructuredTextType struct {
	Items []string `xml:",any"`
}

There were no corresponding MarshalXML or UnmarshalXML methods.

Is this expected? What could I look into next? Let me know if there's anything else I can try here.
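
In case it helps narrow things down: as far as I can tell, a []string with ,any runs into the same limitation, because each child element is decoded into a plain string and its nested elements are skipped. One possible hand-written workaround, sketched below, is a recursive node type that uses ,any at every level. Node, StructuredText, decodeTree, and the shortened sample are hypothetical names for illustration, not something xsdgen generates.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Node is a hypothetical recursive type that captures any element,
// its direct text, and its child elements.
type Node struct {
	XMLName  xml.Name
	Text     string `xml:",chardata"`
	Children []Node `xml:",any"`
}

// StructuredText is a hand-written replacement for the generated type.
type StructuredText struct {
	Nodes []Node `xml:",any"`
}

// sample is a shortened, hypothetical stand-in for the CWE entry in the report.
const sample = `<Extended_Description xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xhtml:p>Such a scenario is commonly observed when:</xhtml:p>
    <xhtml:ol>
        <xhtml:li>first item</xhtml:li>
        <xhtml:li>second item</xhtml:li>
    </xhtml:ol>
</Extended_Description>`

// decodeTree unmarshals a document into the recursive structure.
func decodeTree(data string) (StructuredText, error) {
	var st StructuredText
	err := xml.Unmarshal([]byte(data), &st)
	return st, err
}

// walk prints each element name with its direct text, indented by depth.
func walk(nodes []Node, depth int) {
	for _, n := range nodes {
		fmt.Printf("%*s<%s> %q\n", depth*2, "", n.XMLName.Local, n.Text)
		walk(n.Children, depth+1)
	}
}

func main() {
	st, err := decodeTree(sample)
	if err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}
	walk(st.Nodes, 0)
}
```

One caveat with this shape: for truly mixed content, Text concatenates all direct character data, so the position of text relative to child elements is lost.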

simar7 avatar Oct 10 '20 03:10 simar7