goquery

HTML entity escape error

Open WebFeng opened this issue 2 years ago • 1 comments

doc, _ := goquery.NewDocumentFromReader(strings.NewReader("<body><!--<p></p>&lt!--[video]--&gt--></body>"))
fmt.Println(doc.Html())

// want  <html><body><!--<p></p>&lt!--[video]--&gt--></body></html> <nil>
// got <html><head></head><body><!--<p></p><!--[video]-->--></body></html> <nil>

WebFeng, Sep 07 '21 12:09

Hello,

Good catch, it does look like comments are not escaped when rendered. I'm not 100% sure whether this is a bug or working as intended per the HTML5 spec: once the comment text has been unescaped, it's impossible to tell that the <p></p> part should stay unescaped while the <!--[video]--> part should be escaped.

Either way, this is a (potential) bug that should be reported to https://github.com/golang/go/issues?q=is%3Aissue+x%2Fnet%2Fhtml (for the x/net/html package). goquery does not parse or render HTML itself; it uses the golang.org/x/net/html package for this (an HTML5 parser implemented by the Go team, although not part of the stdlib).
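To illustrate the ambiguity, here is a minimal sketch that uses only the standard library's html package as a stand-in (it is not the x/net/html tokenizer, and roundTrip is a hypothetical helper). It uses the `&lt;` and `&gt;` forms with semicolons rather than the exact input from the report. Two different source strings unescape to the same data, so escaping that data again cannot recover which one was written:

```go
package main

import (
	"fmt"
	"html"
)

// roundTrip unescapes s and then escapes the result again.
// Hypothetical helper for illustration; not part of goquery or x/net/html.
func roundTrip(s string) string {
	return html.EscapeString(html.UnescapeString(s))
}

func main() {
	// Only the inner comment markers are escaped.
	partly := "<p></p>&lt;!--[video]--&gt;"
	// Every angle bracket is escaped.
	fully := "&lt;p&gt;&lt;/p&gt;&lt;!--[video]--&gt;"

	// Both inputs unescape to the same string.
	fmt.Println(html.UnescapeString(partly) == html.UnescapeString(fully)) // true
	fmt.Printf("%q\n", html.UnescapeString(partly))                        // "<p></p><!--[video]-->"

	// Re-escaping cannot restore the partly escaped form.
	fmt.Println(roundTrip(partly) == partly) // false
	fmt.Println(roundTrip(partly) == fully)  // true
}
```

The information about which characters were originally escaped is lost at the unescape step, which is why this has to be decided in the parser or renderer rather than in goquery.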

This program reproduces the issue with the net/html package directly, without using goquery:

package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	root, err := html.Parse(strings.NewReader("<body><!--<p></p>&lt!--[video]--&gt--></body>"))
	if err != nil {
		log.Fatal(err)
	}

	c := findCommentNode(root)
	fmt.Printf("Comment Node Type: %d; Data: %q\n\n", c.Type, c.Data)

	fmt.Println("html.Render:")
	if err := html.Render(os.Stdout, c); err != nil {
		log.Fatal(err)
	}
	fmt.Println()
}

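// findCommentNode returns the first comment node found in a depth-first
// walk of the tree rooted at n, or nil if there is none.
func findCommentNode(n *html.Node) *html.Node {
	if n.Type == html.CommentNode {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if found := findCommentNode(c); found != nil {
			return found
		}
	}
	return nil
}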

// Prints:
// Comment Node Type: 4; Data: "<p></p><!--[video]-->"
//
// html.Render:
// <!--<p></p><!--[video]-->-->
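The rendered output is a problem because the first `-->` inside the data closes the comment early and leaves a stray `-->` behind it. As a rough sketch (commentBreaksOut is a hypothetical helper, not something goquery or x/net/html provides, and it only checks the two closing sequences), calling code could detect such data before rendering:

```go
package main

import (
	"fmt"
	"strings"
)

// commentBreaksOut reports whether data, written verbatim between "<!--" and
// "-->", would close the comment before its intended end.
// Hypothetical helper for illustration only.
func commentBreaksOut(data string) bool {
	return strings.Contains(data, "-->") || strings.Contains(data, "--!>")
}

func main() {
	// The comment data printed by the reproduction program.
	fmt.Println(commentBreaksOut("<p></p><!--[video]-->")) // true
	// Data without a closing sequence is safe to write as-is.
	fmt.Println(commentBreaksOut("<p></p>")) // false
}
```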

If you'd be so kind as to either link back to this issue from the net/html one, or just post the link to that issue in a comment here, I'll keep this open until a fix or decision is made in the net/html repo.

Thanks, Martin

mna, Sep 07 '21 23:09

This should be fixed in https://github.com/golang/net/commit/06994584191ebed30077b5176cefe09703557528. Can goquery update its pinned dependency on x/net to >= v0.1.0?

bensie, Nov 15 '22 01:11

I updated goquery's go.mod file to use the latest x/net dependency, but AFAIK it shouldn't matter: the version that gets built is resolved from the main module's go.mod, and a dependency's go.mod only sets a minimum required version (that is, you can use the latest version of golang.org/x/net even if goquery's go.mod is not updated).
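As a rough illustration, the go.mod of an application using goquery could require the newer x/net itself. The module path, Go version and goquery version below are placeholders; v0.1.0 is the minimum x/net version mentioned earlier in this thread:

```
// Hypothetical main module; replace the path with your own.
module example.com/myapp

go 1.19

require (
	// Assumed goquery version; keep whichever one you already use.
	github.com/PuerkitoBio/goquery v1.8.0
	// Raised in the main module, independently of goquery's own go.mod.
	golang.org/x/net v0.1.0
)
```

Running `go get golang.org/x/net@latest` in the application's module should produce an equivalent requirement.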

mna, Nov 17 '22 02:11

Closing now that the x/net fix is merged.

mna, Nov 17 '22 02:11