htmlquery

Is it supposed to return the body node?

Open zzwx opened this issue 3 years ago • 5 comments

For some reason htmlquery.Find(parse, "/html/body//*") returns the body node too. I've tested that using https://codebeautify.org/Xpath-Tester as well as $x("/html/body//*") in the browser console and it doesn't seem to include body nodes. What am I missing?

zzwx avatar Jan 18 '21 17:01 zzwx

First get the /html/body node, then traverse all children of the body node via FirstChild and NextSibling. https://godoc.org/golang.org/x/net/html#Node

Like the below code:

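// Query returns (node, error); FindOne returns only the node and panics on a bad expression.
body, err := htmlquery.Query(doc, "//body")
if err != nil || body == nil {
	// ... handle an invalid expression or a missing body
}
// Walk the immediate children of body.
for c := body.FirstChild; c != nil; c = c.NextSibling {
	fmt.Println(c)
}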

zhengchun avatar Jan 19 '21 12:01 zhengchun

@zhengchun I honestly don't understand why you closed the issue. I wasn't looking for a workaround. The query specifically says /html/body//* and requests all immediate and non-immediate children of body. So returning the body node itself seems weird, doesn't it?

zzwx avatar Jan 19 '21 14:01 zzwx

Sorry, I thought the question was about getting the child nodes of body. I've reopened it.

zhengchun avatar Jan 20 '21 02:01 zhengchun

I checked again and I understand what you mean. My guess is that /html/body//* is not a fully valid expression in htmlquery. If you want to query all children of body but exclude body itself, you should change it to /html/body/*.

In htmlquery, using // means querying descendant-or-self::*. See https://github.com/antchfx/xpath

a//b : For each node matching a, add the descendant nodes matching b to the result.

In my opinion, an expression like "/a//b" is not a valid query expression.

zhengchun avatar Jan 20 '21 02:01 zhengchun

Since it seems to be valid syntax, which can be confirmed with $x(<expr>) in the browser, is it safe to say that this is a bug? It's really hard to find good XPath documentation with examples, apart from the spec, which seems to say that both / and // are evaluated relative to the context selected by the preceding part of the path.

That seems to mean that /html/body//* locates all descendants of /html/body, not just its immediate children.

zzwx avatar Jan 20 '21 03:01 zzwx

Solved. https://github.com/antchfx/xpath/commit/ed2f6ee6ff2a4066b9ea6b35db2868d4d2154cb5

zhengchun avatar Apr 01 '24 08:04 zhengchun