goquery

Function to find the selector of a node

Open • sundy-li opened this issue 6 years ago • 5 comments

  // node is a descendant *html.Node of doc
  // if ok, sel is the selector string, such as `.sidebar-reviews article .content-block a`
  sel, ok := doc.FindSelector(node)

Would a function like this be possible?

sundy-li, Aug 09 '17 03:08

Hello,

It could be done, but there can be many valid selector strings for a given node, and there's no guarantee that a given selector would be unique (that is, the selector could return many matches, not just the one for that specific node). I guess it could be made unique by adding :nth-child pseudo-classes everywhere, but I'm not sure that would be super useful.

What do you want to achieve exactly?

Martin

mna, Aug 09 '17 20:08

Hello, I have thought about it. What I want to achieve is this: when I search an HTML DOM tree and find a node that is useful to me according to some judging algorithm, I would like to store its selector in a database for future use.

For example, I want to crawl the newest article URLs of thousands of blogs. The HTML DOM tree differs from blog to blog, so when I add a blog's index URL, I want to compute the selector of the node <a href="{newest article url }"> with some algorithm and store it in my database.

But I notice that:

  1. goquery is a library for selecting HTML nodes, like jQuery, so this feature may be a little outside goquery's goal.

  2. The selector is not unique for a given node. (For now, I use a recursive parent search to build the selector of a node, adding :nth-child when a parent has siblings to make the selector unique.) It would be nice if it worked the way Chrome does it.
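The recursive parent search with :nth-child described above could look roughly like the sketch below. It is only an illustration under assumptions: `Node`, `Append` and `UniqueSelector` are hypothetical names, and `Node` is a stand-in that holds element nodes only, so the example runs with the standard library. With a real `*html.Node` from golang.org/x/net/html, text and comment siblings would have to be skipped when counting, since :nth-child counts element siblings only.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for an element node of an HTML tree.
// It is not the golang.org/x/net/html type.
type Node struct {
	Tag         string // element name, e.g. "div"
	Parent      *Node
	FirstChild  *Node
	PrevSibling *Node
	NextSibling *Node
}

// Append adds child as the last child of n and returns child.
func (n *Node) Append(child *Node) *Node {
	child.Parent = n
	if n.FirstChild == nil {
		n.FirstChild = child
		return child
	}
	last := n.FirstChild
	for last.NextSibling != nil {
		last = last.NextSibling
	}
	last.NextSibling = child
	child.PrevSibling = last
	return child
}

// UniqueSelector (hypothetical) walks from n up to the root and joins one
// step per ancestor with the child combinator. A step gets :nth-child(k)
// only when the node has at least one sibling.
func UniqueSelector(n *Node) string {
	var parts []string
	for ; n != nil; n = n.Parent {
		part := n.Tag
		if n.Parent != nil && (n.PrevSibling != nil || n.NextSibling != nil) {
			idx := 1 // :nth-child is 1-based
			for s := n.PrevSibling; s != nil; s = s.PrevSibling {
				idx++
			}
			part = fmt.Sprintf("%s:nth-child(%d)", n.Tag, idx)
		}
		parts = append([]string{part}, parts...)
	}
	return strings.Join(parts, " > ")
}

func main() {
	root := &Node{Tag: "html"}
	body := root.Append(&Node{Tag: "body"})
	body.Append(&Node{Tag: "h1"})
	div := body.Append(&Node{Tag: "div"})
	a := div.Append(&Node{Tag: "a"})
	fmt.Println(UniqueSelector(a)) // html > body > div:nth-child(2) > a
}
```

A selector built this way identifies exactly one node, but it is tied to the page's current structure: if the blog inserts an element ahead of the target, the stored selector may point somewhere else.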


sundy-li, Aug 12 '17 04:08

Thanks for the context. Yeah, I see what you mean, and I think it makes sense. I'm gonna try to give this a shot, maybe this weekend (no promises :). I'll take a look at how Chrome handles this, but I think another option is to have an array of html.Node indices to traverse the tree (instead of a CSS selector string). Maybe I'll offer both options.

mna, Aug 12 '17 14:08

I implemented the PathForNode(*html.Node) []int and NodeAtPath([]int) *html.Node functions in the wip-selector branch. That's not exactly what you wanted, but it's the same-ish feature. It works well, though it's not so nice to use because it works with *html.Node instead of *goquery.Selection (however, it can still be useful, as it is probably more efficient to match, retrieve and store than the string selector version will be).
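For readers who have not looked at the branch, the idea of an index path can be sketched as follows. This is not the code from the wip-selector branch: `Node`, `Append`, `pathForNode` and `nodeAtPath` are hypothetical stand-ins written against a minimal tree type so the example runs with the standard library only, and the explicit `root` argument to `nodeAtPath` is an assumption of this sketch.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a tree node with the same kind of
// parent/child/sibling links as golang.org/x/net/html.Node.
type Node struct {
	Data        string
	Parent      *Node
	FirstChild  *Node
	PrevSibling *Node
	NextSibling *Node
}

// Append adds child as the last child of n and returns child.
func (n *Node) Append(child *Node) *Node {
	child.Parent = n
	if n.FirstChild == nil {
		n.FirstChild = child
		return child
	}
	last := n.FirstChild
	for last.NextSibling != nil {
		last = last.NextSibling
	}
	last.NextSibling = child
	child.PrevSibling = last
	return child
}

// pathForNode returns the 0-based child index taken at each level when
// walking from the root down to n. The root itself has an empty path.
func pathForNode(n *Node) []int {
	var path []int
	for ; n != nil && n.Parent != nil; n = n.Parent {
		idx := 0
		for s := n.PrevSibling; s != nil; s = s.PrevSibling {
			idx++
		}
		path = append([]int{idx}, path...)
	}
	return path
}

// nodeAtPath follows path from root and returns the node it leads to,
// or nil if the path does not exist in this tree.
func nodeAtPath(root *Node, path []int) *Node {
	n := root
	for _, idx := range path {
		if n == nil {
			return nil
		}
		n = n.FirstChild
		for i := 0; i < idx && n != nil; i++ {
			n = n.NextSibling
		}
	}
	return n
}

func main() {
	root := &Node{Data: "html"}
	body := root.Append(&Node{Data: "body"})
	body.Append(&Node{Data: "h1"})
	div := body.Append(&Node{Data: "div"})
	a := div.Append(&Node{Data: "a"})

	path := pathForNode(a)
	fmt.Println(path)                          // [0 1 0]
	fmt.Println(nodeAtPath(root, path) == a)   // true
}
```

A slice of small integers like this is compact to store and cheap to resolve, which is the efficiency argument made above. It is just as sensitive to structural changes in the page as an :nth-child selector would be.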

I'll try to add the selector string thing at some point which will fit better with the rest of the goquery API.

mna, Aug 21 '17 12:08

Thanks for the quick implementation. I saw the commit, and PathForNode gives a good path signature for an html.Node in a DOM tree. Though it's not exactly what I want, I could use it as a pointer that can be saved in a database, so it's useful. I will keep waiting for your improved implementation.

sundy-li, Oct 09 '17 00:10