easy-scraper
Ignore tags in target locations
I have two challenging cases:
- Tags nested inside a target location.
Raw doc:
<h class="name" Baz="key here"> this is the text <sub>we</sub> want </h>
Note that the target text is unquoted and sits between the opening and closing tags.
Pattern:
r##"<h class="name" Baz={{key}}> {{this}} </h>"##
Can we ignore the nested tags and capture all of the text?
- A variable number of author tags per heading in a document.
<h class="bar" baz="one"> <span itemprop="name">bla</span> </h>
<h class="bar" baz="two"> <span itemprop="name">bla</span> <span itemprop="name">foo</span> </h>
Pattern:
r##"<h class="bar" baz={{key}}> <span itemprop="name">{{auth}}</span> </h>"##
Can we just take the first itemprop?
Both examples produce multiple matches. A workaround could be to combine the matches on a common key, but this wouldn't work in all cases.
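The "take the first itemprop" idea can be done as post-processing on the match list. Below is a minimal sketch using only the standard library. It assumes each match arrives as a map from placeholder name to captured text with `key` and `auth` entries, and that matches come back in document order. The `Match` alias and the `first_author_by_key` helper are hypothetical names for illustration, not part of easy-scraper.

```rust
use std::collections::BTreeMap;

/// Assumed shape of one match: placeholder name -> captured text.
type Match = BTreeMap<String, String>;

/// Keep only the first `auth` seen for each `key`, in match order.
/// Matches missing either capture are skipped.
fn first_author_by_key(matches: &[Match]) -> BTreeMap<String, String> {
    let mut first: BTreeMap<String, String> = BTreeMap::new();
    for m in matches {
        if let (Some(key), Some(auth)) = (m.get("key"), m.get("auth")) {
            // `or_insert_with` leaves an existing entry untouched,
            // so later authors for the same key are ignored.
            first.entry(key.clone()).or_insert_with(|| auth.clone());
        }
    }
    first
}
```

This only helps when `key` is unique per heading; two headings sharing the same `baz` value would be collapsed into one entry.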
OK, I read through the code and have some solutions.
r##"<h class="name" Baz={{key}}> {{this}} </h>"##
becomes
r##"<h class="name" Baz={{key}}> {{this:*}} </h>"##
And
r##"<h class="bar" baz={{key}}> <span itemprop="name">{{auth}}</span> </h>"##
I think, becomes
r##"<h class="bar" baz={{key}}> <span itemprop="name">{{auth}}</span> ... <span itemprop="name">{{authother}}</span> </h>"##
Then some extra processing is needed to combine the result vectors that share a first author. I'm not sure how this would work with only one author, though.
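One way to cover the single-author case might be to run both patterns and merge their rows by `key`. This is a rough sketch, not a tested recipe. It assumes the two-author pattern yields no row for a heading with a single span, so the single-author pattern's rows fill that gap. `Match` and `merge_author_rows` are hypothetical names, and only the standard library is used.

```rust
use std::collections::BTreeMap;

/// Assumed shape of one match: placeholder name -> captured text.
type Match = BTreeMap<String, String>;

/// Merge rows from the single-author pattern (`key`, `auth`) and the
/// two-author pattern (`key`, `auth`, `authother`) into one author list
/// per key, keeping first-seen order and dropping repeated names.
fn merge_author_rows(single: &[Match], pairs: &[Match]) -> BTreeMap<String, Vec<String>> {
    let mut merged: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for row in single.iter().chain(pairs) {
        if let Some(key) = row.get("key") {
            let authors = merged.entry(key.clone()).or_default();
            for field in ["auth", "authother"] {
                if let Some(name) = row.get(field) {
                    if !authors.contains(name) {
                        authors.push(name.clone());
                    }
                }
            }
        }
    }
    merged
}
```

Deduplicating by name means two different authors with identical names under one heading would be merged, so this is only suitable when names are distinct within a heading.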