scrapemark
Attribute value ignored when capturing another attribute value in the same tag
First, thanks for this wonderful tool!
I have run into the following problem. This snippet:
```python
import scrapemark
html = """
<div>
<a href="http://site.com/page1" title="Page 1">Page 1</a>
<a href="http://site.com/page2" title="Page 2">Page 2</a>
<a href="http://site.com/page3" title="Page 3">Page 3</a>
<a href="http://site.com/page2" title="Next">></a>
</div>
"""
res = scrapemark.scrape("""<a href="{{ nextpage }}" title="Next" />""", html)
print res
```
results in:
```
{'nextpage': u'http://site.com/page1'}
```
which is simply the first link, not the link to the next page as I would expect.
Capturing to a list with:
```python
res = scrapemark.scrape("""{* <a href="{{ [nextpage] }}" title="Next" /> *}""", html)
```
returns:
```
{'nextpage': [u'http://site.com/page1', u'http://site.com/page2', u'http://site.com/page3', u'http://site.com/page2']}
```
i.e. a list of all links, not a list with just the link to the next page.
It appears as if scrapemark is ignoring the title attribute's value when it can capture the href attribute's value. Am I doing something wrong here, is this simply a quirk we'll have to be aware of, or is this a bug?
Pull request #15 from quink seems to resolve this issue.
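For anyone who needs the expected result without patching scrapemark, here is a rough sketch of the matching I had in mind, using only the standard library rather than scrapemark itself. It assumes Python 3 (`html.parser`), and the names `TitleLinkParser` and `hrefs_with_title` are made up for this illustration; they are not part of scrapemark.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags whose title equals a given string."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Require BOTH conditions: the title must match and an href must exist.
        if attrs.get("title") == self.title and "href" in attrs:
            self.hrefs.append(attrs["href"])


def hrefs_with_title(html, title):
    """Hypothetical helper: return every matching href in document order."""
    parser = TitleLinkParser(title)
    parser.feed(html)
    return parser.hrefs


html = """
<div>
<a href="http://site.com/page1" title="Page 1">Page 1</a>
<a href="http://site.com/page2" title="Page 2">Page 2</a>
<a href="http://site.com/page3" title="Page 3">Page 3</a>
<a href="http://site.com/page2" title="Next">></a>
</div>
"""

print(hrefs_with_title(html, "Next"))  # ['http://site.com/page2']
```

Only the link whose title is exactly "Next" is returned, which is the behaviour I expected from a pattern that fixes the title attribute while capturing the href.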