scrapemark Attribute value ignored when capturing another attribute value in the same tag

Open • ackalker opened this issue 12 years ago • 1 comment

First, thanks for this wonderful tool!

I have run into the following problem. Running this snippet:

import scrapemark

html = """
    <div>
    <a href="http://site.com/page1" title="Page 1">Page 1</a>
    <a href="http://site.com/page2" title="Page 2">Page 2</a>
    <a href="http://site.com/page3" title="Page 3">Page 3</a>
    <a href="http://site.com/page2" title="Next">&gt;</a>
    </div>
    """

res = scrapemark.scrape("""<a href="{{ nextpage }}" title="Next" />""", html)
print res

results in: {'nextpage': u'http://site.com/page1'}, which is simply the first link, not the link to the next page that I would expect. Capturing to a list with:

res = scrapemark.scrape("""{* <a href="{{ [nextpage] }}" title="Next" /> *}""", html)

returns: {'nextpage': [u'http://site.com/page1', u'http://site.com/page2', u'http://site.com/page3', u'http://site.com/page2']}, i.e. a list of all links, not a list containing only the link to the next page.

It appears that scrapemark ignores the title attribute's literal value whenever the same tag also captures the href attribute's value. Am I doing something wrong here, is this simply a quirk we'll have to be aware of, or is this a bug?
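
For comparison, the expected result can be sketched without scrapemark at all. The snippet below is only an illustration under stated assumptions: it uses Python 3's standard html.parser module rather than scrapemark's pattern language, and the class name NextLinkParser is hypothetical. It keeps an href only when the same tag's title attribute equals the requested value, which is the behaviour the patterns above were meant to express.

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Hypothetical helper: collect href values of <a> tags whose
    title attribute equals a given value. Not part of scrapemark."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs for this tag.
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Both conditions must hold on the same tag: the title must match
        # and an href must be present.
        if attr_map.get("title") == self.title and "href" in attr_map:
            self.hrefs.append(attr_map["href"])


html = """
    <div>
    <a href="http://site.com/page1" title="Page 1">Page 1</a>
    <a href="http://site.com/page2" title="Page 2">Page 2</a>
    <a href="http://site.com/page3" title="Page 3">Page 3</a>
    <a href="http://site.com/page2" title="Next">&gt;</a>
    </div>
    """

parser = NextLinkParser("Next")
parser.feed(html)
print(parser.hrefs)  # ['http://site.com/page2']
```

Only the last link carries title="Next", so the list holds a single URL, unlike the four URLs returned by the list pattern above.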

May 23 '12 19:05 ackalker

Pull request #15 from quink seems to resolve this issue.

May 23 '12 20:05 ackalker
