Parse NPR Articles

Open josephpd3 opened this issue 7 years ago • 4 comments

Using the WashingtonPost parser as an example, we want to create another parser for NPR articles. Note: For now, we only care about grabbing anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These reference objects must have the following keys:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag
  • Some sites may have various formats depending on article category or article age (see this issue). These will have to all be handled in the parser. It is fine if you do not catch this at first. Sometimes older articles will only be referenced by older articles, and that is one crazy rabbit hole to try and go down in the initial stages.
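
As a rough illustration of the reference shape described above, here is a minimal sketch that pulls 'href', 'text', and 'context' out of raw HTML using only the standard library. The names ReferenceExtractor, parse_references, and clean are hypothetical and not part of the existing crawler, and the real WashingtonPost parser may well use scrapy selectors instead. It also assumes that "cleaned" just means collapsed whitespace.

```python
from html.parser import HTMLParser


def clean(text):
    """Collapse runs of whitespace (assumed meaning of 'cleaned text')."""
    return " ".join(text.split())


class ReferenceExtractor(HTMLParser):
    """Hypothetical helper: collects <a> references found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.references = []   # finished reference dicts
        self._in_p = False     # True while inside a <p>
        self._p_text = []      # text chunks of the current paragraph
        self._p_refs = []      # anchors seen in the current paragraph
        self._anchor = None    # anchor being built, if inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> implicitly ends when the next one starts.
            self._finish_paragraph()
            self._in_p = True
        elif tag == "a" and self._in_p:
            href = dict(attrs).get("href")
            if href:
                self._anchor = {"href": href, "text": []}

    def handle_data(self, data):
        if self._in_p:
            self._p_text.append(data)
            if self._anchor is not None:
                self._anchor["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor is not None:
            self._p_refs.append(self._anchor)
            self._anchor = None
        elif tag == "p":
            self._finish_paragraph()

    def close(self):
        super().close()
        self._finish_paragraph()  # flush a paragraph left open at end of input

    def _finish_paragraph(self):
        # The context is only known once the whole paragraph has been read.
        context = clean("".join(self._p_text))
        for ref in self._p_refs:
            self.references.append({
                "href": ref["href"],
                "text": clean("".join(ref["text"])),
                "context": context,
            })
        self._in_p = False
        self._p_text, self._p_refs, self._anchor = [], [], None


def parse_references(html):
    """Return a list of reference dicts for anchors inside paragraphs."""
    extractor = ReferenceExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.references
```

Inside the crawler this would presumably be called as parse_references(response.text), assuming the scrapy response is a TextResponse. Anchors outside of a <p> tag are skipped here, which may or may not be what we want for NPR's layouts.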

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

josephpd3 avatar Sep 30 '17 15:09 josephpd3

@josephpd3 I think we shouldn't use web scraping for this, but just rely on the NPR API. Thoughts?

brycecf avatar Oct 01 '17 13:10 brycecf

@brycecf I really like their API, but looking at the story-level output, I don't know if it can meet the requirements for the data we want to scrape.

relatedLink seems to be the closest thing (to a reference defined in this scope) which is received in the API calls, but that doesn't seem to say whether it is an article-supporting link or just related content :(

It also doesn't seem to tell us what sort of context the link is defined in, so we can't really infer that if we wanted to try.

josephpd3 avatar Oct 01 '17 17:10 josephpd3

@josephpd3 I have this implemented. I'll make a pull request later on today.

brycecf avatar Oct 02 '17 05:10 brycecf

@brycecf, could you remove the help wanted tag?

avhirupc avatar Jun 20 '18 11:06 avhirupc