1-round spidering problem/error

Open dsjen opened this issue 3 years ago • 0 comments

via @ebndulue

As part of our project trying to identify when preprint server URLs are linked to in news, we ran a topic for all stories (so, a * query) from the Nigeria - National collection for one day, with 1 round of spidering. So our expected results would be all the stories published by those Nigerian news outlets on that date, as well as any URLs that those articles directly link to. Here is a link to the topic: https://topics.mediacloud.org/#/topics/5530/summary?focusId&q&snapshotId=6351&timespanId=1255893.

I queried the topic for the media_id of one of the preprint servers we have identified, arxiv.org (id: 19472), and found 1 article. When clicking on the Links tab (view here: https://topics.mediacloud.org/#/topics/5530/stories/1834760437?focusId&q&snapshotId=6351&timespanId=1255893), you can see that the news source that links to the arxiv.org article is Wired.com. Wired is a US news source and is not part of the Nigeria collection. So, if the Wired story is the only one that links to the arxiv.org URL, then the arxiv.org URL should not be in the topic. Clearly some additional spidering has happened here beyond the initial 1 round.
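To make the expectation concrete, here is a minimal sketch of the check being applied by hand above. It assumes the topic's stories and links have already been exported into plain Python structures; the function name, the story IDs, and the media IDs 100 and 200 are hypothetical (only 19472 for arxiv.org comes from this issue), and nothing here calls the Media Cloud API. A story is considered expected only if it was published by a seed source or is linked to directly by a seed story.

```python
from typing import Dict, Iterable, Set, Tuple


def find_unexpected_stories(
    story_media: Dict[int, int],
    links: Iterable[Tuple[int, int]],
    seed_media_ids: Set[int],
) -> Set[int]:
    """Return story IDs that should not appear in a 1-round spidered topic.

    story_media maps story ID -> media ID (hypothetical export format).
    links is an iterable of (source story ID, target story ID) pairs.
    seed_media_ids is the set of media IDs in the seed collection.
    """
    # Round 0: stories published by sources in the seed collection.
    seed_stories = {s for s, m in story_media.items() if m in seed_media_ids}
    # Round 1: stories that a seed story links to directly.
    round_one = {dst for src, dst in links if src in seed_stories}
    # Anything else could only have arrived through further spidering.
    return set(story_media) - (seed_stories | round_one)


# Hypothetical data: story 1 is from a seed outlet (media 100), story 2 is
# from a non-seed outlet (media 200), story 3 is from arxiv.org (media 19472).
example_story_media = {1: 100, 2: 200, 3: 19472}
example_links = [(1, 2), (2, 3)]  # seed -> non-seed, non-seed -> arxiv.org
unexpected = find_unexpected_stories(example_story_media, example_links, {100})
print(unexpected)
```

In this made-up example, story 3 is flagged because its only inbound link comes from a story outside the seed collection, which is the same pattern as the arxiv.org story linked only from Wired.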

As I mentioned, a similar error has happened to me in 1-round spidered topics before, and Hal said it was somewhat inevitable for things to sneak in sometimes, but I didn't understand why.

Anything you can explain about this, and ways we can address it, would be much appreciated! Several of our research questions have to do with limiting the scope of a topic to only a select set of news sources and the stories they link to, so having that feature be unreliable is somewhat of a challenge.

dsjen, Mar 03 '21 16:03