news-crawl Use wikidata to complete seeds

Open sebastian-nagel opened this issue 1 year ago • 1 comments

Initially, the news crawler was seeded with URLs of news sites listed in DMOZ; see #8 for the procedure. DMOZ is no longer updated, but Wikidata could serve as a replacement source to complete the seed list:

Select all instances of newspaper (news media or similar) that have an official website:

SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.  # instance of newspaper (Q11032) or of one of its subclasses
  ?item wdt:P856 ?url.  # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
    ?item wdt:P407 ?language.   # language of work or name, if present
    ?language wdt:P220 ?lang.   # ISO 639-3 code of that language
  }
}
LIMIT 50

(execute query on Wikidata query service)
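
As a rough illustration of how the query results could be turned into seed candidates, here is a minimal Python sketch. It assumes the public endpoint `https://query.wikidata.org/sparql` and the standard SPARQL JSON results format. The function names `fetch_results` and `extract_seeds` and the User-Agent string are hypothetical and not part of news-crawl. How the URLs would then be merged into the existing seed list is left open.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumption: queried directly over HTTP GET).
ENDPOINT = "https://query.wikidata.org/sparql"

# The query from this issue, unchanged apart from whitespace.
QUERY = """
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
}
LIMIT 50
"""


def fetch_results(query=QUERY, user_agent="news-crawl-seed-sketch/0.1 (hypothetical)"):
    """Run the query and return the parsed SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={
            # A descriptive User-Agent is expected by the query service.
            "User-Agent": user_agent,
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_seeds(results):
    """Map each official-website URL to the set of language codes seen for it.

    ?lang comes from an OPTIONAL clause, so a row may lack it; such URLs
    are kept with an empty set.
    """
    seeds = {}
    for row in results["results"]["bindings"]:
        url = row["url"]["value"]
        langs = seeds.setdefault(url, set())
        if "lang" in row:
            langs.add(row["lang"]["value"])
    return seeds
```

Calling `extract_seeds(fetch_results())` would yield one entry per distinct URL. Grouping by URL is a design choice made here because an item with several languages produces several result rows despite `DISTINCT`.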

Oct 18 '22 13:10 sebastian-nagel
