news-crawl
Use Wikidata to complete seeds
Initially, the news crawler was seeded with URLs of news sites taken from DMOZ (see #8 for the procedure). DMOZ is no longer updated, but Wikidata could serve as a replacement to complete the seed list:
- select all instances of newspaper (news media, or similar) having an official website:
  (execute query on Wikidata query service)

      SELECT DISTINCT ?item ?itemLabel ?lang ?url WHERE {
        ?item wdt:P31/wdt:P279* wd:Q11032.
        ?item wdt:P856 ?url.  # with official website
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
        OPTIONAL { ?item wdt:P407 ?language. ?language wdt:P220 ?lang. }
      }
      LIMIT 50
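The query above could also be run from a script so that the results can be merged into the seed list. The following is only a minimal sketch, not part of the crawler: it assumes the public SPARQL endpoint at `https://query.wikidata.org/sparql`, and the names `fetch_results`, `extract_seeds`, `main` and the `User-Agent` string are hypothetical, chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from this issue, kept on separate lines so that the
# "#" comment does not swallow the rest of the query.
QUERY = """
SELECT DISTINCT ?item ?itemLabel ?lang ?url WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url.  # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL { ?item wdt:P407 ?language. ?language wdt:P220 ?lang. }
}
LIMIT 50
"""


def fetch_results(query, endpoint=WDQS_ENDPOINT):
    """Send the query to the endpoint and return the parsed JSON results."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={
            # Hypothetical identifier; replace with a real contact string.
            "User-Agent": "news-crawl-seed-sketch/0.1 (example)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_seeds(results):
    """Map each official website URL to the set of language codes seen for it.

    A URL without any language binding maps to an empty set, because the
    language part of the query is OPTIONAL.
    """
    seeds = {}
    for binding in results["results"]["bindings"]:
        url = binding["url"]["value"]
        languages = seeds.setdefault(url, set())
        if "lang" in binding:
            languages.add(binding["lang"]["value"])
    return seeds


def main():
    """Print one tab-separated line per seed URL: URL, then language codes."""
    seeds = extract_seeds(fetch_results(QUERY))
    for url in sorted(seeds):
        print(url + "\t" + ",".join(sorted(seeds[url])))
```

Calling `main()` would print one line per URL. Grouping by URL matters because an item with several languages appears as several result rows, one per `?lang` value.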