Use Wikidata to complete seeds

Open • sebastian-nagel opened this issue 1 year ago • 1 comment

Initially, the news crawler was seeded with URLs of news sites listed in DMOZ, see #8 for the procedure. DMOZ is no longer updated, but Wikidata could serve as a replacement to complete the seed list:

  • select all instances of newspaper (news media, or similar) having an official website:
    SELECT DISTINCT ?item ?itemLabel ?lang ?url
    WHERE
    { 
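      ?item wdt:P31/wdt:P279* wd:Q11032.  # instance of newspaper (Q11032) or of any of its subclasses
      ?item wdt:P856 ?url.  # with official website
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
      OPTIONAL {
         ?item wdt:P407 ?language.  # language of work or name, if recorded
         ?language wdt:P220 ?lang.  # ISO 639-3 code of that language
       }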
       }
    }
    LIMIT 50
    
    (execute query on Wikidata query service)
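
To turn the query result into seed URLs, something along these lines might work. This is only a rough sketch, not part of the news-crawl code base: it sends the query above to the public Wikidata query service endpoint and reads the standard SPARQL JSON result format. The function names (`build_request`, `parse_seeds`, `fetch_seeds`) and the User-Agent string are made up for illustration, and the `LIMIT 50` would need to be raised or removed for a full seed list.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata query service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from this issue, unchanged.
QUERY = """
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url.  # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
     ?item wdt:P407 ?language.
     ?language wdt:P220 ?lang.
   }
}
LIMIT 50
"""


def build_request(query, user_agent="news-crawl-seed-sketch/0.1 (hypothetical)"):
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder value; use a descriptive agent with contact info.
            "User-Agent": user_agent,
        },
    )


def parse_seeds(results):
    """Return sorted, de-duplicated (url, lang) pairs; lang is None if unbound."""
    seeds = set()
    for row in results["results"]["bindings"]:
        url = row["url"]["value"]
        # ?lang comes from the OPTIONAL block, so it may be missing.
        lang = row.get("lang", {}).get("value")
        seeds.add((url, lang))
    return sorted(seeds, key=lambda s: (s[0], s[1] or ""))


def fetch_seeds(query=QUERY):
    """Run the query against the live endpoint (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=60) as resp:
        return parse_seeds(json.load(resp))
```

Calling `fetch_seeds()` would return a list of `(url, lang)` tuples. An item with several languages yields one tuple per language, so the URLs may still need de-duplication before they are merged into the seed list.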

sebastian-nagel, Oct 18 '22 13:10