news-crawl
News crawling with StormCrawler - stores content as WARC
Initially, the news crawler was seeded with URLs from news sites from DMOZ, see #8 for the procedure. DMOZ isn't updated anymore, but [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) could be a replacement to complete...
I'm not sure if this is the right place to ask (feel free to direct me elsewhere), but would it be possible to also produce WET files from this...
See also [this discussion on Common Crawl's user group](https://groups.google.com/g/common-crawl/c/CKPaaMCga6Y/m/xhoAkC9oAAAJ). Some news sites sell slots in their news feeds and sitemaps and put advertisements there. The crawler follows these links the...
Explore schema.org annotation [NewsArticle](https://schema.org/NewsArticle) from CC main crawls or [WDC](http://webdatacommons.org/) to complete the list of news sites/domains used to look for news feeds and sitemaps. The issue is not to...
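One possible way to spot such annotations is to look for JSON-LD blocks whose `@type` is `NewsArticle`. The following is only a minimal sketch under that assumption: the helper names (`JsonLdCollector`, `has_news_article`) and the sample page are hypothetical, it uses only the Python standard library, and it ignores Microdata and RDFa markup, which a real survey of CC or WDC data would also need to cover.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def _contains_type(node, wanted):
    """Recursively checks a parsed JSON-LD value for the given @type."""
    if isinstance(node, dict):
        types = node.get("@type")
        if types == wanted or (isinstance(types, list) and wanted in types):
            return True
        return any(_contains_type(value, wanted) for value in node.values())
    if isinstance(node, list):
        return any(_contains_type(value, wanted) for value in node)
    return False


def has_news_article(html):
    """Returns True if any JSON-LD block in the page declares a NewsArticle."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON-LD instead of failing the page
        if _contains_type(data, "NewsArticle"):
            return True
    return False


# Hypothetical sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle", "headline": "Example"}
</script>
</head><body></body></html>"""

if __name__ == "__main__":
    print(has_news_article(SAMPLE_PAGE))
```

Counting, per domain, how many pages pass such a check would give a rough candidate list of news sites; any threshold for accepting a domain is left open here.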
The news feeds and sitemaps can be useful by themselves - the feeds more than the sitemaps because they include news titles and short snippets. It might make sense to...
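To illustrate what a feed carries beyond the article URL, here is a minimal sketch that pulls titles, links, and snippets out of an RSS 2.0 document with the Python standard library. The function name `extract_feed_items` and the sample feed are hypothetical, and the sketch assumes plain RSS 2.0 without namespaces; Atom feeds and the crawler's own feed handling are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://news.example.com/a</link>
      <description>Short snippet A.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://news.example.com/b</link>
    </item>
  </channel>
</rss>"""


def extract_feed_items(feed_xml):
    """Returns one dict per <item> with its title, link, and snippet."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None for a missing element, hence the fallback
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "snippet": (item.findtext("description") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in extract_feed_items(SAMPLE_FEED):
        print(entry["title"], entry["link"], entry["snippet"], sep=" | ")
```

A sitemap entry, by contrast, typically offers little more than the URL and a modification date, which is why the feeds are the richer source.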