news-crawl

News crawling with StormCrawler - stores content as WARC

Results 15 news-crawl issues

Initially, the news crawler was seeded with URLs from news sites from DMOZ, see #8 for the procedure. DMOZ isn't updated anymore, but [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) could be a replacement to complete...

I'm not sure if this is the right place to ask (feel free to direct me elsewhere), but would it be possible to also produce WET files from this...

See also [this discussion on Common Crawl's user group](https://groups.google.com/g/common-crawl/c/CKPaaMCga6Y/m/xhoAkC9oAAAJ). Some news sites sell slots in their news feeds and sitemaps and put advertisements there. The crawler follows these links the...

Explore schema.org annotation [NewsArticle](https://schema.org/NewsArticle) from CC main crawls or [WDC](http://webdatacommons.org/) to complete the list of news sites/domains used to look for news feeds and sitemaps. The issue is not to...

The news feeds and sitemaps can be useful by themselves - the feeds more than the sitemaps because they include news titles and short snippets. It might make sense to...
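
To illustrate the point about feeds carrying titles and short snippets, here is a minimal sketch that pulls those fields out of an RSS 2.0 document using only the Python standard library. The sample feed, the `news.example.com` URLs, and the function name `extract_feed_items` are made up for illustration; this is not how news-crawl itself processes feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example is self-contained.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://news.example.com/a</link>
      <description>Short snippet A.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://news.example.com/b</link>
    </item>
  </channel>
</rss>"""


def extract_feed_items(rss_text):
    """Return one dict per <item> with its title, link and snippet.

    Missing elements yield empty strings, since feeds in the wild
    frequently omit the description.
    """
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "snippet": (item.findtext("description") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in extract_feed_items(SAMPLE_RSS):
        print(entry["title"], "|", entry["link"], "|", entry["snippet"])
```

A sitemap entry would typically give only the URL and perhaps a modification date, which is why the feed items above are the richer source. Atom feeds use different element names and XML namespaces, so this sketch would need adapting for them.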