news-crawl issues

Results: 15 news-crawl issues

Use wikidata to complete seeds

Initially, the news crawler was seeded with URLs of news sites from DMOZ; see #8 for the procedure. DMOZ isn't updated anymore, but [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) could be a replacement to complete...

sebastian-nagel
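
As a rough illustration of what a Wikidata-based seed lookup might look like, the sketch below builds a SPARQL query for items that are newspapers and have an official website. This is not the procedure from the issue. The helper names `build_news_site_query` and `build_request_url` are hypothetical, and restricting the search to `wd:Q11032` (newspaper) is an assumption, since other news-related classes would likely be needed as well. The code only builds the query and the request URL; it does not contact the endpoint.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_news_site_query(limit=100):
    """Return a SPARQL query for newspapers with an official website.

    Hypothetical helper: restricting to wd:Q11032 (newspaper) is an
    assumption made for this sketch, not part of the original issue.
    """
    return f"""
SELECT DISTINCT ?site ?website WHERE {{
  ?site wdt:P31/wdt:P279* wd:Q11032 ;   # instance of (a subclass of) newspaper
        wdt:P856 ?website .             # official website
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(query):
    """Return a GET URL asking the endpoint for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return WIKIDATA_SPARQL_ENDPOINT + "?" + params


if __name__ == "__main__":
    print(build_request_url(build_news_site_query(limit=10)))
```

The resulting `?website` values would still need to be normalized and checked for feeds and sitemaps before being used as seeds.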

produce WET files?

I'm not sure if this is the right place to ask this (feel free to direct me elsewhere), but would it be possible to also produce WET files from this...

chris-ha458

Avoid following advertisements in news feeds and sitemaps

See also [this discussion on Common Crawl's user group](https://groups.google.com/g/common-crawl/c/CKPaaMCga6Y/m/xhoAkC9oAAAJ). Some news sites sell slots in their news feeds and sitemaps and put advertisements there. The crawler follows these links the...

sebastian-nagel

Explore schema.org annotations for seed completions

Explore the schema.org annotation [NewsArticle](https://schema.org/NewsArticle) from CC main crawls or [WDC](http://webdatacommons.org/) to complete the list of news sites/domains used to look for news feeds and sitemaps. The issue is not to...

sebastian-nagel

Consider archiving of news feeds and sitemaps

The news feeds and sitemaps can be useful by themselves, the feeds more than the sitemaps because they include news titles and short snippets. It might make sense to...

sebastian-nagel
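
To illustrate why archived feeds could be useful on their own, here is a minimal sketch that pulls titles, links, and snippets out of an RSS 2.0 document using only the Python standard library. The sample feed and the function name `extract_feed_items` are made up for this example, and real feeds (Atom, namespaced extensions, malformed XML) would need more care.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed used only for this illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://news.example.com/a</link>
      <description>Short snippet A.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://news.example.com/b</link>
      <description>Short snippet B.</description>
    </item>
  </channel>
</rss>
"""


def extract_feed_items(feed_xml):
    """Return a list of dicts with title, link and snippet per <item>."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "snippet": item.findtext("description", default="").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in extract_feed_items(SAMPLE_FEED):
        print(entry["title"], "|", entry["link"], "|", entry["snippet"])
```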
