
News crawling with StormCrawler - stores content as WARC

15 news-crawl issues

Hi, I've been trying to run docker non-interactively using the following command:

```
docker run -d \
  -p 127.0.0.1:9200:9200 -p 5601:5601 -p 8080:8080 \
  -v .../data/warc:/data/warc \
  -v...
```

340 WARC files of the news crawl data set, from 2020-09-12 until 2020-10-04, have been captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-January/011042.html) which included [ALPN](https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation) and therefore allowed...

The news crawler (as of now) relies exclusively on [RSS](https://en.wikipedia.org/wiki/RSS)/[Atom](https://en.wikipedia.org/wiki/Atom_(Web_standard)) feeds and [news sitemaps](https://en.wikipedia.org/wiki/Sitemaps#Google_News_Sitemaps) to find links to news articles. However, some news sites do not provide feeds or sitemaps....

enhancement
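
For illustration only, a minimal Python sketch of the feed-based discovery described above; the crawler itself is Java/StormCrawler, and the feed URL here is hypothetical:

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical feed URL; each entry link is a candidate news-article URL
# that a crawler could add to its fetch queue.
FEED_URL = "https://example.com/news/rss.xml"

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    print(entry.get("link"), entry.get("published", "no date"))
```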

If a news site creates sitemaps with unique URLs on a daily basis (or even at shorter intervals), over time this leads to too many sitemaps being checked for updates, causing...

enhancement
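
One possible mitigation for the sitemap growth described above, sketched purely as an assumption (not something the crawler currently implements): expire sitemaps that have not yielded new links for a while. The URLs and the `last_new_link` bookkeeping are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical record of when each sitemap last yielded a new article URL.
last_new_link = {
    "https://example.com/sitemap-2020-09-01.xml": datetime(2020, 9, 1),
    "https://example.com/sitemap-2020-10-03.xml": datetime(2020, 10, 3),
}

def active_sitemaps(now: datetime, max_age_days: int = 30):
    """Keep only sitemaps that produced new links within the last N days."""
    cutoff = now - timedelta(days=max_age_days)
    return [url for url, seen in last_new_link.items() if seen >= cutoff]

print(active_sitemaps(datetime(2020, 10, 4)))  # drops the stale September sitemap
```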

If a news feed uses the sitemaps namespace, it is erroneously detected as a sitemap, which causes it to be processed as a sitemap (without being properly parsed) and not as a feed. One...
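
A hedged sketch of one way detection could distinguish the two document types, by looking at the root element rather than only the declared namespaces; this is an illustration, not the crawler's actual (Java) detection code:

```python
import xml.etree.ElementTree as ET

SITEMAP_ROOTS = {"urlset", "sitemapindex"}
FEED_ROOTS = {"rss", "feed"}

def classify(xml_bytes: bytes) -> str:
    """Classify a document as 'sitemap' or 'feed' by its root element name."""
    root = ET.fromstring(xml_bytes)
    local_name = root.tag.rsplit("}", 1)[-1]  # strip any namespace prefix
    if local_name in SITEMAP_ROOTS:
        return "sitemap"
    if local_name in FEED_ROOTS:
        return "feed"
    return "unknown"

# An Atom feed that also declares the sitemap namespace should still be a feed.
atom = (b'<feed xmlns="http://www.w3.org/2005/Atom" '
        b'xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"></feed>')
print(classify(atom))  # -> "feed"
```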

The request records in the CC-NEWS WARC files lack the HTTP protocol version: `GET /path` instead of `GET /path HTTP/1.1`. This makes some WARC parsers fail...
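
For reference, a small sketch that checks whether a request line carries the HTTP version required by RFC 7230; the function name is made up for illustration:

```python
import re

# Request line per RFC 7230: method SP request-target SP HTTP-version
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d\.\d$")

def has_protocol_version(request_line: str) -> bool:
    """Return True if the WARC request line includes the HTTP version."""
    return bool(REQUEST_LINE.match(request_line.strip()))

assert not has_protocol_version("GET /path")        # the broken form in CC-NEWS
assert has_protocol_version("GET /path HTTP/1.1")   # the expected form
```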

Sitemaps are automatically detected in the robots.txt but not checked for [cross-submits](https://www.sitemaps.org/protocol.html#sitemaps_cross_submits). From time to time this leads to spam-like injections of URLs not matching the news genre. Recently, via...
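
A minimal sketch, assuming a simple host comparison is enough to flag a potential cross-submit that needs verification against the rules linked above; the function and URLs are hypothetical:

```python
from urllib.parse import urlparse

def is_cross_submit(sitemap_url: str, link_url: str) -> bool:
    """A sitemap listing URLs for a different host is a cross-submit
    and should be verified before its URLs are accepted."""
    return urlparse(sitemap_url).hostname != urlparse(link_url).hostname

print(is_cross_submit("https://example.com/sitemap.xml",
                      "https://example.com/article.html"))   # False
print(is_cross_submit("https://example.com/sitemap.xml",
                      "https://other-site.org/spam.html"))    # True
```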

It would be great if you could additionally extract the date when an article was published. Currently, this requires parsing the web page and using tools such as newspaper3k to...
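
As an illustration of the extra step readers currently need, a short newspaper3k snippet (the article URL is hypothetical):

```python
from newspaper import Article  # third-party: pip install newspaper3k

# Hypothetical article URL; shows the per-page parsing the issue
# would like to avoid by having the crawl store the publication date.
url = "https://example.com/2020/10/04/some-article.html"

article = Article(url)
article.download()
article.parse()
print(article.publish_date)  # datetime, or None if no date could be found
```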

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which...

Upgrade Apache Storm, Elasticsearch and Kibana. This way the NewsCrawler will benefit from the many bugfixes and improvements provided by these components, and it will become easier to add new functionalities...