
News crawling with StormCrawler - stores content as WARC

15 news-crawl issues

Hi, I've been trying to run docker non-interactively using the following command:

```
docker run -d \
  -p 127.0.0.1:9200:9200 -p 5601:5601 -p 8080:8080 \
  -v .../data/warc:/data/warc \
  -v...
```

340 WARC files of the news crawl data set, from 2020-09-12 until 2020-10-04, have been captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-January/011042.html) which included [ALPN](https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation) and therefore allowed...

The news crawler (as of now) relies exclusively on [RSS](https://en.wikipedia.org/wiki/RSS)/[Atom](https://en.wikipedia.org/wiki/Atom_(Web_standard)) feeds and [news sitemaps](https://en.wikipedia.org/wiki/Sitemaps#Google_News_Sitemaps) to find links to news articles. However, some news sites do not provide feeds or sitemaps....

enhancement
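
For illustration only, a minimal Python sketch of the feed-based discovery described above; the crawler itself is Java/StormCrawler, and the feed URL here is hypothetical:

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical feed URL; each entry link is a candidate news-article URL
# that a crawler could add to its fetch queue.
FEED_URL = "https://example.com/news/rss.xml"

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    print(entry.get("link"), entry.get("published", "no date"))
```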

If a news site creates sitemaps with unique URLs on a daily basis (or even at shorter intervals), over time this leads to too many sitemaps being checked for updates, causing...

enhancement
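
One possible mitigation for the sitemap growth described above, sketched purely as an assumption (not something the crawler currently implements): expire sitemaps that have not yielded new links for a while. The URLs and the `last_new_link` bookkeeping are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical record of when each sitemap last yielded a new article URL.
last_new_link = {
    "https://example.com/sitemap-2020-09-01.xml": datetime(2020, 9, 1),
    "https://example.com/sitemap-2020-10-03.xml": datetime(2020, 10, 3),
}

def active_sitemaps(now: datetime, max_age_days: int = 30):
    """Keep only sitemaps that produced new links within the last N days."""
    cutoff = now - timedelta(days=max_age_days)
    return [url for url, seen in last_new_link.items() if seen >= cutoff]

print(active_sitemaps(datetime(2020, 10, 4)))  # drops the stale September sitemap
```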

If a news feed uses the sitemaps namespace, it is erroneously detected as a sitemap, which causes it to be processed as a sitemap (without being properly parsed) and not as a feed. One...
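
A hedged sketch of one way detection could distinguish the two document types, by looking at the root element rather than only the declared namespaces; this is an illustration, not the crawler's actual (Java) detection code:

```python
import xml.etree.ElementTree as ET

SITEMAP_ROOTS = {"urlset", "sitemapindex"}
FEED_ROOTS = {"rss", "feed"}

def classify(xml_bytes: bytes) -> str:
    """Classify a document as 'sitemap' or 'feed' by its root element name."""
    root = ET.fromstring(xml_bytes)
    local_name = root.tag.rsplit("}", 1)[-1]  # strip any namespace prefix
    if local_name in SITEMAP_ROOTS:
        return "sitemap"
    if local_name in FEED_ROOTS:
        return "feed"
    return "unknown"

# An Atom feed that also declares the sitemap namespace should still be a feed.
atom = (b'<feed xmlns="http://www.w3.org/2005/Atom" '
        b'xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"></feed>')
print(classify(atom))  # -> "feed"
```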

The request records in the CC-NEWS WARC files lack the HTTP protocol version: `GET /path` instead of `GET /path HTTP/1.1`. This makes some WARC parsers fail...
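
For reference, a small sketch that checks whether a request line carries the HTTP version required by RFC 7230; the function name is made up for illustration:

```python
import re

# Request line per RFC 7230: method SP request-target SP HTTP-version
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d\.\d$")

def has_protocol_version(request_line: str) -> bool:
    """Return True if the WARC request line includes the HTTP version."""
    return bool(REQUEST_LINE.match(request_line.strip()))

assert not has_protocol_version("GET /path")        # the broken form in CC-NEWS
assert has_protocol_version("GET /path HTTP/1.1")   # the expected form
```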

Sitemaps are automatically detected in the robots.txt but not checked for [cross-submits](https://www.sitemaps.org/protocol.html#sitemaps_cross_submits). From time to time this leads to spam-like injections of URLs not matching the news genre. Recently, via...
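
A minimal sketch, assuming a simple host comparison is enough to flag a potential cross-submit that needs verification against the rules linked above; the function and URLs are hypothetical:

```python
from urllib.parse import urlparse

def is_cross_submit(sitemap_url: str, link_url: str) -> bool:
    """A sitemap listing URLs for a different host is a cross-submit
    and should be verified before its URLs are accepted."""
    return urlparse(sitemap_url).hostname != urlparse(link_url).hostname

print(is_cross_submit("https://example.com/sitemap.xml",
                      "https://example.com/article.html"))   # False
print(is_cross_submit("https://example.com/sitemap.xml",
                      "https://other-site.org/spam.html"))    # True
```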

It would be great if you could additionally extract the date when an article was published. Currently, this requires parsing the web page and using tools such as newspaper3k to...
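
As an illustration of the extra step readers currently need, a short newspaper3k snippet (the article URL is hypothetical):

```python
from newspaper import Article  # third-party: pip install newspaper3k

# Hypothetical article URL; shows the per-page parsing the issue
# would like to avoid by having the crawl store the publication date.
url = "https://example.com/2020/10/04/some-article.html"

article = Article(url)
article.download()
article.parse()
print(article.publish_date)  # datetime, or None if no date could be found
```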

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which...

Upgrade Apache Storm, Elasticsearch and Kibana. This way the NewsCrawler will benefit from the many bugfixes and improvements provided by these components, and it will become easier to add new functionalities...