readability
📚 Turn any web page into a clean view
Cannot parse URLs that use the https:// protocol
Medium articles regularly come back with no images, and sometimes the article is cut off near the end, e.g.: https://medium.com/@erikdkennedy/7-rules-for-creating-gorgeous-ui-part-1-559d4e805cda For Al Jazeera, the title is not extracted, e.g.: http://www.aljazeera.com/news/2015/03/isil-fighters-bulldoze-ancient-assyrian-palace-iraq-150305195222805.html
I want to check if the page is readable or not. Is that possible?
From README: > This lib is using jsdom to parse HTML instead of cheerio because some data such as image size and element visibility isn't able to acquire when using...
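I am building a web bookmarking tool, and all I need is the content of article.title. Is there a configuration option so that, once the title has been retrieved, the library stops fetching the rest of the page? In my tests, some pages take four or five seconds before the article content is fully retrieved, which is a bit too long for me.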
https://github.com/ruguoapp/readability/tree/master/examples These two images are relatively large, which makes cloning this repo slow.
textBody is limited to the first element found in content. Is this on purpose? From src/readability.js:
```javascript
Readability.prototype.getTextBody = function(notDeprecated) {
  […]
  var rootElement = articleContent.childNodes[0];
```
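To illustrate the textBody report above, here is a minimal sketch that uses plain objects in place of DOM nodes. It is not the library's code: `textFromFirstChild` and `textFromAllChildren` are hypothetical names, and the second function is only one possible way to read every child instead of `childNodes[0]`.

```javascript
// Plain objects stand in for DOM nodes; this is an illustration only,
// not code taken from src/readability.js.
const articleContent = {
  childNodes: [
    { textContent: 'First paragraph.' },
    { textContent: 'Second paragraph.' },
  ],
};

// Mirrors the reported behavior: only childNodes[0] is read.
function textFromFirstChild(root) {
  const first = root.childNodes[0];
  return first ? first.textContent : '';
}

// Hypothetical alternative: join the text of every child node,
// skipping children whose text is empty after trimming.
function textFromAllChildren(root) {
  return Array.from(root.childNodes)
    .map((node) => node.textContent.trim())
    .filter((text) => text.length > 0)
    .join('\n');
}

console.log(textFromFirstChild(articleContent));
console.log(textFromAllChildren(articleContent));
```

With more than one child element in the content, the first function drops everything after the first child, which appears to be the truncation the issue describes.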
There is no reason for the parser to run asynchronously when it is passed raw HTML, since no request needs to be made. Usage should be something like... ```javascript const readability =...
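Example: http://blog.csdn.net/levy_cui/article/details/51481306 After stripping, Content still contains `` and ``: ``` Architecture: general-purpose web page main-text extraction based on a line-block distribution function http://wenku.baidu.com/link?url=TOBoIHWT_k68h5z8k_Pmqr-wJMPfCy2q64yzS8hxsgTg4lMNH84YVfOCWUfvfORTlccMWe5Bd1BNVf9dqIgh75t4VQ728fY2Rte3x3CQhaS Algorithm for extracting web page main text and content images http://www.jianshu.com/p/d43422081e4b The algorithm rests on two main principles. Main-text density: after all tags are removed from the HTML, the main-text region has a higher character density and fewer runs of blank lines. Line-block length: content outside the main text is usually short within its own tag (line block). Test source code: https://github.com/rainyear/cix-extractor-py/blob/master/extractor.py#L9 #! /usr/bin/env python3 # -*- coding: utf-8 -*- import requests as req import re DBUG = 0 reBODY =...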
such as https://www.douban.com/note/602333108/