readability issues

https issue

Cannot parse URLs that use the https:// protocol.

Some sites that don't work well: Medium, Al Jazeera

15

Medium: images are regularly missing from the result, and sometimes the article is cut off near the end. E.g.: https://medium.com/@erikdkennedy/7-rules-for-creating-gorgeous-ui-part-1-559d4e805cda

Al Jazeera: the title is not extracted. E.g.: http://www.aljazeera.com/news/2015/03/isil-fighters-bulldoze-ancient-assyrian-palace-iraq-150305195222805.html

OKNoah

enhancement

How to check if page is readable?

4

I want to check if the page is readable or not. Is that possible?

Feelnoobskill

About jsdom usage

From README: > This lib is using jsdom to parse HTML instead of cheerio because some data such as image size and element visibility isn't able to acquire when using...

wong2

About the returned article object

1

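I'm building a web bookmarking tool, and all I need is the content of article.title. Is there a configuration option that stops fetching the rest of the page content once the title is available? In my tests, some pages take four or five seconds before the article content is fully retrieved, which is a bit too long for me.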

luchenqun
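One possible workaround for the title-only case, sketched here outside the library: if the HTML string is already in hand, the `<title>` element can be read with a regular expression instead of building a full DOM. `extractTitle` and `sampleHtml` are hypothetical names used only for illustration; this does not show any option of the library itself, and it does not decode HTML entities.

```javascript
// Hypothetical helper, NOT part of the library: pull the <title> text out of
// an HTML string without parsing the whole document into a DOM.
function extractTitle(html) {
  // Non-greedy match so only the first <title>...</title> pair is captured.
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  // Collapse runs of whitespace (including newlines) into single spaces.
  return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}

// Illustrative input only.
const sampleHtml =
  '<html><head><title>  Example\n page </title></head><body></body></html>';

console.log(extractTitle(sampleHtml));
```

This only avoids the DOM-parsing cost. The page still has to be downloaded, so it helps most when the slow step is parsing rather than the network request.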

Consider removing or compressing the example images in the repo?

4

https://github.com/ruguoapp/readability/tree/master/examples These two images are relatively large, which makes cloning this repo slow.

wong2

textBody returns only the first content element

textBody is limited to the first element found in content. Is this on purpose? From src/readability.js:

```javascript
Readability.prototype.getTextBody = function(notDeprecated) {
  […]
  var rootElement = articleContent.childNodes[0];
```

clement-plancq
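To illustrate the behaviour the report seems to expect, here is a minimal sketch that gathers text from every top-level child instead of only `childNodes[0]`. `collectTextBody` and `fakeContent` are hypothetical names, not the library's code; the stand-in object mimics only the `childNodes` and `textContent` properties of a DOM element so the example runs without jsdom.

```javascript
// Hypothetical helper, NOT the library's implementation: concatenate the text
// of all top-level children of the extracted article content.
function collectTextBody(articleContent) {
  const parts = [];
  for (const node of Array.from(articleContent.childNodes)) {
    // textContent is a standard DOM property; trim to skip whitespace-only nodes.
    const text = (node.textContent || '').trim();
    if (text) parts.push(text);
  }
  // Separate blocks with a blank line.
  return parts.join('\n\n');
}

// Stand-in for a DOM element (assumption for illustration only).
const fakeContent = {
  childNodes: [
    { textContent: 'First paragraph.' },
    { textContent: '   ' },
    { textContent: 'Second paragraph.' },
  ],
};

console.log(collectTextBody(fakeContent));
```

Whether reading only the first child is intentional in the original code is exactly what the issue asks, so this shows an alternative, not a confirmed fix.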

Synchronous functionality when passing raw html

2

There is no reason for the parser to run asynchronously when it is passed raw HTML, since no request needs to be made. Usage should be something like... ```javascript const readability =...

krazyjakee

Tag attributes are not removed cleanly

Example: http://blog.csdn.net/levy_cui/article/details/51481306 After cleaning, Content still contains `` and ``: ``` 架构基于行块分布函数的通用网页正文抽取 http://wenku.baidu.com/link?url=TOBoIHWT_k68h5z8k_Pmqr-wJMPfCy2q64yzS8hxsgTg4lMNH84YVfOCWUfvfORTlccMWe5Bd1BNVf9dqIgh75t4VQ728fY2Rte3x3CQhaS 网页正文及内容图片提取算法 http://www.jianshu.com/p/d43422081e4b 这一算法的主要原理基于两点：正文区密度：在去除HTML中所有tag之后，正文区字符密度更高，较少出现多行空白；行块长度：非正文区域的内容一般单独标签（行块）中较短。测试源码： https://github.com/rainyear/cix-extractor-py/blob/master/extractor.py#L9 #! /usr/bin/env python3 # -*- coding: utf-8 -*- import requests as req import re DBUG = 0 reBODY =...

c4ys

can't read douban notes

3

such as https://www.douban.com/note/602333108/

rupertqin
