python-readability issues

Consider switching from lxml's clean_html for enhanced security (and possibly performance)

5

I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing the clean_html functionality from the lxml library. Over the past years, there have been several concerning...

frenzymadness

"comment" in unlikely candidates mutilates formatted code blocks

1

I've seen several cases of code blocks using color schemes where "comment" becomes a style and starts getting trimmed. At best this loses all the comments, and may further mutilate...

abeyerpath
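
The failure mode in this report can be reproduced with a plain regular expression. The sketch below assumes a simplified stand-in for an "unlikely candidates" pattern (`UNLIKELY`, `looks_unlikely`, and the class names are illustrative, not copied from python-readability) and shows how a syntax-highlighting class such as `hljs-comment` is caught by a bare `comment` alternative.

```python
import re

# Simplified, hypothetical stand-in for an "unlikely candidates" pattern.
UNLIKELY = re.compile(r"combx|comment|disqus|sidebar|sponsor", re.I)


def looks_unlikely(class_and_id: str) -> bool:
    """Return True if the class/id string matches the unlikely pattern."""
    return UNLIKELY.search(class_and_id) is not None


# A comment section and a highlighted code comment are indistinguishable
# to this check, while an ordinary article container is not flagged.
print(looks_unlikely("comment-list"))
print(looks_unlikely("hljs-comment"))
print(looks_unlikely("article-body"))
```

Telling the two cases apart would probably need context beyond the class name, for example whether the element sits inside `pre` or `code`.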

Readability of MSN articles

I'm struggling to get this working with MSN news articles. Here's the approach I'm using:

```python
def fetch_url(url: str, timeout: int = 10) -> str:
    """Get the content from a...
```

rpdelaney

Summary is fooled by a modal popup

```python
>>> r = fetch_url('https://www.democracynow.org/2023/9/5/headlines/biden_administration_to_supply_ukraine_with_depleted_uranium_munitions')
>>> type(r)
>>> doc = Document(r.content)
>>> doc.summary()
'Independent news has never been so important.\r\nGet Democracy Now! delivered to your inbox every day! Don\'t worry,...
```

rpdelaney

Last two commands in the "usage" section are incorrect

These should be wrapped in `print()` calls.

CSC-ASU

Using AdBlock rules to remove elements

13

AdBlock Plus element hiding rules specify elements to exclude and are specified by CSS selectors. This is easily implemented in lxml, if somewhat slowly. I'm using this in my own...

bburky
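
As a rough illustration of the rule format, the sketch below parses element hiding rules of the form `domains##selector` into a domain list and a CSS selector. It handles only the plain `##` separator and skips exception rules (`#@#`); the names `HidingRule` and `parse_hiding_rule` and the sample rules are made up for this example, and applying the selectors to a parsed document is left out.

```python
from typing import List, NamedTuple, Optional


class HidingRule(NamedTuple):
    domains: List[str]  # empty list means the rule applies on every site
    selector: str       # CSS selector for the elements to remove


def parse_hiding_rule(line: str) -> Optional[HidingRule]:
    """Parse one 'domains##selector' line; return None for anything else."""
    line = line.strip()
    # Skip blank lines, comments ('!') and exception rules ('#@#').
    if not line or line.startswith("!") or "#@#" in line:
        return None
    domains, sep, selector = line.partition("##")
    if not sep or not selector:
        return None  # not an element hiding rule (e.g. a network filter)
    return HidingRule([d for d in domains.split(",") if d], selector)


sample = ["##.ad-banner", "example.com,example.org###sidebar", "! a comment"]
rules = [r for r in map(parse_hiding_rule, sample) if r is not None]
for rule in rules:
    print(rule.domains, rule.selector)
```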

FutureWarning: Use specific 'len(elem)' or 'elem is not None' test instead.

4

Hi, I'm getting this warning:

> readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I'm running...

web64
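
The warning text names the two replacements for a bare truth test on an element. A minimal sketch of both follows, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without extra dependencies; the element names are illustrative).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body/></html>")
body = root.find("body")     # element exists but has no children
missing = root.find("head")  # no such element, so find() returns None

# A bare `if body:` is ambiguous: element truthiness reflects the number
# of children, not whether the element was found. Spell out the intent:
body_exists = body is not None
body_has_children = len(body) > 0
head_exists = missing is not None

print(body_exists, body_has_children, head_exists)
```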

.text may guess the encoding incorrectly

4

Steps to reproduce:

    import requests
    from readability import Document
    response = requests.get('https://polit.ru/article/2021/09/14/ps_dennet/')
    print(Document(response.text).summary())

However, if we use `.content`:

    print(Document(response.content).summary())

everything will be just fine. Maybe updating README.rst is worth...

097115
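
One way this kind of mismatch can arise is when the bytes are decoded with a different charset than the one the page was written in. The sketch below uses only the standard library and a made-up Windows-1251 page to show the effect; whether that is what happens for the URL in the report is an assumption.

```python
# Hypothetical page body: Cyrillic text encoded as Windows-1251.
raw = '<meta charset="windows-1251"><p>Привет</p>'.encode("cp1251")

# Decoding with the wrong charset raises no error but yields mojibake.
wrong = raw.decode("latin-1")
# Decoding with the charset the page declares recovers the text.
right = raw.decode("cp1251")

print("Привет" in wrong)
print("Привет" in right)
```

Passing bytes rather than a pre-decoded string lets the parser see the undecoded data, which is consistent with the report's observation that `.content` works.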

isProbablyReaderable

3

How difficult would it be to implement `isProbablyReaderable(doc, options)` (from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)? This would make it possible to check whether a webpage is actually interesting / relevant for scraping and save on speed....

Uzay-G
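
For a sense of the scope, here is a very rough sketch of such a check using only the standard library. It is loosely modeled on the general idea of scoring paragraphs by text length and comparing against a threshold, and it is not a port of Mozilla's function; the names `is_probably_readerable`, `min_content_length`, and `min_score`, and their default values, are assumptions for illustration.

```python
import math
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and <pre> element.

    Assumes explicit closing tags; implicitly closed <p> is not handled.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside p/pre
        self.current = []     # text chunks of the element being read
        self.paragraphs = []  # finished paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "pre"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("p", "pre") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def is_probably_readerable(html, min_content_length=140, min_score=20.0):
    """Return True once long paragraphs accumulate enough score."""
    collector = _ParagraphCollector()
    collector.feed(html)
    score = 0.0
    for text in collector.paragraphs:
        if len(text) < min_content_length:
            continue  # short paragraphs contribute nothing
        score += math.sqrt(len(text) - min_content_length)
        if score > min_score:
            return True
    return False


long_paragraph = "<p>" + "x" * 500 + "</p>"
print(is_probably_readerable(long_paragraph * 2))
print(is_probably_readerable("<p>short</p>"))
```

Because the check stops at the first point where the score crosses the threshold and never builds a document tree, it stays cheap compared with running a full summary.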

Problems with thecyberwire.com

Take this page, for example: https://thecyberwire.com/newsletters/policy-briefing/4/28:

- `doc.summary()` returns only the main text, the first 3 paragraphs, but completely skips the `SELECTED READING` section.

Or, take this page: https://thecyberwire.com/newsletters/daily-briefing/11/29 -...

097115
