python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!

Results 43 python-readability issues

I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing the clean_html functionality from the lxml library. Over the past years, there have been several concerning...

I've seen several cases of code blocks using color schemes where "comment" becomes a style and starts getting trimmed. At best this loses all the comments, and may further mutilate...
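A minimal sketch of how this kind of trimming can happen, assuming the cleaner flags elements whose class attribute contains the substring "comment". The pattern and both helper functions are hypothetical illustrations and are not copied from the library.

```python
import re

# Hypothetical filter pattern, for illustration only (not the library's own).
UNLIKELY = re.compile(r"comment|sidebar|footer", re.I)

# Hypothetical set of whole class names that a stricter filter would flag.
UNLIKELY_TOKENS = {"comment", "comments", "sidebar", "footer"}


def looks_unlikely(class_attr: str) -> bool:
    """Substring match: flags any class attribute containing a filter word."""
    return bool(UNLIKELY.search(class_attr))


def looks_unlikely_strict(class_attr: str) -> bool:
    """Token match: flags only when a whole class name equals a filter word."""
    return any(tok in UNLIKELY_TOKENS for tok in class_attr.lower().split())


# A syntax-highlighting class such as "hljs-comment" trips the substring
# filter but not the token-based one.
for cls in ("hljs-comment", "comments", "article-body"):
    print(cls, looks_unlikely(cls), looks_unlikely_strict(cls))
```

Matching whole class tokens is only one possible mitigation; it would also stop flagging names such as "user-comments", so it is a trade-off rather than a fix.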

I'm struggling to get this working with MSN news articles. Here's the approach I'm using: ```python def fetch_url(url: str, timeout: int = 10) -> str: """Get the content from a...

```python >>> r = fetch_url('https://www.democracynow.org/2023/9/5/headlines/biden_administration_to_supply_ukraine_with_depleted_uranium_munitions') >>> type(r) >>> doc = Document(r.content) >>> doc.summary() 'Independent news has never been so important.\r\nGet Democracy Now! delivered to your inbox every day! Don\'t worry,...

These should be enclosed in print() statements

AdBlock Plus element hiding rules identify the elements to exclude using CSS selectors. This is easy to implement in lxml, if somewhat slow. I'm using this in my own...

Hi, I'm getting this warning: > readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead. I'm running...
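The warning text itself names the two explicit tests to use in place of a bare truth test on an element. The sketch below illustrates both, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without extra packages). The `describe` helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def describe(elem):
    """Classify an element using explicit tests instead of `if elem:`."""
    if elem is None:
        # find() found nothing at all.
        return "missing"
    if len(elem) == 0:
        # The element exists but has no child elements.
        return "present, no children"
    return "present, has children"


root = ET.fromstring("<div><p>text</p></div>")

print(describe(root))               # the <div>, which contains a <p>
print(describe(root.find("p")))     # the <p>, which has text but no children
print(describe(root.find("span")))  # no <span> exists
```

A bare `if elem:` conflates the last two cases, because an element without children is falsy even though it exists.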

Steps to reproduce: `import requests`; `from readability import Document`; `response = requests.get('https://polit.ru/article/2021/09/14/ps_dennet/')`; `print(Document(response.text).summary())`. However, if we use `.content` instead, as in `print(Document(response.content).summary())`, everything works fine. Maybe updating README.rst is worth...
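The difference between `.text` and `.content` reported above may come down to character decoding. One plausible explanation, not confirmed by the issue, is that the server omits the charset from its `Content-Type` header, in which case requests falls back to ISO-8859-1 when building `.text`, while `.content` hands over the raw bytes untouched. The sketch below reproduces that effect with the standard library only; the sample string and function names are made up for illustration.

```python
# Sample Cyrillic text, invented for this example.
ORIGINAL = "Привет, мир"


def decode_as_fallback(raw: bytes) -> str:
    """Decode bytes the way a text/* response without a charset would be."""
    return raw.decode("iso-8859-1")


def decode_correctly(raw: bytes) -> str:
    """Decode bytes using the encoding they were actually written in."""
    return raw.decode("utf-8")


raw = ORIGINAL.encode("utf-8")  # what the server actually sends

garbled = decode_as_fallback(raw)
intact = decode_correctly(raw)

print(garbled == ORIGINAL)  # the fallback decoding does not round-trip
print(intact == ORIGINAL)   # the correct decoding does
```

Passing bytes leaves encoding detection to the parser, which can consult the document's own declarations instead of relying on the HTTP header alone.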

How difficult would it be to implement `isProbablyReaderable(doc, options)` (from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)? This would make it possible to check whether a webpage is actually interesting / relevant for scraping and save on speed....
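A rough idea of what such a check could look like, as a simplified sketch and not a port of the Mozilla implementation: collect the text of `<p>` and `<pre>` elements, skip short blocks, and accumulate a score from the remaining lengths. The function name, the parser class, and the default thresholds (140 characters, score of 20) are assumptions chosen to mirror the options documented for readability.js. The real function also filters by visibility and class names, which this sketch omits.

```python
import math
from html.parser import HTMLParser

# Assumed defaults, mirroring the documented readability.js options.
MIN_CONTENT_LENGTH = 140
MIN_SCORE = 20


class _BlockTextCollector(HTMLParser):
    """Collect the text of each top-level <p> or <pre> element.

    Limitation: relies on explicit closing tags, so unclosed <p> tags
    are merged into one block.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "pre"):
            if self._depth == 0:
                self._buf = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("p", "pre") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


def is_probably_readerable(html, min_content_length=MIN_CONTENT_LENGTH,
                           min_score=MIN_SCORE):
    """Return True if the page seems to hold enough paragraph text."""
    parser = _BlockTextCollector()
    parser.feed(html)
    parser.close()

    score = 0.0
    for text in parser.blocks:
        if len(text) < min_content_length:
            continue  # too short to count as article content
        score += math.sqrt(len(text) - min_content_length)
        if score > min_score:
            return True
    return False
```

Because it only scans text lengths, a check like this could run before the full extraction and skip pages that are unlikely to yield an article.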

Take this page, for example: https://thecyberwire.com/newsletters/policy-briefing/4/28: - `doc.summary()` returns only the main text, the first 3 paragraphs, but completely skips the `SELECTED READING` section. Or, take this page: https://thecyberwire.com/newsletters/daily-briefing/11/29 -...