python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
When parsing a simple text such as " my emphasis sentence", Document.summary() inserts a paragraph before the opening . This seems to happen mostly when the text is not already...
Document.summary() for GitHub pages always returns: "You can’t perform that action at this time". This doesn’t happen with other forges (GitLab, Gitea, …).
Somewhere between 0.3.0.6 and the current release, python-readability started aggressively removing all images embedded in the HTML. There doesn't seem to be a way to control this behaviour.
A variant of .summary() that returns plain text instead of HTML, e.g. as a new method or as an argument to the .summary() method. That would be very useful for...
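Until something like this exists in the library, the HTML returned by .summary() can be reduced to plain text by the caller. The following is only a sketch using the standard library's `html.parser`. The helper name `summary_to_text` and the set of block-level tags are assumptions made for illustration, not part of python-readability.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags whose end should produce a line break.
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def summary_to_text(summary_html):
    """Hypothetical helper: turn an HTML string into plain text lines."""
    parser = _TextExtractor()
    parser.feed(summary_html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


example = "<html><body><div><p>Hello <em>world</em></p><p>Second</p></div></body></html>"
plain = summary_to_text(example)
```

With the library installed, the same helper would presumably be called as `summary_to_text(Document(html).summary())`.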
Got the following error: ` File "/usr/local/lib/python3.8/site-packages/readability/readability.py", line 138, in __init__ self.positive_keywords = compile_pattern(positive_keywords) File "/usr/local/lib/python3.8/site-packages/readability/readability.py", line 80, in compile_pattern elif isinstance(elements, re._pattern_type): AttributeError: module 're' has no attribute '_pattern_type'`...
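The traceback points at the private attribute `re._pattern_type`, which is not available in the Python 3.8 installation shown. One way to write such a check without private attributes is to take the type from a compiled pattern. The sketch below illustrates that idea. `compile_keywords` is a hypothetical stand-in, and its handling of strings and lists is an assumption rather than the library's actual code.

```python
import re

# Resolve the compiled-pattern type without relying on private attributes.
PatternType = type(re.compile(""))


def compile_keywords(elements):
    """Hypothetical helper: build one regex from keywords.

    Accepts None, an already compiled pattern, a comma-separated
    str/bytes value, or a list/tuple of strings (assumed input shapes).
    """
    if not elements:
        return None
    if isinstance(elements, PatternType):
        return elements
    if isinstance(elements, bytes):
        elements = elements.decode("utf-8")
    if isinstance(elements, str):
        elements = elements.split(",")
    if isinstance(elements, (list, tuple)):
        # Escape each keyword so it is matched literally.
        return re.compile("|".join(re.escape(x.strip()) for x in elements))
    raise TypeError("Unknown type for the pattern: {}".format(type(elements)))


keywords = compile_keywords("news, article")
```

Here `keywords.search("class=article-body")` finds a match, while an already compiled pattern passed in is returned unchanged.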
Hi, the bug listed here is related to `readability-lxml`: https://github.com/adbar/trafilatura/issues/43 In the following the first p-element (_Mit dem KfW-Unternehmerkredit..._) is missing from the output: ``` html_fragment = '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\n\t\n\nMit dem KfW-Unternehmerkredit...
For example: https://medium.com/faun/apache-spark-on-kubernetes-docker-for-mac-2501cc72e659
Thanks for your work maintaining and keeping this very useful library up to date. **Could you please add a line to the README front page, describing the correct invocation of...
Is there a way to extract images so they become in-line instead of being linked?
Hi, a user ran into this bug: https://github.com/adbar/trafilatura/issues/21 There are links which end up as orphans between paragraphs, which messes up text rendering and conversion. The problem comes from the...