html2text
AssertionError for img src attribute
Some web pages have errors. Rather than simply throwing an exception, it would be better to ignore benign errors and convert as much of the page as possible.
- Version by html2text --version: 2020.1.16
- Python version by python --version: Python 3.7.7
- Test script:
import requests
import html2text
rsp = requests.get('https://blog.logrocket.com/from-rest-to-graphql/')
h2t = html2text.HTML2Text()
h2t.ignore_links = True
h2t.bypass_tables = False
text = h2t.handle(rsp.text)
- Log:
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 139, in feed
    super().feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 191, in handle_starttag
    self.handle_tag(tag, dict(attrs), start=True)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError
BTW, I'm running the code in a Jupyter Notebook, so I'm not sure how to disable asserts.
Did you find any solution?
Nope
As a workaround, maybe preprocess your bad html with pytidylib?
Here's a minimal reproducer:
>>> import html2text
>>> html2text.html2text('hi <img src> there')
Traceback (most recent call last):
  ...
  File "/.../python-3.9.4+/lib/python3.9/site-packages/html2text/__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError
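For context on why the assert fires: the standard library's html.parser reports an attribute written without a value as None, and the traceback shows html2text passing those attributes straight into handle_tag. The sketch below uses only the standard library to show what the parser hands over for the same input; the AttrRecorder class is a hypothetical helper written for this illustration, not part of html2text.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: records every start tag with its raw attributes."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None
        # when the attribute has no "=value" part, as in <img src>.
        self.attrs_seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('hi <img src> there')
recorder.close()
print(recorder.attrs_seen)  # [('img', [('src', None)])]
```

So attrs["src"] exists but is None, which is exactly the case the assert rejects.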
@wbolster, here is the corresponding workaround I mentioned earlier:
>>> import html2text
>>> from tidylib import tidy_document
>>>
>>> document, errors = tidy_document('hi <img src> there')
>>> html2text.html2text(document)
'hi ![]() there\n\n'
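If adding pytidylib is not an option, another possible approach is to normalize valueless attributes before the converter sees them. The sketch below is an assumption, not something confirmed in this thread: EmptyAttrMixin, Recorder, and LenientRecorder are hypothetical names, and the demonstration runs against a plain html.parser.HTMLParser subclass rather than html2text itself. Since the traceback shows html2text's handle_starttag calling handle_tag(tag, dict(attrs), start=True), mixing the same class into html2text.HTML2Text might avoid the assert, but that is untested here.

```python
from html.parser import HTMLParser


class EmptyAttrMixin:
    """Hypothetical mixin: turn None attribute values into empty strings."""

    def handle_starttag(self, tag, attrs):
        # Replace None (attribute without a value) with "" so that
        # downstream code never sees a None attribute value.
        cleaned = [(name, "" if value is None else value) for name, value in attrs]
        super().handle_starttag(tag, cleaned)


class Recorder(HTMLParser):
    """Stand-in for the real converter: stores what it receives."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


class LenientRecorder(EmptyAttrMixin, Recorder):
    """The mixin goes first so its handle_starttag runs before Recorder's."""


parser = LenientRecorder()
parser.feed('hi <img src> there')
parser.close()
print(parser.seen)  # [('img', {'src': ''})]

# Assumed, untested equivalent for html2text:
#   class LenientHTML2Text(EmptyAttrMixin, html2text.HTML2Text): pass
```

The mixin only changes attributes that have no value at all, so attributes with real values pass through unchanged.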