
AssertionError for img src attribute

Open arvindpdmn opened this issue 4 years ago • 6 comments

Some web pages have errors. Rather than simply throwing an exception, it would be better to ignore benign errors and convert as much of the page as possible.

  • Version by html2text --version: 2020.1.16

  • Python version python --version: Python 3.7.7

  • Test script:

import requests
import html2text

rsp = requests.get('https://blog.logrocket.com/from-rest-to-graphql/')
h2t = html2text.HTML2Text()
h2t.ignore_links = True
h2t.bypass_tables = False
text = h2t.handle(rsp.text)
  • Log:
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 139, in feed
    super().feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 191, in handle_starttag
    self.handle_tag(tag, dict(attrs), start=True)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError

arvindpdmn • Aug 28 '20 13:08

BTW, I'm running the code in a Jupyter Notebook, so I'm not sure how to disable asserts.

arvindpdmn • Aug 28 '20 13:08
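
On the question of disabling asserts: CPython strips assert statements only when the interpreter is started in optimized mode (the -O flag or the PYTHONOPTIMIZE environment variable), so this cannot be switched from inside a running notebook kernel. The sketch below just demonstrates that behaviour by launching fresh interpreters. The helper name asserts_enabled is made up for illustration, and whether html2text then handles the page correctly with the assertion removed is not something this thread confirms.

```python
import subprocess
import sys


def asserts_enabled(*interpreter_flags):
    """Start a fresh interpreter with the given flags and report its __debug__ value.

    Hypothetical helper for illustration; __debug__ is False when asserts are stripped.
    """
    result = subprocess.run(
        [sys.executable, *interpreter_flags, "-c", "print(__debug__)"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


print(asserts_enabled())      # interpreter started without extra flags
print(asserts_enabled("-O"))  # optimized mode: assert statements are stripped
```

For a notebook, that would mean setting PYTHONOPTIMIZE before the kernel starts or editing the kernel's launch command, which is a blunt tool: it disables every assert in every imported package, not just this one.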

Did you find a solution?

rhlr • Dec 02 '20 06:12

Nope

arvindpdmn • Dec 02 '20 07:12

As a workaround, maybe preprocess your malformed HTML with pytidylib?

jeremydouglass • Dec 10 '20 20:12

Here's a minimal reproducer:

>>> import html2text
>>> html2text.html2text('hi <img src> there')
Traceback (most recent call last):
...
  File "/.../python-3.9.4+/lib/python3.9/site-packages/html2text/__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError

wbolster • Apr 23 '21 11:04
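
The reproducer points at a valueless src attribute. As a rough illustration of why that trips the assertion, the standard-library html.parser (which appears in the traceback in the original report) reports an attribute written without a value as None, while an explicitly empty value comes through as an empty string. The class and function names below are invented for this sketch and are not part of html2text.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: records the src value of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.src_values = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; value is None when the
            # attribute was written without "=..." in the markup.
            self.src_values.append(dict(attrs).get("src", "<missing>"))


def img_src_values(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.src_values


print(img_src_values('hi <img src> there'))     # [None]
print(img_src_values('hi <img src=""> there'))  # ['']
```

So the check `assert attrs["src"] is not None` shown in the traceback fails for `<img src>`, whereas `<img src="">` would pass that particular check.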

@wbolster and here is the corresponding workaround I mentioned earlier:

>>> import html2text
>>> from tidylib import tidy_document
>>> 
>>> document, errors = tidy_document('hi <img src> there')
>>> html2text.html2text(document)
'hi ![]() there\n\n'

jeremydouglass • May 12 '21 18:05
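
If installing pytidylib is not an option, a lighter preprocessing step might be enough for this specific failure. The sketch below uses only the standard library to rewrite a valueless src on img tags into src="", which html.parser then reports as an empty string instead of None. The function name fill_bare_img_src is hypothetical, the regular expressions are a simplification (for example, they assume no ">" inside attribute values), and this has not been checked against every html2text version.

```python
import re

# An <img ...> tag, assuming no ">" appears inside its attribute values.
_IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

# One attribute: a name, optionally followed by "=" and a quoted or unquoted value.
_ATTR = re.compile(r"""\s+([^\s=/>]+)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]*))?""")


def _fix_attr(match):
    name, value = match.group(1), match.group(2)
    if name.lower() == "src" and value is None:
        # Valueless src: give it an explicit empty value.
        return match.group(0) + '=""'
    return match.group(0)


def fill_bare_img_src(html):
    """Rewrite <img src> as <img src=""> and leave everything else untouched."""
    return _IMG_TAG.sub(lambda m: _ATTR.sub(_fix_attr, m.group(0)), html)


print(fill_bare_img_src('hi <img src> there'))  # hi <img src=""> there
```

The result could then be passed on with something like html2text.html2text(fill_bare_img_src(rsp.text)). Unlike tidy, this does not repair any other kind of malformed markup.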