python-readability FutureWarning: Use specific 'len(elem)' or 'elem is not None' test instead.


Open web64 opened this issue 6 years ago • 4 comments

Hi,

I'm getting this warning:

readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I'm running Python 3.5.2

Cheers!

May 07 '18 14:05 web64

Same here. Any news on this? What do we need to correct?

Nov 09 '18 18:11 noembryo

The warning appears to come from this statement:

doc.body or doc

The `or` tests the truth value of `doc.body`, which is what lxml warns about.
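A minimal sketch of the explicit test that the warning asks for. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml; its elements have historically followed the same truthiness rule (an element without children tests as false). The helper name `body_or_doc` is hypothetical and is not part of python-readability.

```python
import xml.etree.ElementTree as ET


def body_or_doc(doc):
    """Return the <body> element if one exists, otherwise the document root.

    Hypothetical helper: uses an explicit None test instead of `body or doc`,
    so the element's truth value is never evaluated.
    """
    body = doc.find("body")
    return body if body is not None else doc


# An empty <body> has no children, so a truth-value test could treat it as
# missing. The explicit None test still selects it.
doc = ET.fromstring("<html><body></body></html>")
print(body_or_doc(doc).tag)  # body
```

With lxml.html the same pattern would presumably read `doc.body if doc.body is not None else doc`.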

Dec 01 '22 22:12 clach04

I was actually getting bad results, not just warnings: a string containing the repr of a byte buffer. Simple sample code did not show this, only a real web page. It is unclear whether this is related (it might warrant a new issue).

I ended up monkey-patching in a hack. I still got the warning, but at least it worked:

from lxml.etree import tostring
import readability
from readability import Document  # https://github.com/buriy/python-readability/   pip install readability-lxml

## monkey patch

def get_body(doc):
    for elem in doc.xpath(".//script | .//link | .//style"):
        elem.drop_tree()
    # tostring() returns a UTF-8 encoded byte string here
    # FIXME: wouldn't it be better to use tounicode?
    print('MY DEBUG')
    #raw_html = str_(tostring(doc.body or doc))
    #raw_html = tostring(doc.body or doc)
    raw_html = tostring(doc.body or doc, encoding='utf-8').decode('utf-8')
    #import pdb ; pdb.set_trace()
    #raw_html = doc.body or doc
    cleaned = readability.cleaners.clean_attributes(raw_html)
    try:
        # BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
        return cleaned
    except Exception:  # FIXME find the equivalent lxml error
        # logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
        return raw_html


def content(self):
    """Returns document body"""
    #return get_body(self._html(True))
    print('MY DEBUG')
    return get_body(self._html(True))

Document.content = content
## monkey patch
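One way a "repr of a byte buffer" can end up in the output on Python 3 is calling str() on the bytes returned by tostring(). Whether that is what happened here is an assumption; the sketch below only illustrates the difference, again using xml.etree.ElementTree as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET

# Stand-in element; with lxml this would be doc.body or the document itself.
elem = ET.fromstring("<body><p>hello</p></body>")

as_bytes = ET.tostring(elem, encoding="utf-8")   # bytes
mangled = str(as_bytes)                          # repr of the bytes object, starts with "b"
decoded = as_bytes.decode("utf-8")               # real text, as in the patch above
direct = ET.tostring(elem, encoding="unicode")   # str, no separate decode step

print(type(as_bytes).__name__, type(decoded).__name__, type(direct).__name__)
```

lxml.etree.tostring also accepts encoding="unicode", which would avoid the explicit decode in the patch.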

Dec 01 '22 22:12 clach04

I was using one line to validate the response of a tag

Jan 06 '23 04:01 Mustafahubs
