
FutureWarning: Use specific 'len(elem)' or 'elem is not None' test instead.

Open web64 opened this issue 6 years ago • 4 comments

Hi,

I'm getting this warning:

readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I'm running Python 3.5.2

Cheers!

web64 avatar May 07 '18 14:05 web64

Same here. Any news on this? What do we need to correct?

noembryo avatar Nov 09 '18 18:11 noembryo

It appears to be caused by this statement, which tests the truth value of an lxml element:

doc.body or doc
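For context, here is a minimal sketch of why that expression warns and how the two explicit tests named in the warning differ. The helper names `pick_existing` and `pick_nonempty` are made up for illustration and are not part of readability or lxml. Which of the two matches the library's intent is an assumption left open here.

```python
import warnings
from lxml import etree


def pick_existing(elem, fallback):
    """Use elem whenever it exists, even if it has no children."""
    return elem if elem is not None else fallback


def pick_nonempty(elem, fallback):
    """Use elem only if it exists and has at least one child element."""
    return elem if elem is not None and len(elem) > 0 else fallback


doc = etree.fromstring("<html><body/></html>")
body = doc.find("body")  # present, but has no child elements

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = body or doc  # truth-tests the element, which is what warns

# Show which warning categories were recorded and what each approach picked.
print([w.category.__name__ for w in caught])
print(legacy.tag, pick_existing(body, doc).tag, pick_nonempty(body, doc).tag)
```

The two helpers disagree on a childless element, which is presumably why the warning asks you to pick one test explicitly.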

clach04 avatar Dec 01 '22 22:12 clach04

I was actually getting bad results, not just warnings (a string containing the repr of a byte buffer). Simple sample code did not show this, only a real web page did. It is unclear whether this is related (it might warrant a new issue).

I ended up monkey-patching in a hack. I still got the warning, but at least it worked:

from lxml.etree import tostring
import readability
from readability import Document  # https://github.com/buriy/python-readability/   pip install readability-lxml

## monkey patch

def get_body(doc):
    for elem in doc.xpath(".//script | .//link | .//style"):
        elem.drop_tree()
    # tostring(..., encoding='utf-8') returns bytes, so decode it to get a str
    # FIXME: would tounicode() be better?
    print('MY DEBUG')
    #raw_html = str_(tostring(doc.body or doc))
    #raw_html = tostring(doc.body or doc)
    raw_html = tostring(doc.body or doc, encoding='utf-8').decode('utf-8')
    #import pdb ; pdb.set_trace()
    #raw_html = doc.body or doc
    cleaned = readability.cleaners.clean_attributes(raw_html)
    try:
        # BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
        return cleaned
    except Exception:  # FIXME find the equivalent lxml error
        # logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
        return raw_html


def content(self):
    """Returns document body"""
    #return get_body(self._html(True))
    print('MY DEBUG')
    return get_body(self._html(True))

Document.content = content
## monkey patch
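The "repr of a byte buffer" symptom is consistent with calling str() on the bytes that tostring() returns by default. Below is a hedged sketch of a variant that sidesteps both that and the truth test, using lxml's encoding="unicode" option. `serialize_body` is a hypothetical helper, not readability's API. It looks up <body> via xpath rather than `doc.body or doc`, which is a substitution and may behave differently from upstream when <body> is empty.

```python
from lxml import etree
import lxml.html


def serialize_body(doc):
    """Serialize <body> (or the whole doc if there is none) to a str."""
    bodies = doc.xpath("//body")  # plain list; testing it does not warn
    root = bodies[0] if bodies else doc
    # encoding="unicode" makes tostring() return str instead of bytes
    return etree.tostring(root, encoding="unicode")


doc = lxml.html.document_fromstring("<html><body><p>hi</p></body></html>")

# str() applied to bytes yields the repr-style text, not the markup itself.
as_bytes_repr = str(etree.tostring(doc.body))
as_text = serialize_body(doc)

print(as_bytes_repr)
print(as_text)
```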

clach04 avatar Dec 01 '22 22:12 clach04

I was using one line to validate the response of a tag

Mustafahubs avatar Jan 06 '23 04:01 Mustafahubs