issue and suggestions

Open AndyTheFactory opened this issue 2 years ago • 1 comments

Issue by lvyq800 Wed Aug 22 01:40:55 2018 Originally opened as https://github.com/codelucas/newspaper/issues/612

This is the most awesome tool I have found so far.

I have a suggestion about dealing with html inside of an article: when there is html code inside of the article, the text result currently keeps the html as inner text. My suggestion is to provide options for developers to choose whether to keep it, remove it, or convert it into non-html text.
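To make the three options concrete, here is a minimal sketch using only the Python standard library. The function name handle_embedded_html and its mode parameter are hypothetical and not part of newspaper. It also assumes that "remove" means dropping the tags together with the text they enclose, while "convert" means dropping the tags but keeping the enclosed text.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _EmbeddedHtmlFilter(HTMLParser):
    """Collects text, optionally dropping text enclosed in tags."""

    def __init__(self, keep_enclosed_text):
        super().__init__(convert_charrefs=True)
        self.keep_enclosed_text = keep_enclosed_text
        self.depth = 0   # number of currently open non-void tags
        self.parts = []  # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside any tag is always kept; text inside a tag only in "convert" mode.
        if self.depth == 0 or self.keep_enclosed_text:
            self.parts.append(data)


def handle_embedded_html(text, mode="convert"):
    """Hypothetical helper: mode is "keep", "remove", or "convert"."""
    if mode == "keep":
        return text
    if mode not in ("remove", "convert"):
        raise ValueError("mode must be 'keep', 'remove', or 'convert'")
    parser = _EmbeddedHtmlFilter(keep_enclosed_text=(mode == "convert"))
    parser.feed(text)
    parser.close()
    # Collapse the whitespace left behind where tags were dropped.
    return " ".join("".join(parser.parts).split())


sample = 'Upload with <input type="file" /> then press <b>Save</b> to finish.'
kept = handle_embedded_html(sample, mode="keep")
removed = handle_embedded_html(sample, mode="remove")
converted = handle_embedded_html(sample, mode="convert")
```

With this sample, "keep" returns the string unchanged, "remove" drops the word Save along with its tag, and "convert" keeps Save as plain text.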

There is also an issue when I use fulltext on text that contains html inside of it.

from newspaper import Article, fulltext

article = Article("https://community.thunkable.com/t/uploading-files-to-firebase/1916", keep_article_html=False)
article.download()
article.parse()
article.text  # this content contains html inside of it, which I want to remove
# suggestion: provide options for developers to choose: remove the html, keep the html,
# or convert the html into text, just like what article.parse() does for article.text

ftext = fulltext(article.text)  # I use this function to convert the html into text, to simulate article.parse()

However, this hides two problems.

Problem 1: there is a tag <input ... /> in article.text. The tag is complete, but the fulltext function treats it as incomplete. It seems that an <input ...> would work?

Problem 2: because of problem 1, the fulltext function treats the text that follows "<input ... />" as the tail of the input tag. When returning the text of the input tag, it should return text + tail, but the tail is ignored and only the text of the tag is returned.

Because of problem 1 and problem 2, ftext loses a lot of the text.
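The text/tail split behind the second problem can be reproduced in isolation. The sketch below uses the standard library's xml.etree.ElementTree rather than the lxml parser that newspaper relies on, on the assumption that both follow the same text/tail model. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up snippet with a self-closing <input /> followed by more text.
root = ET.fromstring('<p>Choose a file: <input type="file" /> then press upload.</p>')
input_node = root.find("input")

# itertext() on the <input> node yields only what is inside the tag, which is nothing.
own_text = "".join(input_node.itertext())

# The text that follows the tag is stored separately, as the node's tail.
tail_text = input_node.tail

# Iterating from the parent does include the tails of its children.
whole = "".join(root.itertext())
```

Here own_text is empty while tail_text holds " then press upload.", so a function that returns only the text gathered from the input node drops everything after the tag.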

Oct 24 '23 14:10 AndyTheFactory

Comment by lvyq800 Wed Aug 22 04:02:11 2018

In parsers.py, I changed getText as follows. As a temporary fix, it returns the complete text I want. Hope it helps.

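def getText(cls, node):
    # itertext() yields the text of the node and its descendants, but not the node's own tail
    txts = list(node.itertext())
    # append the tail (the text that follows the node's closing tag) on its own line
    if node.tail:
        tail = "\n" + node.tail
    else:
        tail = ""
    return text.innerTrim(' '.join(txts).strip()) + tail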
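For anyone who wants to try the idea outside of newspaper, here is a standalone sketch of the same change. The names inner_trim and get_text_with_tail are hypothetical. inner_trim is only a rough stand-in for newspaper's text.innerTrim, and the standard library's ElementTree is used in place of lxml.

```python
import re
import xml.etree.ElementTree as ET


def inner_trim(value):
    # Assumption: approximates text.innerTrim by collapsing whitespace runs.
    return re.sub(r"\s+", " ", value).strip()


def get_text_with_tail(node):
    # Text of the node and its descendants; the node's own tail is not included here.
    txts = list(node.itertext())
    # Append the tail on its own line when there is one.
    tail = "\n" + node.tail if node.tail else ""
    return inner_trim(" ".join(txts)) + tail


# Made-up markup: the <b> node has both inner text and a tail.
root = ET.fromstring("<div><b>Step 1</b> open the console</div>")
result = get_text_with_tail(root.find("b"))
```

Note that the tail is appended untrimmed, as in the change above, so any leading whitespace in it is preserved.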

Oct 24 '23 14:10 AndyTheFactory