
The content extracted by newspaper is out of order

Open riusksk opened this issue 1 year ago • 0 comments

When newspaper is used to extract articles that contain code, the extracted content comes out in the wrong order. For example, on http://akat1.pl/?id=2 the original page reads:

The error is placed in the pass-through() function of mail.local:
<code>

After extraction, it becomes:

<code>
The error is placed in the pass() function of mail.local: 

The bug is in the convert_to_text() function of outputformatters.py:

    def convert_to_text(self):
        txts = []
        for node in list(self.get_top_node()):  # Bug!!!!
            try:
                txt = self.parser.getText(node)

If txt is produced with the following call instead, the order is correct (although line breaks are not preserved correctly); with the for loop above, the output is out of order:

    txt = self.parser.getText(self.get_top_node())
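One way to check a report like this on a given page is to compare the per-child chunks against the text of the whole top node and verify that the chunks appear in document order. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than newspaper's own parser, the SAMPLE markup and the helper names (text_per_child, text_whole_node, is_in_document_order) are hypothetical, and it does not reproduce the reordering itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for an article's top node (not from the real page).
SAMPLE = (
    "<div>"
    "<p>The error is in this function:</p>"
    "<pre>int main(void) { return 0; }</pre>"
    "<p>After the code block.</p>"
    "</div>"
)


def text_per_child(top):
    """Collect text one direct child at a time, like the loop in the issue."""
    return ["".join(child.itertext()).strip() for child in top]


def text_whole_node(top):
    """Collect all text under the top node in a single pass, in document order."""
    return " ".join(t.strip() for t in top.itertext() if t.strip())


def is_in_document_order(chunks, reference_text):
    """Return True if every chunk occurs in reference_text, each after the previous one."""
    pos = 0
    for chunk in chunks:
        idx = reference_text.find(chunk, pos)
        if idx < 0:
            return False
        pos = idx + len(chunk)
    return True


top = ET.fromstring(SAMPLE)
per_child = text_per_child(top)
reference = text_whole_node(top)

print(per_child)
print(is_in_document_order(per_child, reference))
```

With a real extraction, the list built by the loop would play the role of per_child and the single-call text would play the role of reference; a False result would flag the kind of reordering described above.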

riusksk, Aug 17 '22 08:08