
Better linebreak parsing with msg.get_body_text()

Open johanovic opened this issue 5 years ago • 2 comments

I regularly extract the text of an HTML message. The current parsing method (below) fails to insert line breaks where one would expect them. Is it possible to improve this? I could do this directly in lxml (with itertext), but it might be a good enhancement for the library as a whole.

def get_body_text(self):
    """ Parse the body html and returns the body text using bs4

    :return: body as text
    :rtype: str
    """
    if self.body_type.upper() != 'HTML':
        return self.body

    try:
        soup = bs(self.body, 'html.parser')
    except RuntimeError:
        return self.body
    else:
        return soup.body.text

johanovic avatar Jul 15 '20 09:07 johanovic

The parsing is done by the beautifulsoup4 library. I don't want to add lxml or any other dependency, so do you have a proposal for how to achieve this?

alejcas avatar Jul 20 '20 19:07 alejcas

This would be a really nice enhancement; I am experiencing the same thing. For those coming to this issue, you can try the following:

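# `inbox` is assumed to be an existing O365 mail folder object
message = inbox.get_message("<SOME EMAIL ID>")
soup = message.get_body_soup()
delimiter = "\n\n"
# Replace each <br> tag with the delimiter so line breaks survive text extraction
for line_break in soup.find_all("br"):
    line_break.replace_with(delimiter)
text = soup.get_text()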

Source: https://stackoverflow.com/a/61423104/7362046

tylerlittlefield avatar May 17 '24 16:05 tylerlittlefield
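
For anyone who wants line breaks at block elements as well as at <br> tags without pulling in another dependency, here is a minimal sketch that uses only the standard library's html.parser. It is not part of python-o365: the names `html_to_text`, `_TextExtractor`, and `BLOCK_TAGS` are hypothetical, and the set of tags treated as line-breaking is an assumption you would likely want to adjust. Blank lines are dropped, so paragraph spacing is not preserved.

```python
from html.parser import HTMLParser

# Assumption: closing any of these tags should end the current line.
# The set is illustrative, not exhaustive.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, inserting newlines at <br> and block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of `html`, one non-empty line per output line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

With python-o365 this could presumably be called as `html_to_text(message.body)` when `message.body_type` is HTML, which mirrors the check in `get_body_text()` quoted at the top of this issue.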