python-o365
Better linebreak parsing with msg.get_body_text()
I regularly extract the text of HTML messages. The current parsing method (below) fails to insert line breaks where one would expect them. Is it possible to improve this? I could do this directly in lxml (with itertext), but it might be a good enhancement for the library as a whole.
def get_body_text(self):
    """ Parse the body html and returns the body text using bs4
    :return: body as text
    :rtype: str
    """
    if self.body_type.upper() != 'HTML':
        return self.body
    try:
        soup = bs(self.body, 'html.parser')
    except RuntimeError:
        return self.body
    else:
        return soup.body.text
This is done by the beautifulsoup4 library. I don't want to add lxml or any other dependency, so do you have a proposal for how to achieve this?
This would be a really nice enhancement; I am experiencing the same thing. For those coming to this issue, you can try the following:
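message = inbox.get_message("<SOME EMAIL ID>")
soup = message.get_body_soup()
delimiter = "\n\n"
# Replace each <br> tag with the delimiter so the break survives text extraction
for line_break in soup.find_all('br'):
    line_break.replace_with(delimiter)
# Keep the result; a bare soup.get_text() call would discard it
text = soup.get_text()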
Source: https://stackoverflow.com/a/61423104/7362046
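For anyone who wants to try this outside of a mailbox, the same idea can be wrapped in a standalone function that needs only beautifulsoup4. This is a rough sketch and not part of python-o365: the name html_to_text and the BLOCK_TAGS list are made up for illustration, and the choice to also end block-level elements such as <p> with a newline is my own assumption about what "expected" line breaks means.

```python
from bs4 import BeautifulSoup

# Hypothetical list of block-level tags that should end with a line break.
# This is an illustrative choice, not something defined by python-o365.
BLOCK_TAGS = ["p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_text(html):
    """Return the text of an HTML string with line breaks preserved.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Turn every <br> into an explicit newline.
    for line_break in soup.find_all("br"):
        line_break.replace_with("\n")

    # Append a newline to each block-level element so that
    # adjacent blocks do not run together in the extracted text.
    for tag in soup.find_all(BLOCK_TAGS):
        tag.append("\n")

    # Fall back to the whole document when there is no <body> element.
    root = soup.body if soup.body is not None else soup
    return root.get_text().strip()
```

With a message object, this could be called as html_to_text(message.body). Nested block elements (a <p> inside a <div>, say) each add their own newline, so the output may contain blank lines that you might want to collapse afterwards.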