document_cluster
BeautifulSoup used for no reason?
synopses_wiki = open('synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]
synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)
synopses_wiki = synopses_clean_wiki
It seems the HTML has already been stripped from synopses_list_wiki.txt, so is running the text through BeautifulSoup pointless? I mention it because BeautifulSoup seems to slow things down significantly.
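One way to confirm this before removing the call is to scan the synopses for anything that looks like a tag. The sketch below is only a rough check, not a replacement for the notebook's code: the names `looks_like_html` and `split_synopses` and the sample string are made up for illustration, and the regex is a heuristic that will not catch HTML entities such as `&amp;`.

```python
import re

# Heuristic: a '<' followed immediately by a letter (optionally after '/'),
# then anything up to the next '>'. This is an assumption about what
# "leftover HTML" would look like, not a full HTML grammar.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def looks_like_html(text):
    """Return True if the text contains something resembling an HTML tag."""
    return TAG_RE.search(text) is not None

def split_synopses(raw, separator="\n BREAKS HERE", limit=100):
    """Split the raw file contents the same way the quoted code does."""
    return raw.split(separator)[:limit]

# Hypothetical stand-in for open('synopses_list_wiki.txt').read()
sample = "Plot of film one.\n BREAKS HERE<p>Plot of film two.</p>\n BREAKS HERE"

synopses = split_synopses(sample)
needs_cleaning = [looks_like_html(s) for s in synopses]
print(sum(needs_cleaning), "of", len(synopses), "synopses contain tag-like text")
```

If the count comes back as zero on the real file, the BeautifulSoup pass could probably be dropped. Otherwise it could be applied only to the flagged entries.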