PixivUtil2
Replace lxml with html.parser
I didn't notice that lxml required an external dependency. Rather than add it, I've replaced it with the native html.parser. It's apparently slightly slower, but that shouldn't matter for this.
Ah, that was old code. It should have used "html5lib" instead, like the code below.
https://github.com/Nandaka/PixivUtil2/blob/master/PixivBrowserFactory.py#L264
OK, switched to html5lib instead.
I've also added some extra steps to try to preserve some of the text data that was being lost when tags were stripped (line breaks and link targets).
Without Stripping Tags:
[初音ミクシンフォニー2021]公式パンフレットにて鏡音リン・レンのイラストを描かせていただきました。<br />改めましてKAITOさん15周年おめでとうございます!<br /><a href="/jump.php?https%3A%2F%2Fsp.wmg.jp%2Fmikusymphony%2F" target="_blank">https://sp.wmg.jp/mikusymphony/</a>
Original:
[初音ミクシンフォニー2021]公式パンフレットにて鏡音リン・レンのイラストを描かせていただきました。改めましてKAITOさん15周年おめでとうございます!https://sp.wmg.jp/mikusymphony/
New:
[初音ミクシンフォニー2021]公式パンフレットにて鏡音リン・レンのイラストを描かせていただきました。 改めましてKAITOさん15周年おめでとうございます!https://sp.wmg.jp/mikusymphony/ (https://sp.wmg.jp/mikusymphony/)
It's not a great example, because in this case the link's href and its text were the same, but it helps preserve data in cases where they're not, such as Click Here to go to my site.
Original:
Click Here to go to my site
New:
Click Here to go to my site (https://important.site)
Wait, this won't work if there's more than one link... I will do more testing and update later this week...
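For reference, here is a rough sketch of one way the multiple-link case could be handled. It is not the code in this PR: it uses only the standard library's html.parser.HTMLParser rather than BeautifulSoup with html5lib, and the names DescriptionTextExtractor, html_to_text, and JUMP_PREFIX are made up for illustration. It assumes that redirect links always start with "/jump.php?" followed by the percent-encoded target, as in the example above, and it skips the "(url)" suffix when the link text already equals the URL, which differs from the "New" output shown earlier.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Assumption: redirect links look like "/jump.php?<percent-encoded URL>".
JUMP_PREFIX = "/jump.php?"


class DescriptionTextExtractor(HTMLParser):
    """Collects plain text, keeping <br> as a space and each link's target."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []          # output fragments, joined at the end
        self._href = None        # href of the <a> currently open, if any
        self._link_text = []     # text seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_data(self, data):
        self.parts.append(data)
        if self._href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        url = self._href
        if url.startswith(JUMP_PREFIX):
            url = unquote(url[len(JUMP_PREFIX):])
        # Append the target after *this* link only, so several links work.
        if url != "".join(self._link_text).strip():
            self.parts.append(f" ({url})")
        self._href = None


def html_to_text(html):
    parser = DescriptionTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Because the URL is appended when each closing </a> is seen, every link gets its own suffix instead of one suffix for the whole description. Whether to drop the suffix when the text and URL match is a design choice, and the check can be removed to reproduce the earlier output.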