rust-html2text
CSS support for formatting styles
There are situations in which it would be useful for html2text to understand at least a small amount of CSS.
An occasional annoyance I find with some web pages is that they use different classes of <span> (or <div>, depending on preference) for all their formatting, including both paragraph separation and inline style changes such as emphasis. Then they rely on CSS to make some of those span classes behave like <p>, some like <em>, some like <code> and so on.
html2text can't render a document of that kind sensibly unless it speaks enough CSS to at least know which classes of <span> it should treat like which normal tags. Without that, you end up with one huge megaparagraph, or alternatively no end of spurious newlines (depending on whether the author went all-spans or all-divs).
I don't have a real-world example handy, but here's one I mocked up manually:
<html>
<head>
<title>Demo of the 'spans-everywhere' school of HTML</title>
<style type="text/css">
.p { display: block; margin-bottom: 1em; }
.em { font-style: italic; }
.code { font-family: monospace; }
</style>
</head>
<body>
<span class="p">Paragraph one, containing <span class="em">emphasis</span>.</span><span class="p">Paragraph two, containing <span class="code">code</span>.</span>
</body>
</html>
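To make the idea concrete, here is a rough sketch of the "know which classes behave like which tags" step, using only the Rust standard library. It is not html2text's implementation or API: `Role`, `role_for` and `class_roles` are hypothetical names invented for illustration, and the parser only copes with plain single-class rules like the ones in the mock-up above (no comments, combinators, at-rules or case folding).

```rust
use std::collections::HashMap;

/// How a renderer might treat an element carrying a given class.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Block,    // display: block         -> treat like <p>
    Emphasis, // font-style: italic     -> treat like <em>
    Code,     // font-family: monospace -> treat like <code>
}

/// Map one `property: value` declaration to a role, if it is one we recognise.
fn role_for(property: &str, value: &str) -> Option<Role> {
    match (property, value) {
        ("display", "block") => Some(Role::Block),
        ("font-style", "italic") => Some(Role::Emphasis),
        ("font-family", "monospace") => Some(Role::Code),
        _ => None,
    }
}

/// Parse rules of the form `.class { prop: value; ... }` and return the
/// first recognised role for each class. Everything else is skipped.
fn class_roles(css: &str) -> HashMap<String, Role> {
    let mut roles = HashMap::new();
    for rule in css.split('}') {
        // Split "selector { declarations" at the opening brace.
        let Some((selector, body)) = rule.split_once('{') else { continue };
        // Only plain single-class selectors such as `.p` are handled.
        let Some(class) = selector.trim().strip_prefix('.') else { continue };
        let is_simple = !class.is_empty()
            && class.chars().all(|c| c.is_alphanumeric() || c == '-' || c == '_');
        if !is_simple {
            continue;
        }
        for decl in body.split(';') {
            let Some((prop, value)) = decl.split_once(':') else { continue };
            if let Some(role) = role_for(prop.trim(), value.trim()) {
                // Keep the first role found for a class and stop looking.
                roles.entry(class.to_string()).or_insert(role);
                break;
            }
        }
    }
    roles
}

fn main() {
    // The stylesheet from the mock-up document.
    let css = "
        .p { display: block; margin-bottom: 1em; }
        .em { font-style: italic; }
        .code { font-family: monospace; }
    ";
    let roles = class_roles(css);
    for class in ["p", "em", "code"] {
        println!("{class}: {:?}", roles.get(class));
    }
}
```

With a table like this in hand, a renderer could look up each element's class and handle a `Block` span the way it already handles <p>, an `Emphasis` span like <em>, and so on. A real implementation would want a proper CSS parser and selector matching rather than string splitting.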
@jugglerchris mentioned that another use case is pages that use display: none.
There is now what could be described as "minimal CSS support"; it includes display: none but not font-style or display: block. So some progress has been made...