rust-html2text Decoupling the html2text rendering pipeline

I’ve spent some time using html2text, reading its source code and even writing small patches. Still, I haven’t really grasped the complete rendering process that html2text performs. At the same time, I have some specific requirements like #27 or #36 that cannot be realized with html2text and maybe don’t even belong in a generic HTML rendering library.

Therefore, I am wondering: Would it be possible and would it make sense to decouple the html2text rendering pipeline into steps that can be customized by the user? This would make it easier to understand the rendering process, and it might make it possible to implement some of the requirements I mentioned earlier without having to re-implement the entire rendering stack.

From my point of view, these are the steps of the rendering pipeline (I’m fairly confident that steps 1–3 are correct, but I’m not really sure about steps 4 and 5):

1. Parsing the HTML document (src/lib.rs).
2. Transforming the HTML document into a render tree (src/lib.rs).
3. Estimating the size of the elements of the render tree (src/lib.rs).
4. Laying out the elements of the render tree into lines (src/text_renderer.rs?).
5. Rendering the elements into text (src/text_renderer.rs?).
6. Annotating the lines using a TextDecorator (src/text_renderer.rs).

It would be especially nice if the user could customize step 5 without having to re-implement everything else.
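
To make the idea more concrete, here is a minimal sketch of what a decoupled pipeline could look like. Everything in it is hypothetical: `Block`, `Line`, `default_layout`, `default_render` and `run` are made-up names, not html2text types, and the layout stage is a trivial stand-in. It only shows the shape: each stage is a parameter, so the rendering stage can be swapped while the rest is reused.

```rust
/// Toy stand-in for a render-tree element (hypothetical, not html2text's type).
pub struct Block {
    pub tag: &'static str,
    pub text: String,
}

/// A laid-out line that still remembers which element produced it.
pub struct Line {
    pub tag: &'static str,
    pub text: String,
}

/// Stand-in for the layout stage: one line per block, cut to `width` characters.
pub fn default_layout(blocks: &[Block], width: usize) -> Vec<Line> {
    blocks
        .iter()
        .map(|b| Line {
            tag: b.tag,
            text: b.text.chars().take(width).collect(),
        })
        .collect()
}

/// Stand-in for the rendering stage: emit the text unchanged.
pub fn default_render(line: &Line) -> String {
    line.text.clone()
}

/// Runs the stages in order. Each stage is passed in, so a caller can
/// replace one of them and keep the defaults for the others.
pub fn run<L, R>(blocks: &[Block], width: usize, layout: L, render: R) -> String
where
    L: Fn(&[Block], usize) -> Vec<Line>,
    R: Fn(&Line) -> String,
{
    layout(blocks, width)
        .iter()
        .map(|line| render(line))
        .collect::<Vec<String>>()
        .join("\n")
}
```

With this shape, special handling for `<pre>` content would be a closure passed as `render`, with `default_layout` left untouched.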

Is my understanding of the rendering process roughly correct? What do you think?

Oct 03 '20 17:10 robinkrahl

I think that's a reasonable summary of how it works. Some more notes:

The size estimate is needed for laying out tables - i.e. deciding how wide each of the columns should be.
The annotation is part of the text layout - some annotations can add text which needs to be taken into account.
The layout is really just a tree walk of the render tree, but it's harder to follow because tree_map_reduce() is used to avoid stack overflows; it has an explicit stack of work to do rather than the more readable recursion.
The text layout is mostly using the obvious algorithm - keep trying to add words until the line is full, then start a new line. Nested blocks use nested text renderers (e.g. for quoted text, render into a renderer that is two columns narrower and then add the lines to the parent with a > prefix).
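
The last two notes can be illustrated with a toy version. This is a sketch under assumptions, not html2text's code: `Node`, `Frame`, `wrap` and `render` are made-up names, widths are counted in bytes rather than display columns, and the explicit stack only imitates the idea behind tree_map_reduce().

```rust
/// Toy render tree: paragraphs of text and block quotes containing more nodes.
pub enum Node {
    Para(String),
    Quote(Vec<Node>),
}

/// Greedy wrapping: keep adding words until the line is full, then start a new line.
/// A single word longer than `width` ends up on an overlong line of its own.
pub fn wrap(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    for word in text.split_whitespace() {
        // The +1 accounts for the space that would precede the word.
        if !line.is_empty() && line.len() + 1 + word.len() > width {
            lines.push(std::mem::take(&mut line));
        }
        if !line.is_empty() {
            line.push(' ');
        }
        line.push_str(word);
    }
    if !line.is_empty() {
        lines.push(line);
    }
    lines
}

/// One entry of the explicit work stack: a list of sibling nodes, the position
/// of the next node to process, and the lines produced so far at this level.
struct Frame<'a> {
    nodes: &'a [Node],
    next: usize,
    width: usize,
    lines: Vec<String>,
}

/// Walks the tree with an explicit stack instead of recursion.
pub fn render(root: &[Node], width: usize) -> Vec<String> {
    let mut stack = vec![Frame { nodes: root, next: 0, width, lines: Vec::new() }];
    loop {
        let top = stack.last_mut().expect("stack is never empty inside the loop");
        let nodes = top.nodes;
        if top.next < nodes.len() {
            let node = &nodes[top.next];
            top.next += 1;
            match node {
                Node::Para(text) => {
                    let wrapped = wrap(text, top.width);
                    top.lines.extend(wrapped);
                }
                Node::Quote(children) => {
                    // Nested block: lay it out two columns narrower.
                    let inner_width = top.width.saturating_sub(2);
                    stack.push(Frame {
                        nodes: children.as_slice(),
                        next: 0,
                        width: inner_width,
                        lines: Vec::new(),
                    });
                }
            }
        } else {
            // This level is finished: hand its lines to the parent with a
            // "> " prefix, or return them if this was the root level.
            let done = stack.pop().expect("stack is never empty inside the loop");
            match stack.last_mut() {
                Some(parent) => parent
                    .lines
                    .extend(done.lines.into_iter().map(|l| format!("> {}", l))),
                None => return done.lines,
            }
        }
    }
}
```

Because each nested quote is laid out at `width - 2` and prefixed only when merged, the quoted lines still fit into the parent's width.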

Can you describe the kind of things you want to do differently?

Oct 03 '20 19:10 jugglerchris

Thanks for the explanations! I’ll have a closer look at the text rendering code.

> Can you describe the kind of things you want to do differently?

It’s not about doing things differently, rather about extending the renderer for special use cases like syntax highlighting or special styling for other elements.

Oct 03 '20 19:10 robinkrahl

I had a thought. Perhaps a useful extension point would be the point where a sub-builder is merged into the parent. For example, after a <pre> block is processed, a function could have access to its lines before they are integrated into the parent builder. (Note that currently <pre> doesn't use a sub-builder, but it could if needed. Something like <blockquote> does, so that it can format its content at a smaller width and then prefix the resulting lines when adding them to the current block.)
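
A minimal sketch of such an extension point, assuming a toy `Builder` type (hypothetical; html2text's real builder and its method names differ): the merge step takes a hook that sees the sub-builder's finished lines before they are prefixed and appended to the parent.

```rust
/// Toy line builder (a stand-in, not html2text's builder type).
pub struct Builder {
    width: usize,
    lines: Vec<String>,
}

impl Builder {
    pub fn new(width: usize) -> Self {
        Builder { width, lines: Vec::new() }
    }

    pub fn width(&self) -> usize {
        self.width
    }

    pub fn lines(&self) -> &[String] {
        &self.lines
    }

    pub fn add_line(&mut self, line: &str) {
        self.lines.push(line.to_string());
    }

    /// Creates a narrower builder that leaves room for `prefix`.
    pub fn sub_builder(&self, prefix: &str) -> Builder {
        Builder::new(self.width.saturating_sub(prefix.len()))
    }

    /// Merges `sub` into `self`. The hypothetical `hook` receives the
    /// sub-builder's lines first and may rewrite them; only then is each
    /// line prefixed and appended to the parent.
    pub fn merge<F>(&mut self, sub: Builder, prefix: &str, hook: F)
    where
        F: FnOnce(Vec<String>) -> Vec<String>,
    {
        for line in hook(sub.lines) {
            self.lines.push(format!("{}{}", prefix, line));
        }
    }
}
```

A caller that does not need the hook could pass the identity closure `|lines| lines`.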

Oct 04 '20 07:10 jugglerchris

> I had a thought. Perhaps a useful extension point would be the point where a sub-builder is merged into the parent.

I like the idea! Maybe this could also be realized by adding optional prepare and finalize methods to the decorator that are called before and after the decorator is used.
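
As a rough sketch of that variant: a toy decorator trait with optional `prepare` and `finalize` methods that default to doing nothing. `ToyDecorator`, `render_lines`, `Stars` and `Numbered` are hypothetical names; this is not the existing TextDecorator trait, only an illustration of how default methods would keep existing decorators working unchanged.

```rust
/// Toy decorator (hypothetical; not html2text's TextDecorator).
pub trait ToyDecorator {
    /// Required: decorate one line of text.
    fn decorate(&mut self, line: &str) -> String;

    /// Optional: called once before the decorator is used.
    fn prepare(&mut self) {}

    /// Optional: called once afterwards with all decorated lines.
    fn finalize(&mut self, lines: Vec<String>) -> Vec<String> {
        lines
    }
}

/// Stand-in for the renderer: calls the hooks around the normal decoration.
pub fn render_lines<D: ToyDecorator>(input: &[&str], decorator: &mut D) -> Vec<String> {
    decorator.prepare();
    let lines: Vec<String> = input.iter().map(|line| decorator.decorate(line)).collect();
    decorator.finalize(lines)
}

/// Implements only the required method; the default hooks apply.
pub struct Stars;

impl ToyDecorator for Stars {
    fn decorate(&mut self, line: &str) -> String {
        format!("*{}*", line)
    }
}

/// Overrides both hooks: resets a counter, then numbers the finished lines.
pub struct Numbered {
    pub next: usize,
}

impl ToyDecorator for Numbered {
    fn decorate(&mut self, line: &str) -> String {
        line.to_string()
    }

    fn prepare(&mut self) {
        self.next = 1;
    }

    fn finalize(&mut self, lines: Vec<String>) -> Vec<String> {
        lines
            .into_iter()
            .map(|l| {
                let n = self.next;
                self.next += 1;
                format!("{}: {}", n, l)
            })
            .collect()
    }
}
```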

Oct 04 '20 08:10 robinkrahl

Another aspect to this topic is that it would be useful to use html2text’s layout mechanisms with a different data source, for example a Markdown document parsed with pulldown-cmark instead of an HTML document.

Oct 07 '20 15:10 robinkrahl

That's an interesting thought. Though as Markdown can contain HTML tags, maybe just going via HTML makes sense. I don't know how common that is, though.

Oct 07 '20 18:10 jugglerchris

Right now the render method looks like:

    /// Render this document using the given `decorator` and wrap it to `width` columns.
    pub fn render<D: TextDecorator>(
        self,
        width: usize,
        decorator: D,
    ) -> RenderedText<D> {
        let renderer = TextRenderer::new(width, decorator);
        let builder = render_tree_to_string(renderer, self.0, &mut Discard {});
        RenderedText(builder)
    }

How would you feel about a PR that made this function take a full R: Renderer instead of a D: TextDecorator? It looks like the rest of the code is already generic enough to handle this. This would allow users to render a RenderTree with their own implementations rather than being forced into TextRenderer.

Thanks for the great lib, almost exactly what I needed.

Nov 25 '22 00:11 grantslatton

Hi @grantslatton - sorry, I accidentally lost the notification and didn't notice it was about a comment here! I'd be very happy to add a new method taking a full renderer (called by the current render method) - then it's not a breaking change.
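
The non-breaking shape described here could look roughly like the following sketch. It uses toy types only: `ToyRenderer`, `ToyTree`, `PlainRenderer`, `ShoutRenderer` and the method name `render_with` are assumptions made for illustration, not html2text's actual API.

```rust
/// Toy renderer trait (hypothetical; html2text's Renderer trait is larger).
pub trait ToyRenderer {
    fn add_block(&mut self, text: &str);
    fn into_string(self) -> String;
}

/// Toy render tree: just a list of text blocks.
pub struct ToyTree(pub Vec<String>);

/// Default renderer: one line per block, cut to `width` characters.
pub struct PlainRenderer {
    width: usize,
    lines: Vec<String>,
}

impl PlainRenderer {
    pub fn new(width: usize) -> Self {
        PlainRenderer { width, lines: Vec::new() }
    }
}

impl ToyRenderer for PlainRenderer {
    fn add_block(&mut self, text: &str) {
        self.lines.push(text.chars().take(self.width).collect());
    }

    fn into_string(self) -> String {
        self.lines.join("\n")
    }
}

impl ToyTree {
    /// New method: accepts any renderer implementation.
    pub fn render_with<R: ToyRenderer>(self, mut renderer: R) -> String {
        for block in &self.0 {
            renderer.add_block(block);
        }
        renderer.into_string()
    }

    /// Existing method keeps its signature and delegates to the new one,
    /// so current callers are unaffected.
    pub fn render(self, width: usize) -> String {
        self.render_with(PlainRenderer::new(width))
    }
}

/// A user-supplied renderer, only possible through `render_with`.
pub struct ShoutRenderer(pub String);

impl ToyRenderer for ShoutRenderer {
    fn add_block(&mut self, text: &str) {
        self.0.push_str(&text.to_uppercase());
        self.0.push('!');
    }

    fn into_string(self) -> String {
        self.0
    }
}
```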

Dec 10 '22 10:12 jugglerchris
