HTML to Markdown: unreliable override of renderer handlers

Open p10trk opened this issue 3 years ago • 1 comments

Describe the bug

When overriding a Markdown renderer handler by setting the UNWRAPPED_TAGS option, the override is sometimes applied and sometimes not.

[ ] Parser
[ ] HtmlRenderer
[ ] Formatter
[ ] FlexmarkHtmlParser
[ ] DocxRenderer
[ ] PdfConverterExtension
[ ] extension(s)
[x] FlexmarkHtmlConverter

To Reproduce

In the following test, we convert HTML containing a definition list to Markdown. Because the definition list contains a div tag, the default renderer handler in FlexmarkHtmlConverter does not convert it correctly. I override the handler using the UNWRAPPED_TAGS option, so that the tags dl, dt, and dd get processed in a generic way.

The test runs the conversion 10000 times and prints how many times the result was correct and how many times it was incorrect.

    @Test
    public void testMarkdownDefinitionList() {
        String markdown;
        int correct = 0;
        int incorrect = 0;

        DataHolder flexmarkOptions = new MutableDataSet()
                .set(UNWRAPPED_TAGS, new String[] { "article", "address", "frameset", "section", "small", "iframe",
                        "dl", "dt", "dd", })
                .toImmutable();
        FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder(flexmarkOptions).build();

        for (int i = 0; i < 10000; i++) {
            String html = "<dl id=\"definition-list\">\n" +
                    "<div>\n" +
                    "<dt></dt>\n" +
                    "<dd>Data 1</dd>\n" +
                    "<span>\n" +
                    "<dd>Data 2</dd>\n" +
                    "</span>\n" +
                    "</div>\n" +
                    "</dl>";

            markdown = converter.convert(html);

            if (markdown.contains("Data 2")) {
                correct++;
            } else {
                incorrect++;
            }
        }

        System.out.println("correct: " + correct + ", incorrect: " + incorrect);

        assertEquals(0, incorrect);
    }

Expected behavior

The test should be successful.

Resulting Output

The test fails and shows a similar number of correct and incorrect conversions.

Additional context

I haven't had time to confirm this, but the issue may be caused by storing the Markdown renderer handlers in a Set instead of a List (HtmlConverterCoreNodeRenderer.java:66). If so, the following code in FlexmarkHtmlConverter.java:

                Set<HtmlNodeRendererHandler<?>> formattingHandlers = htmlNodeRenderer.getHtmlNodeRendererHandlers();
                if (formattingHandlers == null) continue;

                for (HtmlNodeRendererHandler<?> nodeType : formattingHandlers) {
                    // Overwrite existing renderer
                    renderers.put(nodeType.getTagName(), nodeType);
                }

... would iterate over formattingHandlers in no guaranteed order, and would therefore sometimes fail to override the handler that we wanted to override.
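
To illustrate the suspected mechanism, here is a minimal standalone sketch. It does not use flexmark at all: Handler is a hypothetical stand-in for HtmlNodeRendererHandler, and register only mirrors the loop quoted above. With a collection that keeps insertion order (a LinkedHashSet here, a List would behave the same way), the handler added last always wins. With a plain HashSet of objects that rely on identity hash codes, the iteration order is not guaranteed, which would explain the mixed results if my guess is right.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class HandlerOrderSketch {
    // Hypothetical stand-in for HtmlNodeRendererHandler. It does not override
    // equals/hashCode, so two handlers for the same tag are distinct set elements.
    static final class Handler {
        final String tagName;
        final String source;

        Handler(String tagName, String source) {
            this.tagName = tagName;
            this.source = source;
        }
    }

    // Mirrors the quoted loop: a later handler replaces an earlier one
    // registered under the same tag name.
    static Map<String, Handler> register(Iterable<Handler> handlers) {
        Map<String, Handler> renderers = new HashMap<>();
        for (Handler handler : handlers) {
            renderers.put(handler.tagName, handler);
        }
        return renderers;
    }

    // LinkedHashSet iterates in insertion order, so the override added last
    // is also the last one put into the map.
    static String winnerWithInsertionOrder() {
        Set<Handler> handlers = new LinkedHashSet<>();
        handlers.add(new Handler("dl", "default"));
        handlers.add(new Handler("dl", "override"));
        return register(handlers).get("dl").source;
    }

    public static void main(String[] args) {
        System.out.println(winnerWithInsertionOrder()); // prints "override"
    }
}
```

Replacing LinkedHashSet with HashSet in this sketch removes the ordering guarantee, so either handler could end up in the map.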

Jun 16 '22 23:06 p10trk

The bug is fixed in my pull request. It comes with a unit test, but I'm not sure how to integrate it with the existing AsciiDoc format (I found no way to repeat a single test an arbitrary number of times). I'm open to suggestions if there is an easy fix :)

Jun 24 '22 13:06 p10trk