flexmark-java
HTML to Markdown: unreliable override of renderer handlers
Describe the bug
When overriding a Markdown renderer handler by setting the UNWRAPPED_TAGS option, the override is sometimes applied and sometimes not.
- [ ] Parser
- [ ] HtmlRenderer
- [ ] Formatter
- [ ] FlexmarkHtmlParser
- [ ] DocxRenderer
- [ ] PdfConverterExtension
- [ ] extension(s)
- [x] FlexmarkHtmlConverter
To Reproduce
In the following test, we convert HTML with a definition list to Markdown. Because the definition list contains a div tag, the default renderer handler in FlexmarkHtmlConverter does not convert it correctly. I override the handler using the UNWRAPPED_TAGS option so that the dl, dt, and dd tags are processed generically.
The test runs the conversion 10,000 times and prints how many conversions were correct and how many were incorrect.
@Test
public void testMarkdownDefinitionList() {
    String markdown;
    int correct = 0;
    int incorrect = 0;
    DataHolder flexmarkOptions = new MutableDataSet()
            .set(UNWRAPPED_TAGS, new String[] { "article", "address", "frameset", "section", "small", "iframe",
                    "dl", "dt", "dd" })
            .toImmutable();
    FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder(flexmarkOptions).build();
    for (int i = 0; i < 10000; i++) {
        String html = "<dl id=\"definition-list\">\n" +
                "<div>\n" +
                "<dt></dt>\n" +
                "<dd>Data 1</dd>\n" +
                "<span>\n" +
                "<dd>Data 2</dd>\n" +
                "</span>\n" +
                "</div>\n" +
                "</dl>";
        markdown = converter.convert(html);
        if (markdown.contains("Data 2")) {
            correct++;
        } else {
            incorrect++;
        }
    }
    System.out.println("correct: " + correct + ", incorrect: " + incorrect);
    assertEquals(0, incorrect);
}
Expected behavior
The test should pass, with zero incorrect conversions.
Resulting Output
The test fails and shows a similar number of correct and incorrect conversions.
Additional context
I haven't had time to confirm it, but this issue may be caused by storing the Markdown renderer handlers in a Set instead of a List (HtmlConverterCoreNodeRenderer.java:66). If so, the following code in FlexmarkHtmlConverter.java:
Set<HtmlNodeRendererHandler<?>> formattingHandlers = htmlNodeRenderer.getHtmlNodeRendererHandlers();
if (formattingHandlers == null) continue;
for (HtmlNodeRendererHandler<?> nodeType : formattingHandlers) {
    // Overwrite existing renderer
    renderers.put(nodeType.getTagName(), nodeType);
}
... would visit the elements of formattingHandlers in no guaranteed order, and would sometimes fail to override the handler that we wanted to override.
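To make the suspected mechanism concrete, here is a minimal standalone sketch. It is not flexmark-java code: `Handler` and `register` are hypothetical stand-ins for `HtmlNodeRendererHandler` and the loop quoted above, and it assumes that two handlers for the same tag end up in the same collection. Because `put` keeps only the last handler seen for a tag, the winner depends entirely on iteration order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HandlerOrderDemo {
    // Hypothetical stand-in for HtmlNodeRendererHandler: a tag name plus a label.
    static final class Handler {
        final String tagName;
        final String label;

        Handler(String tagName, String label) {
            this.tagName = tagName;
            this.label = label;
        }
    }

    // Same shape as the quoted loop: a later handler overwrites an earlier one
    // registered for the same tag name.
    static Map<String, Handler> register(Iterable<Handler> handlers) {
        Map<String, Handler> renderers = new HashMap<>();
        for (Handler handler : handlers) {
            renderers.put(handler.tagName, handler);
        }
        return renderers;
    }

    public static void main(String[] args) {
        Handler defaultDl = new Handler("dl", "default");
        Handler unwrappedDl = new Handler("dl", "unwrapped");

        // Two fixed orders, standing in for the two orders a Set might yield.
        List<Handler> overrideLast = Arrays.asList(defaultDl, unwrappedDl);
        List<Handler> overrideFirst = Arrays.asList(unwrappedDl, defaultDl);

        System.out.println(register(overrideLast).get("dl").label);  // prints "unwrapped"
        System.out.println(register(overrideFirst).get("dl").label); // prints "default"
    }
}
```

If this is indeed the cause, any collection with a stable, well-defined iteration order (a List, as suggested above) would make the override deterministic.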
The bug is fixed in my pull request. It comes with a unit test, but I'm not sure how to integrate it with the existing AsciiDoc format (I found no way to repeat a single test an arbitrary number of times). I'm open to suggestions if there is an easy fix :)