"Line Break"s Break The Markdown & Renderer

Open ahmad2702 opened this issue 2 years ago • 0 comments

Describe the bug

I'm trying to convert HTML to Markdown using FlexmarkHtmlConverter. The resulting Markdown is incorrect when the HTML content contains line breaks inside other tags, and this in turn affects the rendering of the final DOCX document.

[x] Parser
[x] HtmlRenderer
[ ] Formatter
[ ] FlexmarkHtmlParser
[x] DocxRenderer
[ ] PdfConverterExtension
[ ] extension(s)

To Reproduce

Step 1: The following snippet can be used with input_html.txt to convert HTML to Markdown:

public void execute() throws IOException {
	String htmlContent = readFile("input_html.txt");
	String mdContent = htmlToMd(htmlContent);
	writeFile("output.md", mdContent);
}

private static String htmlToMd(String pHtml) {
	FlexmarkHtmlConverter.Builder htmlConBuilder = FlexmarkHtmlConverter.builder();
	FlexmarkHtmlConverter converter = htmlConBuilder.build();
	return converter.convert(pHtml);
}

private String readFile(String pFileName) {
	final List<String> lines = new ArrayList<>();
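	final URL url = this.getClass().getResource("/" + pFileName);
	if (url == null) {
		// getResource returns null when the resource is missing; fail with a clear message.
		throw new RuntimeException("The file cannot be found: " + pFileName);
	}
	// Decode as UTF-8 explicitly (assumes the input file is UTF-8) instead of the platform default.
	try (InputStream inputStream = url.openStream();
			BufferedReader in = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
		String inputLine;
		while ((inputLine = in.readLine()) != null) {
			lines.add(inputLine);
		}
	} catch (IOException e) {
		throw new RuntimeException("The file cannot be read.", e);
	}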
	return String.join(System.lineSeparator(), lines);
}

private void writeFile(String pFileName, String pContent) throws IOException {
	Path path = Paths.get(System.getProperty("user.home"), "Desktop", pFileName);
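	// Write as UTF-8 explicitly so the output does not depend on the platform default charset.
	Files.write(path, pContent.getBytes(StandardCharsets.UTF_8));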
}

The output of this snippet is the following Markdown: output.md. In this file, the blank lines do not match those in the original HTML content.
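To make the mismatch easier to pin down, the line breaks on both sides can be counted and compared. The following is only a rough diagnostic sketch using the Java standard library. `LineBreakDiagnostics` and its methods are hypothetical helpers, not part of flexmark-java, and the hard-break rule (a line ending in two or more spaces or a backslash, followed by a non-blank line) is an approximation that does not treat code blocks specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic helper; not part of flexmark-java. */
final class LineBreakDiagnostics {

    // Matches <br>, <br/> and <br /> in any letter case.
    private static final Pattern BR_TAG =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    /** Counts HTML line-break tags in the given HTML source. */
    static int countBrTags(String html) {
        Matcher matcher = BR_TAG.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /**
     * Approximates the number of Markdown hard line breaks: a non-blank line
     * ending in two or more spaces or a backslash, followed by a non-blank line.
     */
    static int countHardBreaks(String markdown) {
        String[] lines = markdown.split("\\R", -1);
        int count = 0;
        for (int i = 0; i + 1 < lines.length; i++) {
            String line = lines[i];
            boolean hasMarker = line.endsWith("  ") || line.endsWith("\\");
            if (hasMarker && !line.trim().isEmpty() && !lines[i + 1].trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Counts blank (empty or whitespace-only) lines; trailing blank lines are ignored. */
    static int countBlankLines(String markdown) {
        if (markdown.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String line : markdown.split("\\R")) {
            if (line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; the real input_html.txt and output.md are not reproduced here.
        String html = "<p><b>first<br>second</b></p>";
        String markdown = "**first  \nsecond**\n";
        System.out.println("br tags:     " + countBrTags(html));
        System.out.println("hard breaks: " + countHardBreaks(markdown));
        System.out.println("blank lines: " + countBlankLines(markdown));
    }
}
```

Running `countBrTags` on the contents of input_html.txt and `countHardBreaks` on the contents of output.md gives two numbers to compare. A difference suggests that some breaks were dropped or turned into paragraph separators during conversion.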

Step 2: With the Markdown from the previous step, we can render the DOCX document using the following snippet:

public void render() {
	WordprocessingMLPackage doc = getDocument();

	String md = getMarkdownContent();
	MutableDataSet options = createOptions();

	Parser parser = Parser.builder(options).build();
	// Local variable, so it follows lowerCamelCase rather than constant naming.
	DocxRenderer renderer = DocxRenderer.builder(options).build();

	Node document = parser.parse(md);
	renderer.render(document, doc);
}

private MutableDataSet createOptions() {
	MutableDataSet options = new MutableDataSet();
	options.set(Parser.EXTENSIONS, getParserExtensions());
	options.set(DocxRenderer.SUPPRESS_HTML, true);
	return options;
}

private List<Parser.ParserExtension> getParserExtensions() {
	return Arrays.asList(DefinitionExtension.create(),
			EmojiExtension.create(),
			FootnoteExtension.create(),
			StrikethroughSubscriptExtension.create(),
			InsExtension.create(),
			SuperscriptExtension.create(),
			TablesExtension.create(),
			TocExtension.create(),
			SimTocExtension.create(),
			WikiLinkExtension.create());
}

As a result, we get the following DOCX document: output.docx. In this document, the styles and the line breaks are not the same as in the original HTML.
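The DOCX side can be checked in a similar way, because a .docx file is a ZIP archive whose main text normally lives in `word/document.xml`. The sketch below uses only the Java standard library to count `w:br` and `w:p` elements in that part. `DocxBreakCounter` is a hypothetical helper, the regular expressions are a shortcut rather than a real XML parse, and the code assumes the conventional `w:` namespace prefix. Page and column breaks are also `w:br` elements, so they are included in the count.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical diagnostic helper; not part of flexmark-java or docx4j. */
final class DocxBreakCounter {

    // <w:br/>, <w:br w:type="page"/>, ... (assumes the conventional "w:" prefix).
    private static final Pattern W_BR = Pattern.compile("<w:br(\\s[^>]*)?/?>");
    // <w:p>, <w:p w:rsidR="...">, <w:p/>, but not <w:pPr> or <w:pStyle>.
    private static final Pattern W_P = Pattern.compile("<w:p(\\s[^>]*)?/?>");

    /** Extracts word/document.xml from a DOCX stream as a UTF-8 string. */
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    /** Counts break elements (line, page and column breaks) in the document XML. */
    static int countLineBreaks(String documentXml) {
        return count(W_BR, documentXml);
    }

    /** Counts paragraph start tags in the document XML. */
    static int countParagraphs(String documentXml) {
        return count(W_P, documentXml);
    }

    private static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int total = 0;
        while (matcher.find()) {
            total++;
        }
        return total;
    }
}
```

For example, passing `Files.newInputStream(Paths.get("output.docx"))` to `readDocumentXml` and then calling `countLineBreaks` and `countParagraphs` on the result gives numbers that can be compared with the number of line breaks and paragraphs in the original HTML.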

I also tried something like the approach in Issue #515, but it didn't help.

What's the problem? And how can I solve it?

Feb 06 '23 18:02 ahmad2702