turndown
turndown copied to clipboard
Anchor tag with image as grandchild produces invalid Markdown
Affected version: 6.0.0
I found a particular case where Turndown will produce incorrect Markdown given the following HTML:
<a href="http://google.com">
<div>
<img src="https://via.placeholder.com/150">
</div>
</a>
It will produce:
[

](http://google.com)
which renders like this: (in GitHub and elsewhere) [
](http://google.com)
This occurs whether I am using linkStyle: 'inlined'
or 'referenced'
.
I have added a working demo on CodeSandbox, including a hacky fix I will be using in my project. https://codesandbox.io/s/turndown-image-link-issue-k12z8?file=/src/App.js
The thing is that div
is a block element, which triggers a paragraph implicitly. Whereas paragraphs are a blocks in Markdown, images are inline and cannot contain a paragraph. So your HTML input represents something not expressible in Markdown.
Consider using span
instead or avoid hyperlink-wrapped blocks in another way.
This project doesn't appear to be designed to work with perfect HTML though, it seems to be capable of taking crappy HTML from WYSIWYG editors and the like and covert it back into nice, pure Markdown, and in fact has been doing so pretty well so far.
I have been using this in my fork of a project to convert WordPress format exports (originally from Squarespace) to markdown, because the exports I got from a Squarespace site for some reason generated a lot of these <a>
s wrapping <div>
s wrapping <img>
s.
Seems worth trying to produce something valid rather than something that is categorically incorrect, which is what the current output is.
@SamLeatherdale did you ever find a workaround for this? I'm running in to the same issue
@SamLeatherdale did you ever find a workaround for this? I'm running in to the same issue
It's been a while, but I believe I ran something like Cheerio over it first and stripped out the extra div.
We did the same as @SamLeatherdale writes on a project a while ago - just with pure DOM. I guess the guys also switched to Cheerio recently. Sharing the relevant snippets:
Helper class:
class DomFilter {
constructor(dom) {
this.dom = dom;
this.window = this.dom.window;
this.document = this.window.document;
}
select(xpath, contextNode = null) {
const ctx = contextNode || this.document;
const arr = [];
const { ORDERED_NODE_ITERATOR_TYPE } = this.window.XPathResult;
const it = this.document.evaluate(xpath, ctx, null, ORDERED_NODE_ITERATOR_TYPE);
let node;
while ((node = it.iterateNext())) {
arr.push(node);
}
return arr;
}
// [...]
// eslint-disable-next-line class-methods-use-this
unwrap(nodes) {
nodes.forEach((node) => {
const parent = node.parentNode;
while (node.firstChild) {
parent.insertBefore(node.firstChild, node);
}
parent.removeChild(node);
});
}
// [...]
}
Then something like:
const dom = ...;
const f = new DomFilter(dom);
f.unwrap(f.select('//a/div'));
// now pass the modified dom to Turndown, no need to convert it to HTML string
Ref this:
Seems worth trying to produce something valid rather than something that is categorically incorrect, which is what the current output is.
It's a little more complicated. We need to introduce Markdown contexts to do it both universally and efficiently. This isolated case has indeed a simpler solution, but I don't like the idea to provide case-by-case fixes for these, especially when there is no universally correct solution. With contexts, users would be able to choose what to do with block elements nested in inline elements. It's simple for a div
without semantics. But for e.g. a table inside a link, something valid means either discarding user's data or just keeping it in HTML (which denies Turndown itself). So in this cases, users have to choose. Now they can only choose what to do with these cases using HTML preprocessing - unfortunately.
And for the record:
This project doesn't appear to be designed to work with perfect HTML though,
True.
it seems to be capable of taking crappy HTML from WYSIWYG editors and the like and covert it back into nice, pure Markdown, and in fact has been doing so pretty well so far.
It's a very welcome side effect and we try to achieve it. Still, the higher priority is not to lose data. You're right it's a topic to work on, as I suggested above.