
Fenced code blocks with multiple children (spans from highlighters) not converted

Open sanzoghenzo opened this issue 1 year ago • 3 comments

Hi, first of all, many thanks for your work. I'm using this library in my Android app and it's working really well!

Unfortunately, a user of the app opened an issue because some code blocks in a webpage don't get converted: only the first line is displayed.

That specific webpage was created with Jekyll from a markdown source, so I'm expecting that many other websites could be affected.

This is an excerpt from the page:

<div class="language-python highlighter-rouge">
  <div class="highlight">
<pre class="highlight"><code><span class="c1">#! /usr/bin/env python3
</span>
<span class="kn">import</span> <span class="nn">tika</span>
<span class="kn">from</span> <span class="nn">tika</span> <span class="kn">import</span> <span class="n">parser</span>

<span class="n">fileIn</span> <span class="o">=</span> <span class="s">"berk011veel01_01.epub"</span>
<span class="n">fileOut</span> <span class="o">=</span> <span class="s">"berk011veel01_01.txt"</span>

<span class="n">parsed</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">fileIn</span><span class="p">)</span>
<span class="n">content</span> <span class="o">=</span> <span class="n">parsed</span><span class="p">[</span><span class="s">"content"</span><span class="p">]</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fileOut</span><span class="p">,</span> <span class="s">'w'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</code></pre>
  </div>
</div>

The issue is at this line: only the first child of the code tag is read.

sanzoghenzo, Jan 28 '24 12:01

I tried a custom rule: replacing the offending line with childNodes().map((e) => e.textContent).join() renders all the code.
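To make the difference concrete, here is a minimal sketch written directly against package:html rather than html2md's rule API. It assumes the `html` package from pub.dev is available; the helper names firstChildText and allChildrenText and the shortened markup are made up for the example.

```dart
import 'package:html/parser.dart' show parseFragment;

/// Text of the first child of <code> only, which is the behaviour
/// described in this issue.
String firstChildText(String html) {
  final code = parseFragment(html).querySelector('code');
  return code?.firstChild?.text ?? '';
}

/// Text of every child node of <code>, concatenated in document order.
String allChildrenText(String html) {
  final code = parseFragment(html).querySelector('code');
  if (code == null) return '';
  // Whitespace between the highlighter spans lives in its own text nodes,
  // so joining with the default empty separator keeps the original layout.
  return code.nodes.map((n) => n.text ?? '').join();
}

void main() {
  const html = '<pre class="highlight"><code>'
      '<span class="kn">import</span> <span class="nn">tika</span>\n'
      '<span class="n">fileIn</span> <span class="o">=</span> '
      '<span class="s">"a.epub"</span>\n'
      '</code></pre>';

  print(firstChildText(html)); // import
  print(allChildrenText(html)); // both lines of the snippet
}
```

With highlighted markup the first child is just the first span, which is why only a fragment of the code survives unless every child is read.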

I'm not sure how to solve the language identification: that info appears two divs up instead of on the first child. I don't know if it is a standard of some kind or a particular case of this website (it uses Jekyll and a Bootstrap theme).

sanzoghenzo, Jan 28 '24 12:01

Hi sanzoghenzo, I don't think that's a standard for code blocks.

In the custom rule, you can get the parents, or any other elements you want, by using the DOM API. For example (may not work): node.asElement()?.parent?.parent?.classes;

jarontai, Feb 02 '24 12:02

Thanks @jarontai, your example missed a parent, but it was very helpful!

I've turned it into a generic "walker" that checks every parent in turn; I'll leave it here for posterity:

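// Returns the language taken from a "language-xxx" class on the node's first
// child or, failing that, on the nearest ancestor that has one ('' if none).
String getLanguage(node) {
  var regex = RegExp(r'language-(\S+)');
  // Usual case: the class sits on the first child (the code tag).
  // The ?. avoids an exception when the node has no children.
  var className = node.firstChild?.className ?? '';
  var languageMatched = regex.firstMatch(className)?.group(1);
  if (languageMatched != null) {
    return languageMatched;
  }
  // Otherwise walk up the parents until one carries a language-xxx class.
  var nodeElement = node.asElement();
  while (nodeElement?.parent != null) {
    nodeElement = nodeElement.parent;
    for (var className in nodeElement.classes) {
      languageMatched = regex.firstMatch(className)?.group(1);
      if (languageMatched != null) {
        return languageMatched;
      }
    }
  }
  return '';
}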
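
For anyone who wants to try the idea outside a custom rule, a typed variant can be sketched against package:html's Element directly. This is only an illustration under that assumption, not html2md's API; languageFromAncestors and the trimmed markup are hypothetical names and data.

```dart
import 'package:html/dom.dart' show Element;
import 'package:html/parser.dart' show parseFragment;

final _languageClass = RegExp(r'language-(\S+)');

/// Returns the suffix of the first `language-xxx` class found on [start]
/// or on one of its ancestors (nearest first), or '' when there is none.
String languageFromAncestors(Element start) {
  Element? el = start;
  while (el != null) {
    for (final cls in el.classes) {
      final match = _languageClass.firstMatch(cls)?.group(1);
      if (match != null) return match;
    }
    // `parent` is null once the top of the parsed fragment is reached.
    el = el.parent;
  }
  return '';
}

void main() {
  // Same nesting as the excerpt at the top of this issue, shortened.
  const html = '<div class="language-python highlighter-rouge">'
      '<div class="highlight">'
      '<pre class="highlight"><code><span class="kn">import</span> tika\n'
      '</code></pre></div></div>';

  final pre = parseFragment(html).querySelector('pre')!;
  print(languageFromAncestors(pre)); // python
}
```

Starting the search at the element itself rather than at its parent also covers markup where the class sits directly on the pre or code tag.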

sanzoghenzo, Feb 05 '24 21:02