Code Fencing Not Retaining Language

Open jcrawford opened this issue 5 years ago • 3 comments

I am using jstransformer with the markdown transformer, and it creates a code block such as:

<pre><code class='lang-js'>code..</code></pre>

The issue I am having is that when the HTML is converted to Markdown, the code block is converted without retaining the language name. I have noticed that if I change my HTML source to use language-js, it works just fine. Is there a way to tell turndown to look for lang-js instead of language-js?

jcrawford avatar Jan 29 '19 21:01 jcrawford

The language- prefix is mentioned in the CommonMark spec: https://spec.commonmark.org/0.28/#fenced-code-blocks. However, the spec also states: "this spec does not mandate any particular treatment of the info string", so we could make this an option (fencedCodeBlockInfoStringPrefix?).

In the meantime, you could add your own rule. Following on from the usage guide:

turndownService.addRule('fencedCodeBlock', {
  filter: function (node, options) {
    return (
      options.codeBlockStyle === 'fenced' &&
      node.nodeName === 'PRE' &&
      node.firstChild &&
      node.firstChild.nodeName === 'CODE'
    )
  },

  replacement: function (content, node, options) {
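    // Read the class attribute of the inner <code> element, e.g. "lang-js"
    var className = node.firstChild.getAttribute('class') || ''
    // Capture whatever follows "lang-"; fall back to an empty info string
    var language = (className.match(/lang-(\S+)/) || [null, ''])[1]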

    return (
      '\n\n' + options.fence + language + '\n' +
      node.firstChild.textContent +
      '\n' + options.fence + '\n\n'
    )
  }
})
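
If both class conventions can appear in the same input, the language lookup in the rule above could be relaxed to accept either prefix. The following is only a sketch: languageFromClass is a hypothetical helper name, not part of Turndown, and the regexp is one possible choice rather than the library's own behavior.

```javascript
// Hypothetical helper (not part of Turndown): extract the fence info string
// from a class attribute, accepting both "lang-" and "language-" prefixes.
function languageFromClass (className) {
  // The prefix must start a class token, so "slang-x" does not match.
  var match = (className || '').match(/(?:^|\s)lang(?:uage)?-(\S+)/)
  return match ? match[1] : ''
}

console.log(languageFromClass('lang-js'))          // js
console.log(languageFromClass('hljs language-py')) // py
console.log(languageFromClass('plain'))            // (empty string)
```

Inside the replacement function, the call languageFromClass(className) would take the place of the className.match(...) expression.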

domchristie avatar Jan 29 '19 22:01 domchristie

I agree it would be nice to have this out of the box, mainly for when one uses the "online service" Paste to Markdown (not sure if it is the official web frontend or not).

AntonioGHub avatar Mar 28 '19 08:03 AntonioGHub

In our custom rule, we configure this using an array of regexps, where capture group 1 of the first matched regexp is expected to give the resulting info string.

From our config:

const defaultOptions = {
  // ...
  codeClassPatterns: [
    /^(?:language|code)-(\S+)$/, // CommonMark spec and JIRA
    /^(\S+)\s+syntaxhl$/, // Redmine
  ],
  // ...
}

Evaluation is then as brief as:

function codeLanguageFromClassName(className, options) {
  return options.codeClassPatterns.reduce((match, regexp) => (
    match || (className.match(regexp) || [null, null])[1]
  ), null);
}
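
To illustrate how the first-match-wins evaluation behaves, here is a small self-contained run of the function above against a few sample class strings. The sample inputs and the variable names samples and results are made up for this sketch. Note that, because the patterns are anchored with ^ and $, they match the whole class attribute, so a class string with extra tokens yields null under this particular config.

```javascript
const options = {
  codeClassPatterns: [
    /^(?:language|code)-(\S+)$/, // CommonMark spec and JIRA
    /^(\S+)\s+syntaxhl$/, // Redmine
  ],
};

// Returns capture group 1 of the first pattern that matches, or null.
function codeLanguageFromClassName(className, options) {
  return options.codeClassPatterns.reduce((match, regexp) => (
    match || (className.match(regexp) || [null, null])[1]
  ), null);
}

// Hypothetical sample inputs, chosen only for illustration.
const samples = ['language-js', 'code-java', 'ruby syntaxhl', 'lang-js', 'language-js hljs'];
const results = samples.map((className) => codeLanguageFromClassName(className, options));

console.log(results); // [ 'js', 'java', 'ruby', null, null ]
```

Supporting the lang- prefix from the original question would then only need one more entry in codeClassPatterns.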

The great advantage is that a single Turndown instance/config can cover several different HTML sources, and we are not limited to suffixes (see the pattern for Redmine). This would probably be good for Paste to Markdown as well.

I can incorporate this if @domchristie likes it. The performance impact should be negligible, and the added complexity is justifiable IMO.

martincizek avatar Dec 08 '20 17:12 martincizek