
Question: Using marked lexer/parser in a highlighting pipeline?

Open joshgoebel opened this issue 2 years ago • 16 comments

I'm the current maintainer of Highlight.js. I'm posting this as a question because I'd like feedback on whether this is a good idea, and whether there are any big gotchas I'm not thinking of... I'm not sure we want to increase the dependencies of the core library, but perhaps we could experiment with this idea via a highlightjs-markdown 3rd-party grammar, etc...


Describe the feature

I've been long considering the idea of allowing some grammars to take advantage of actual parsers for languages (rather than just a bunch of discrete regex rules)... for example when Highlight.js goes to highlight a Markdown file one might imagine the process looking a bit like this:

  • call out to marked to lex/parse the Markdown into tokens/blocks
  • provide a custom marked Renderer that generates an internal Highlight.js TokenTree from the parsed marked tokens
  • finish with our own standard TokenTree#toHTML() which generates HTML output

This would mean that instantly our highlighting of Markdown would gain all the fidelity and precision offered by the Marked parsing engine... much increased accuracy in exchange for larger download size (marked is larger than our regex grammar rules).

Note: I'm not talking about Marked using highlight.js to help render... I'm talking about Highlight.js using marked to help highlight Markdown files...

Why is this feature necessary?

To improve highlighting of grammars that can't be fully expressed with simple regex rules alone.

The issue that led to my posting this: highlightjs/highlight.js#3519

But there have been many similar issues in the past... Markdown is truly hard/impossible to get right using our own internal grammar engine because it's not really super context-aware, and writing super context-aware grammars gets messy very, very fast.

Describe alternatives you've considered

More and gnarlier regex...

joshgoebel avatar Apr 14 '22 20:04 joshgoebel

One related question: does marked use look-behinds in any of its regexes? (therefore breaking on Safari)

joshgoebel avatar Apr 14 '22 20:04 joshgoebel

I don't believe we use any look-behinds.

That would be doable:

marked.lex(`
# heading

[**bold** *text*](link.html)
`);

will return tokens like:

[
  { type: "heading", raw: "# heading\n\n", depth: 1, text: "heading", tokens: [
    { type: "text", raw: "heading", text: "heading" }
  ]},
  { type: "paragraph", raw: "[**bold** *text*](link.html)", text: "[**bold** *text*](link.html)", tokens: [
    { type: "link", raw: "[**bold** *text*](link.html)", href: "link.html", title: null, text: "**bold** *text*", tokens: [
      { type: "strong", raw: "**bold**", text: "bold", tokens: [
        { type: "text", raw: "bold", text: "bold" }
      ]},
      { type: "text", raw: " ", text: " " },
      { type: "em", raw: "*text*", text: "text", tokens: [
        { type: "text", raw: "text", text: "text" }
      ]}
    ]}
  ]}
]

We could find a way to translate that into highlight.js tokens.
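A first pass at that translation might map marked token types onto Highlight.js scope names and wrap each scoped token's raw text, recursing through unscoped containers. The scope table and walk below are assumptions sketched for illustration, not part of either library:

```javascript
// Hypothetical mapping from marked token types to hljs scope names.
const SCOPES = { heading: 'section', strong: 'strong', em: 'emphasis', link: 'link' };

function toSpans(tokens) {
  return tokens.map((tok) => {
    const scope = SCOPES[tok.type];
    // Scoped tokens keep their raw text so no bytes are lost.
    if (scope) return `<span class="hljs-${scope}">${tok.raw}</span>`;
    // Containers like paragraph just recurse; leaves fall back to raw text.
    return tok.tokens ? toSpans(tok.tokens) : (tok.raw ?? '');
  }).join('');
}
```

Note this already surfaces a trade-off discussed later in the thread: wrapping raw keeps the output byte-for-byte, but gives up highlighting children nested inside a scoped token.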

UziTech avatar Apr 14 '22 20:04 UziTech

I'm a bit confused seeing the text repeated multiple times, but would a good rule be to ignore the parent text attribute when a token has child tokens? Is all the actual verbatim text content eventually going to get dumped into a type:text node at the bottom of the tree?

I suppose I could just look at your own rendering code to see how it's handling that. :)

joshgoebel avatar Apr 14 '22 21:04 joshgoebel

Oh actually we'd have to be more careful since we'd have to look at raw too... rebuilding the markdown might result in some weird edge cases, so we'd really want the scopes AND the raw text... that might make it a bit harder I think.

joshgoebel avatar Apr 14 '22 21:04 joshgoebel

@UziTech Is there any option to make the parser/lexer spit out more context, such as the position (index) of tokens in the original source string?

joshgoebel avatar Apr 14 '22 21:04 joshgoebel

No, the position is not saved, but it could be figured out from the raw text. It is a little difficult because some text is converted right away (like \r\n to \n and tabs to 4 spaces), so raw isn't exactly what was sent in, but it is as close as we can get.
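Since raw is (mostly) a verbatim slice of the input, positions can often be recovered by scanning for each token's raw with a moving cursor. This annotate helper is an assumption layered on top of the lexer output, not a marked feature; it yields -1 wherever normalization (\r\n, tabs) made raw diverge from the source:

```javascript
// Hypothetical position-recovery pass over marked's lexer output.
function annotate(tokens, source, from = 0) {
  let cursor = from;
  for (const tok of tokens) {
    tok.start = source.indexOf(tok.raw, cursor); // -1 if raw was normalized
    if (tok.start !== -1) {
      tok.end = tok.start + tok.raw.length;
      // Children live inside the parent's span, so search from its start.
      if (tok.tokens) annotate(tok.tokens, source, tok.start);
      cursor = tok.end; // siblings continue after this token
    }
  }
  return tokens;
}
```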

What would the ideal highlight.js tokens look like for that markdown?

UziTech avatar Apr 15 '22 02:04 UziTech

some text is converted right away

Are there any options to disable/configure that? (or is it required for the lexer?)

[screenshot of the resulting highlighted output]

(I added emphasis inside the header)

I'm now thinking the simplest thing (if we had reliable start/end indexes, or could generate them easily by walking the tree and examining raw) might be to just dynamically insert the scopes on our end based on the start/stop positions (which we'd need from marked)... Anything we want to "rewrite" significantly gets trickier, though... you'll see that we think of links entirely differently: as a string and a url/link component, with the string component not being further processed.

joshgoebel avatar Apr 15 '22 03:04 joshgoebel

Right now I'm unsure which raw components we'd even want to add up, though... because it's not as simple as "ignore the parent, always add up the children" or anything (at a glance). Say we take this example:

> require('marked').lexer('> I am using marked.')
[
  {
    type: "blockquote",
    raw: "> I am using marked.",
    tokens: [
      {
        type: "paragraph",
        raw: "I am using marked.",
        text: "I am using marked.",
        tokens: [
          {
            type: "text",
            raw: "I am using marked.",
            text: "I am using marked."
          }
        ]
      }
    ]
  },
  links: {}
]

The same content "I am using marked." is repeated 3 times in raw... and the top-most node also includes the prefix "> " for the block quote... we'd need to know that the block quote started at index 0 and that the paragraph started at position 2 (or 3, etc.; it would depend on the whitespace, I imagine)...

I mean I suppose we could try to write handlers for each type of token on our side... so that we analyze the >... prefix ourselves, then increment the index accordingly before we jump into the children... I think I was hoping it might be easier than that... the big benefit would be if we could leverage marked with just some very simple glue code. As soon as we have to build a whole in-between analyzer... the benefits start to disappear fast.

Might be time to dig into the lexer source and poke around.
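A per-type handler along those lines might peel off the `> ` markers itself before descending into the children. This is hypothetical glue, with hljs-quote as an assumed scope name and the single-line case only:

```javascript
// Hypothetical handler: scope the blockquote markers, then recurse.
function renderBlockquote(token, renderChildren) {
  // Peel the leading "> " prefix off the raw text (single-line case).
  const m = token.raw.match(/^((?:> ?)*)([\s\S]*)$/);
  return `<span class="hljs-quote">${m[1]}</span>${renderChildren(token.tokens)}`;
}
```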

joshgoebel avatar Apr 15 '22 03:04 joshgoebel

Once marked parses the markdown into tokens it should be pretty easy to parse the raw of each token with regexps to change it into highlight.js tokens. I'll try to write a poc tonight.

UziTech avatar Apr 15 '22 03:04 UziTech

Once marked parses the markdown into tokens it should be pretty easy to parse the raw of each token with regexps to change it into highlight.js tokens.

Oh yes, I'm not super worried about that part (I just wonder if there are any harder ones I'm not thinking of)... and for many tokens all we need is to wrap the text in a block (based on start/stop index)... some we really don't care about at all (paragraphs, etc.)...

I'll try to write a poc tonight.

That'd be awesome, though I wasn't necessarily asking for anyone to do the work - I was just trying to flesh out how feasible of an approach this is. The key thing is if we have 43,838 bytes in we need the same 43,838 bytes back out (just with HTML inserted for visual styling)... since we're just highlighting the raw code, we're not doing any rendering of the Markdown.

So my original idea of us providing a Renderer wouldn't work at all. One example: the renderer API has an hr() method, but it doesn't seem to get access to the raw text. So we'd need to not only know "an HR goes here" but also whether the original raw text was ---\n or ---------\n (or some other variant).

Of course you probably already realized that, since you jumped straight to the lexer output rather than talking about the Renderer.

joshgoebel avatar Apr 15 '22 03:04 joshgoebel

This is our internal "emitter" API that builds the token tree on our side: https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104

I had originally hoped we could just walk the lexer tree and then make calls into the emitter as we went (well, second hope, after realizing we can't just be a Renderer).
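A minimal walk along those lines might look like this, where the scope table is an assumption and the emitter exposes the openNode/addText/closeNode surface of the TokenTree linked above:

```javascript
// Sketch: walk marked's lexer tree, driving an hljs-style emitter.
const SCOPES = { heading: 'section', strong: 'strong', em: 'emphasis' };

function walk(tokens, emitter) {
  for (const tok of tokens) {
    const scope = SCOPES[tok.type];
    if (scope) emitter.openNode(scope);
    if (tok.tokens) walk(tok.tokens, emitter); // descend into children
    else emitter.addText(tok.raw ?? tok.text ?? ''); // leaves carry the text
    if (scope) emitter.closeNode();
  }
}
```

Note this runs straight into the prefix problem: a heading's children don't include the `# ` bytes, so the emitter would drop them without extra handling.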

joshgoebel avatar Apr 15 '22 03:04 joshgoebel

Since we replace some things before tokenization it wouldn't be possible to get a complete byte-for-byte transformation, but I think we can get close enough for the result to be usable. If we use a custom extension we get access to the whole token in the renderer functions.

The one thing I am thinking might be difficult is if we want to color things sometimes but not others (e.g. tokenize bold text except when it is inside link text). The renderer doesn't get information about parents.

UziTech avatar Apr 15 '22 03:04 UziTech

The renderer doesn't get the information about parents.

Yeah, I'm thinking just using the lex output directly might be simpler since it has all that information... for example for a link we'd probably just take the raw attribute and quickly process that to spit out the string and link portions.

I'm looking at the lexer now... \r\n is already known to be bad mojo with Highlight.js so the \n replacement is something we could probably live with (and may even enforce in the future anyways)... so I think that just leaves the tabs hmmmm.

joshgoebel avatar Apr 15 '22 04:04 joshgoebel

The browser doesn't distinguish between tabs and spaces anyway, and since the goal of highlight.js is to produce HTML that will be rendered by a browser, I don't know if that would be a big deal.

Here is a POC just to show how our renderer could be used to output the HTML for highlight.js. I know the goal is to get it into tokens, but I don't think it would be too difficult to change this to accomplish that.

It turned out to be easy to only render emphasis in the header and not in the link text. Just don't call parseInline on the tokens for the link.

// marked-highlight.js

export const highlight = {
  extensions: [
    {
      name: 'heading',
      level: 'block',
      renderer(token) {
        const match = token.raw.match(/^(#+ +).+(\n+)$/);
        const newlines = '\n<br />'.repeat(match[2].length);
        const text = this.parser.parseInline(token.tokens);
        return `<span class="hljs-section">${match[1]}${text}</span>${newlines}`;
      }
    },
    {
      name: 'paragraph',
      level: 'block',
      renderer(token) {
        return this.parser.parseInline(token.tokens);
      }
    },
    {
      name: 'em',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-emphasis">${token.raw}</span>`;
      }
    },
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-strong">${token.raw}</span>`;
      }
    },
    {
      name: 'link',
      level: 'inline',
      renderer(token) {
        return `\n[<span class="hljs-string">${token.text}</span>](<span class="hljs-link">${token.href}</span>)`;
      }
    }
  ]
};

// usage
import { marked } from 'marked';
import { highlight } from './marked-highlight.js';

marked.use(highlight);

console.log(marked(`
# heading *emphasis*

[**bold** *text*](link.html)
`));

output:

<span class="hljs-section"># heading <span class="hljs-emphasis">*emphasis*</span></span>
<br />
<br />
[<span class="hljs-string">**bold** *text*</span>](<span class="hljs-link">link.html</span>)

UziTech avatar Apr 15 '22 04:04 UziTech

I know the goal is to get it into tokens but I don't think it would be too difficult to figure out how to change this to accomplish that.

Do you know if there is a way to get non-text out of the pipeline? I turned all the return values into objects, but all I get at the end is a string of:

[object Object][object Object][object Object]

joshgoebel avatar Jul 14 '22 14:07 joshgoebel

The marked parser is made to output HTML in string format but you don't have to use the output. The renderer functions could instead populate some object and output an empty string.
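Concretely, that could look like an extension whose renderer pushes structured tokens into a side channel and returns an empty string, so marked's string concatenation is unaffected. The collected store and scope name here are illustrative, not part of marked:

```javascript
// Hypothetical collecting extension: gather tokens, emit no HTML.
const collected = [];

const collector = {
  extensions: [
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        collected.push({ scope: 'strong', raw: token.raw });
        return ''; // keep marked's string pipeline happy
      }
    }
  ]
};
```

After `marked.use(collector)` and a parse, `collected` would hold the tokens to feed into the Highlight.js emitter.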

UziTech avatar Jul 14 '22 16:07 UziTech

Right now though I'm unsure which raw components we'd even want to add up though... [...] Might be time to dig into the lexer source and poke around.

It's really a problem, and I have no idea how to map the lex result back onto the original content. What I'm focusing on is only the parts that are not code, rather than everything. Even if lex could return all the information we need (like indexes) for customizing the content, I still think we might need to run the rendering tasks several times to handle different types of content, which would definitely slow the site down.

@UziTech Is there any way we can customize the pipeline and do step-by-step processing of the content? Ideally we could even customize the order of the tasks in the pipeline.

For example:

```code```\n\n<!-- comments --> here\n\n```<!-- comments -->```\n# <!-- comments --> in heading `code`\n
marked(content, [
  'code',
  'custom1',
  'heading',
  'custom2',
]);
  1. Code step (should only use code rules to handle the content):
<pre><code>code</code></pre>\n\n<!-- comments --> here\n\n<pre><code><!-- comments --></code></pre>\n# <!-- comments --> in heading <code>code</code>\n
  2. Custom parser – comment replacer (can use Marked's renderer, or render the content as you wish):
// Replaces '<!-- comments -->' with 'text' (ignoring the code blocks):
<pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n# text in heading <code>code</code>\n
  3. Heading step:
<pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n<h1>text in heading <code>code</code></h1>\n
  4. Custom parser – add custom links to headings:
<pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n<h1 href='#text-1024'>text in heading <code>code</code></h1>\n
  5. Return the result.
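marked doesn't expose an ordered multi-pass pipeline like this today; its walkTokens hook runs one visitor per use() call. A rough approximation is to run your own ordered list of visitors over the lexer output. The runStages helper below is an assumption, not a marked API, and the stage names in the usage comment are hypothetical:

```javascript
// Hypothetical staged processing: each stage visits every token, in order.
function runStages(tokens, stages) {
  const visit = (toks, stage) => {
    for (const tok of toks) {
      stage(tok);
      if (tok.tokens) visit(tok.tokens, stage); // depth-first into children
    }
  };
  for (const stage of stages) visit(tokens, stage);
  return tokens;
}

// e.g. runStages(marked.lexer(src), [replaceComments, addHeadingLinks]);
```

Each stage still walks the whole tree, but only once per stage, which avoids re-lexing the content for every custom step.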

I don't think I need to open a new issue for what I said above, but if you'd prefer me to do that, let me know.

scruel avatar Sep 27 '22 06:09 scruel