Question: Using marked lexer/parser in a highlighting pipeline?
I'm the current maintainer of Highlight.js. I'm posting this as a question because I'd like feedback on whether this is a good idea or not, or if there are any big gotchas I'm not thinking of... I'm not sure we want to increase the dependencies of the core library, but perhaps we could experiment with this idea via a `highlightjs-markdown` 3rd-party grammar, etc...
Describe the feature
I've long been considering the idea of allowing some grammars to take advantage of actual parsers for languages (rather than just a bunch of discrete regex rules)... for example, when Highlight.js goes to highlight a Markdown file, one might imagine the process looking a bit like this:
- call out to `marked` to lex/parse the Markdown into tokens/blocks
- provide a custom `marked` `Renderer` that generates an internal Highlight.js `TokenTree` from the parsed `marked` tokens
- finish with our own standard `TokenTree#toHTML()`, which generates HTML output
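Roughly, the glue might look like this (a minimal sketch; `TokenTreeEmitter` and its method names just stand in for whatever our real emitter exposes, and it glosses over the raw-vs-children question that comes up below):

```js
import { marked } from 'marked';

// Sketch only: TokenTreeEmitter is a stand-in for Highlight.js's
// internal emitter, not its actual API.
function walk(tokens, emitter) {
  for (const token of tokens) {
    if (token.tokens) {
      emitter.openNode(token.type);  // map marked token types to hljs scopes
      walk(token.tokens, emitter);
      emitter.closeNode();
    } else {
      emitter.addText(token.raw);    // leaves carry the verbatim text
    }
  }
}

function highlightMarkdown(code, emitter) {
  walk(marked.lexer(code), emitter); // steps 1 and 2
  return emitter.toHTML();           // step 3: our standard HTML rendering
}
```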
This would mean that instantly our highlighting of Markdown would gain all the fidelity and precision offered by the Marked parsing engine... much increased accuracy in exchange for a larger download size (`marked` is larger than our regex grammar rules).
Note: I'm not talking about Marked using `highlight.js` to help render... I'm talking about Highlight.js using `marked` to help highlight Markdown files...
Why is this feature necessary?
To improve highlighting of grammars that can't be fully expressed with simple regex rules alone.
The issue that led to my posting this: highlightjs/highlight.js#3519
But there have been many similar issues in the past... Markdown is truly hard/impossible to get right using our own internal grammar engine because it's not really super context-aware, and writing super context-aware grammars gets messy very, very fast.
Describe alternatives you've considered
More and gnarlier regexes...
One related question: does marked use look-behinds in any of its regexes? (therefore breaking on Safari)
I don't believe we use any look-behinds.
That would be doable:
```js
marked.lexer(`
# heading

[**bold** *text*](link.html)
`);
```
will return tokens like:
```js
[
  {
    type: "heading", raw: "# heading\n\n", depth: 1, text: "heading",
    tokens: [
      { type: "text", raw: "heading", text: "heading" }
    ]
  },
  {
    type: "paragraph", raw: "[**bold** *text*](link.html)", text: "[**bold** *text*](link.html)",
    tokens: [
      {
        type: "link", raw: "[**bold** *text*](link.html)", href: "link.html", title: null, text: "**bold** *text*",
        tokens: [
          {
            type: "strong", raw: "**bold**", text: "bold",
            tokens: [ { type: "text", raw: "bold", text: "bold" } ]
          },
          { type: "text", raw: " ", text: " " },
          {
            type: "em", raw: "*text*", text: "text",
            tokens: [ { type: "text", raw: "text", text: "text" } ]
          }
        ]
      }
    ]
  }
]
```
We could find a way to translate that into highlight.js tokens.
I'm a bit confused seeing the `text` repeated multiple times, but would a good rule be to ignore the parent `text` attribute in cases where there are `tokens` children nodes? Is all the actual verbatim text content going to get dumped into a `type: "text"` node at the bottom of the tree eventually?
I suppose I could just look at your own rendering code to see how it's handling that. :)
Oh, actually we'd have to be more careful, since we'd have to look at `raw` too... rebuilding the Markdown might result in some weird edge cases, so we'd really want the scopes AND the raw text... that might make it a bit harder, I think.
@UziTech Is there any option to make the parser/lexer spit out more context, such as the position (index) of tokens in the original source string?
No, the position is not saved, but it could be figured out from the `raw` text. It is a little difficult because some text is converted right away (like `\r\n` to `\n` and tabs to 4 spaces), so `raw` isn't exactly what was sent in, but it is as close as we can get.
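Something like this could approximate it (a rough sketch: it assumes `source` was pre-normalized the same way, and it breaks for containers like blockquotes, whose children's `raw` has the per-line `>` prefixes stripped):

```js
// Walk the tree, locating each token's raw inside the source to
// recover absolute start/end offsets. Rough: indexOf returns -1 when
// the raw was rewritten and no longer appears verbatim.
function annotatePositions(tokens, source, from = 0) {
  let cursor = from;
  for (const token of tokens) {
    token.start = source.indexOf(token.raw, cursor);
    token.end = token.start + token.raw.length;
    if (token.tokens) annotatePositions(token.tokens, source, token.start);
    cursor = token.end;
  }
}
```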
What would the ideal highlight.js tokens look like for that markdown?
> some text is converted right away
Are there any options to disable/configure that? (or is it required for the lexer?)
(I added emphasis inside the header)
I'm now thinking the simplest thing (if we had reliable start/end indexes, or could generate them easily based on walking the tree and examining `raw`) might be to just dynamically insert the scopes on our end based on the start/stop positions (which we'd need from `marked`)... and then for anything we want to "rewrite" significantly it gets trickier... you'll see that we think of links entirely differently, as a `string` and url/`link` component... with the string component not being further processed.

Right now though I'm unsure which `raw` components we'd even want to add up... because it's not as simple as "ignore the parent, always add up the children" or anything (at a glance). Say, taking the example:
```js
> require('marked').lexer('> I am using marked.')
[
  {
    type: "blockquote",
    raw: "> I am using marked.",
    tokens: [
      {
        type: "paragraph",
        raw: "I am using marked.",
        text: "I am using marked.",
        tokens: [
          {
            type: "text",
            raw: "I am using marked.",
            text: "I am using marked."
          }
        ]
      }
    ]
  },
  links: {}
]
```
The same content "I am using marked." is repeated 3 times in raw
... and the top-most node also includes the prefix "> "
for the block quote... we'd need to know that block quote started at index 0
and that paragraph started at position 2
(or 3, etc, it would depend in the whitespace I imagine)...
I mean, I suppose we could try to write handlers for each type of token on our side... so that we analyze the `>...` prefix ourselves, then increment the index accordingly before we jump into the children... I think I was hoping it might be easier than that... the big benefit would be if we could leverage `marked` with just some very simple glue code. As soon as we have to build a whole in-between analyzer... the benefits start to disappear fast.

Might be time to dig into the lexer source and poke around.
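For the record, the per-type handler idea might look something like this (hypothetical; only blockquote is shown, and a single offset is already not quite right since multi-line quotes strip `>` from every line):

```js
// Hypothetical handlers answering "how far into this parent's raw do
// the children start?" for token types that add prefix syntax.
const childStartOffset = {
  blockquote: (token) => (token.raw.match(/^>\s?/) || [''])[0].length,
};

function startOfChildren(token) {
  const handler = childStartOffset[token.type];
  return handler ? handler(token) : 0;
}
```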
Once marked parses the Markdown into tokens it should be pretty easy to parse the `raw` of each token with regexps to change it into highlight.js tokens. I'll try to write a POC tonight.
> Once marked parses the Markdown into tokens it should be pretty easy to parse the `raw` of each token with regexps to change it into highlight.js tokens.
Oh yes, I'm not super worried about that part (I just wonder if there are any harder ones I'm not thinking of)... and for many tokens all we need is to wrap the text in a block (based on start/stop index)... some we really don't care about at all (paragraphs, etc)...
> I'll try to write a POC tonight.
That'd be awesome, though I wasn't necessarily asking for anyone to do the work; I was just trying to flesh out how feasible this approach is. The key thing is: if we have 43,838 bytes in, we need the same 43,838 bytes back out (just with HTML inserted for visual styling)... since we're just highlighting the raw code, we're not doing any rendering of the Markdown.
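In test form the invariant is roughly this (hand-waving HTML escaping; `highlightMarkdown` is the hypothetical glue sketched earlier):

```js
// Round-trip check: stripping our markup must reproduce the input.
// Real code would also need to unescape HTML entities.
const stripTags = (html) => html.replace(/<[^>]*>/g, '');
console.assert(stripTags(highlightMarkdown(input, emitter)) === input);
```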
So my original idea of us providing a `Renderer` wouldn't work at all. One example: the renderer API has an `hr()` method, but it doesn't seem to get access to the raw text. So we need to not only know "an HR goes here" but we need to know whether the original raw text was `---\n` or `---------\n` (or some other variant).

Of course you probably already realize that, since you jumped straight to the lexer output rather than talking about the `Renderer`.
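The lexer output, by contrast, should keep the raw text around; if I'm reading the token shape right, it's something like:

```js
marked.lexer('---------\n');
// roughly: [ { type: 'hr', raw: '---------\n' } ]
```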
This is our internal "emitter" API that builds the token tree on our side: https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104
I had originally hoped we could just walk the lexer tree and then make calls into the emitter as we went (well, second hope, after realizing we can't just be a `Renderer`).
Since we replace some things before tokenization it wouldn't be possible to get a complete byte-for-byte transformation, but I think we can get close enough for the result to be usable. If we use a custom extension we get access to the whole token in the renderer functions.

The one thing I am thinking might be difficult is if we want to color things sometimes but not others (e.g. tokenize bold text except when it is inside link text). The renderer doesn't get the information about parents.
> The renderer doesn't get the information about parents.
Yeah, I'm thinking just using the lex output directly might be simpler since it has all that information... for example, for a `link` we'd probably just take the `raw` attribute and quickly process that to spit out the `string` and `link` portions.
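Something quick and dirty like this, say (hypothetical; it assumes the plain `[text](href)` form and ignores titles, nesting, and escaped brackets):

```js
// Split a link token's raw into the string and link scopes we use.
function linkScopes(raw) {
  const m = raw.match(/^\[(.*)\]\((.*)\)$/);
  if (!m) return null;
  return [
    { scope: 'string', text: m[1] }, // the link text
    { scope: 'link', text: m[2] },   // the URL
  ];
}
```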
I'm looking at the lexer now... `\r\n` is already known to be bad mojo with Highlight.js, so the `\n` replacement is something we could probably live with (and may even enforce in the future anyways)... so I think that just leaves the tabs, hmmmm.
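(For the newlines we could even normalize up front on our side, so our byte counts stay in sync with what marked sees, e.g.:)

```js
// Pre-normalize line endings to match marked's own replacement.
const normalized = source.replace(/\r\n/g, '\n');
```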
The browser doesn't distinguish between tabs and spaces anyway, and since the goal of highlight.js is to produce HTML that should be rendered by a browser, I don't know if that would be a big deal.
Here is a POC just to show how our renderer could be used to output the HTML for highlight.js. I know the goal is to get it into tokens, but I don't think it would be too difficult to figure out how to change this to accomplish that.

It turned out to be easy to only render emphasis in the header and not in the link text. Just don't call `parseInline` on the tokens for the link.
```js
// marked-highlight.js
export const highlight = {
  extensions: [
    {
      name: 'heading',
      level: 'block',
      renderer(token) {
        const match = token.raw.match(/^(#+ +).+(\n+)$/);
        const newlines = '\n<br />'.repeat(match[2].length);
        const text = this.parser.parseInline(token.tokens);
        return `<span class="hljs-section">${match[1]}${text}</span>${newlines}`;
      }
    },
    {
      name: 'paragraph',
      level: 'block',
      renderer(token) {
        return this.parser.parseInline(token.tokens);
      }
    },
    {
      name: 'em',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-emphasis">${token.raw}</span>`;
      }
    },
    {
      name: 'strong',
      level: 'inline',
      renderer(token) {
        return `<span class="hljs-strong">${token.raw}</span>`;
      }
    },
    {
      name: 'link',
      level: 'inline',
      renderer(token) {
        return `\n[<span class="hljs-string">${token.text}</span>](<span class="hljs-link">${token.href}</span>)`;
      }
    }
  ]
};
```
```js
import { marked } from 'marked';
import { highlight } from './marked-highlight.js';

marked.use(highlight);

console.log(marked(`
# heading *emphasis*

[**bold** *text*](link.html)
`));
```
output:

```html
<span class="hljs-section"># heading <span class="hljs-emphasis">*emphasis*</span></span>
<br />
<br />
[<span class="hljs-string">**bold** *text*</span>](<span class="hljs-link">link.html</span>)
```
> I know the goal is to get it into tokens but I don't think it would be too difficult to figure out how to change this to accomplish that.
Do you know if there is a way to get non-text out of the pipeline? I turned all the `return` values into objects but all I get at the end is a string of:

`[object Object][object Object][object Object]`
The marked parser is made to output HTML in string format but you don't have to use the output. The renderer functions could instead populate some object and output an empty string.
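A sketch of that pattern, reusing the `em` extension from the POC above (the scope object shape is just illustrative):

```js
import { marked } from 'marked';

// Push scopes into a side channel and return '' so marked's string
// output is simply thrown away.
const collected = [];

marked.use({
  extensions: [{
    name: 'em',
    level: 'inline',
    renderer(token) {
      collected.push({ scope: 'emphasis', text: token.raw });
      return ''; // keep the string pipeline happy
    }
  }]
});
```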
> Right now though I'm unsure which `raw` components we'd even want to add up... As soon as we have to build a whole in-between analyzer... the benefits start to disappear fast.
It's really a problem, and I have no idea how to map the `lex` result back onto the content. What I am focusing on will only be the parts which are not code, rather than everything.

Even if `lex` can return all the information (like index) we need for customizing the content, I still think we might need to execute the rendering tasks several times if I want to handle different types of content, so it will definitely slow the site down.
@UziTech Is there any way we can customize the pipeline and do step-by-step processing of the content? The best would be if we could even customize the order of tasks in the pipeline.

For example:
````
```code```\n\n<!-- comments --> here\n\n```<!-- comments -->```\n# <!-- comments --> in heading `code`\n
````
```js
marked(content, [
  'code',
  'custom1',
  'heading',
  'custom2',
]);
```
- Code step (should only use code rules to handle the content):
  ```
  <pre><code>code</code></pre>\n\n<!-- comments --> here\n\n<pre><code><!-- comments --></code></pre>\n# <!-- comments --> in heading <code>code</code>\n
  ```
- Custom parser – comment replacer (can use Marked's renderer or render the content as you wish); replaces `<!-- comments -->` with `text` (ignoring the code blocks):
  ```
  <pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n# text in heading <code>code</code>\n
  ```
- Heading step:
  ```
  <pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n<h1>text in heading <code>code</code></h1>\n
  ```
- Custom parser – add custom links to heading:
  ```
  <pre><code>code</code></pre>\n\ntext here\n\n<pre><code><!-- comments --></code></pre>\n<h1 href='#text-1024'>text in heading <code>code</code></h1>\n
  ```
- return result
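marked doesn't take an ordered task list like that today, but chained `marked.use()` calls plus the `walkTokens` hook (which visits every token after lexing) might approximate the step ordering; a rough sketch:

```js
import { marked } from 'marked';

// Hypothetical "steps" expressed with marked's real walkTokens hook;
// the ordering of multiple registered hooks is worth verifying.
marked.use({
  walkTokens(token) {
    // custom1: e.g. rewrite comments in plain text tokens (code tokens
    // are a different type, so they are left alone)
    if (token.type === 'text') {
      token.text = token.text.replace(/<!-- comments -->/g, 'text');
    }
  }
});

marked.use({
  walkTokens(token) {
    // custom2: e.g. tag heading tokens so a renderer can add links
    if (token.type === 'heading') token.customLink = true;
  }
});
```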
I don't think I need to open a new issue for what I said above; if you prefer me to do that, let me know.