highlight.js
Support Prism.js grammar parsing but use the Highlight.js HTML/theme pipeline
Related #2212.
I'm not sure this belongs in core yet, and I don't necessarily want a hard or soft Prism dependency. I suppose perhaps you could also pass Prism into the function itself though so it's late binding?
Caveats
- Using custom parsers is still private API since the Emitter API is not public yet, hence the __emitTokens name of the property.
  - See #3620
What needs to be done
- Figure out where this goes, how to package it, etc.
- Write a comprehensive prismTypeToScope function
  - Ref: https://github.com/highlightjs/highlight.js/issues/2212#issuecomment-1243663341
  - Ref: https://prismjs.com/tokens.html
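As a starting point for the mapping task above, here is a minimal sketch of what a table-driven prismTypeToScope could look like. The table name PRISM_TO_HLJS_SCOPE is hypothetical, and the pairings are a best guess between Prism's standard token names and Highlight.js scope names. They are incomplete and not a vetted mapping.

```javascript
// Hypothetical lookup table: Prism token type -> Highlight.js scope.
// The pairings are assumptions for illustration and deliberately incomplete.
const PRISM_TO_HLJS_SCOPE = {
  "comment": "comment",
  "prolog": "meta",
  "doctype": "meta",
  "keyword": "keyword",
  "builtin": "built_in",
  "class-name": "title.class",
  "function": "title.function",
  "boolean": "literal",
  "number": "number",
  "string": "string",
  "char": "string",
  "regex": "regexp",
  "symbol": "symbol",
  "variable": "variable",
  "constant": "variable.constant",
  "property": "property",
  "operator": "operator",
  "punctuation": "punctuation",
  "tag": "tag",
  "attr-name": "attr",
  "attr-value": "string",
  "inserted": "addition",
  "deleted": "deletion"
};

function prismTypeToScope(name) {
  // Own-property check so names like "constructor" are not resolved
  // through the prototype chain.
  if (Object.prototype.hasOwnProperty.call(PRISM_TO_HLJS_SCOPE, name)) {
    return PRISM_TO_HLJS_SCOPE[name];
  }
  // Unknown types fall back to the Prism name, as the original stub does.
  return name;
}
```

Falling back to the unchanged name keeps the behavior of the one-line stub for anything the table does not cover, so the table can grow incrementally.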
Example Usage
To replace our JavaScript support with Prism's:
import Prism from 'prismjs'
const prismJS = fromPrism(Prism.languages.javascript)
hljs.registerLanguage("javascript", prismJS)
Code
function prismTypeToScope(name) {
  // very poor results, must be improved
  return name;
}

function prismTokensToEmitter(tokens, emitter) {
  tokens.forEach(token => {
    if (typeof(token) === "string") {
      emitter.addText(token)
    } else if (token.type) {
      let scope = prismTypeToScope(token.type)
      if (typeof(token.content) === "string") {
        emitter.addKeyword(token.content, scope)
      } else { // array of tokens
        emitter.openNode(scope)
        prismTokensToEmitter(token.content, emitter)
        emitter.closeNode()
      }
    }
  })
}

function prismParserWrapper(code, emitter) {
  const tokens = Prism.tokenize(code, this._prism)
  prismTokensToEmitter(tokens, emitter)
}

function fromPrism(prismGrammar) {
  return function(hljs) {
    let self = {}
    Object.assign(self, {
      _prism: prismGrammar,
      __emitTokens: prismParserWrapper.bind(self)
    })
    return self
  }
}
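To see what the token walk produces without loading either library, the converter can be run against a stand-in emitter that only records calls. This is a sketch under assumptions: createRecordingEmitter is a hypothetical helper, not part of Highlight.js, and the tokens are hand-written objects in the { type, content } shape the code above expects rather than real Prism output.

```javascript
function prismTypeToScope(name) {
  return name; // same pass-through stub as above
}

function prismTokensToEmitter(tokens, emitter) {
  tokens.forEach((token) => {
    if (typeof token === "string") {
      emitter.addText(token);
    } else if (token.type) {
      const scope = prismTypeToScope(token.type);
      if (typeof token.content === "string") {
        emitter.addKeyword(token.content, scope);
      } else {
        // Nested token list: wrap the children in a scope node.
        emitter.openNode(scope);
        prismTokensToEmitter(token.content, emitter);
        emitter.closeNode();
      }
    }
  });
}

// Hypothetical stand-in emitter: records each call instead of building HTML.
function createRecordingEmitter() {
  const calls = [];
  return {
    calls,
    addText(text) { calls.push(["addText", text]); },
    addKeyword(text, scope) { calls.push(["addKeyword", text, scope]); },
    openNode(scope) { calls.push(["openNode", scope]); },
    closeNode() { calls.push(["closeNode"]); }
  };
}

// Hand-written sample tokens (an assumption, not real tokenizer output).
const sampleTokens = [
  { type: "keyword", content: "let" },
  " s = ",
  { type: "template-string", content: [
    { type: "string", content: "hi " },
    { type: "interpolation", content: [
      { type: "punctuation", content: "${" },
      "name",
      { type: "punctuation", content: "}" }
    ] }
  ] }
];

const recorder = createRecordingEmitter();
prismTokensToEmitter(sampleTokens, recorder);
recorder.calls.forEach((call) => console.log(call.join(" | ")));
```

Each nested token list yields one openNode/closeNode pair, so the recorded calls should always balance. That makes a cheap sanity check when experimenting with the scope mapping.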
@RunDevelopment If I wanted to package this as a small stand-alone ESM module, would I just import Prism and let people's bundlers/client-side figure it out or would it be simpler to just pass in the Prism object that the user is responsible to import on their own? I'm spoiled by the fact that Highlight.js has zero real runtime dependencies.
(not caring about CJS for this ATM)
Or perhaps Prism.languages.javascript has some backreference to Prism itself? That'd be useful here.
Thoughts? Actually I don't even know if Prism is ESM client-side yet, perhaps not?
> Actually I don't even know if Prism is ESM client-side yet
We are going to be. ESM is very much planned as the only module system. We are likely also going to have monolithic files for compatibility, but that's about it.
> would I just import Prism
Hopefully not. Prism v2 will be very explicit about instances. One of the problems we had with v1 was that Prism was a global namespace. This made things like testing and typing really difficult. In v2, Prism is a class. There will be a global instance (for compatibility and convenience), but you shouldn't assume that people are going to use it.
So the fromPrism function should probably look like this:
// Take a Prism instance and the id of the language to adapt.
function fromPrism(prism: Prism, id: string) {
  return function(hljs) {
    return {
      __emitTokens(code, emitter) {
        let grammar = prism.components.getLanguage(id)
        if (!grammar) {
          // Decide how to handle missing grammars. I'm just gonna create an empty grammar.
          grammar = {}
        }
        const tokens = prism.tokenize(code, grammar)
        prismTokensToEmitter(tokens, emitter)
      }
    }
  }
}

// or

// Take a component proto and add it to your own Prism instance.
function fromPrism(proto: import("prismjs").ComponentProto) {
  const prism = getHLJSPrismInstance()
  prism.components.add(proto)
  // same as the above
  return fromPrism(prism, proto.id);
}
Also, language grammars are lazily evaluated in v2. They might even be re-evaluated later because of optional dependencies. So no matter what, your API must not take grammar objects. Use either ids or component protos.
> getHLJSPrismInstance()
I'm not sure this would be a thing (or that I see the need?) If someone wanted a single prism instance they should just create one and always use it with fromPrism... if they wanted one Prism per grammar for some reason, they could do that... not sure we should care?
If they wanted to get the prism instance "attached" to a specific grammar we could expose those on the returned grammar object and then could just query it:
hljs.getLanguage("javascript")._prismInstance
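A rough sketch of that idea, assuming the factory simply stores the instance on the object it returns. The prism.components.getLanguage and prism.tokenize calls follow the v2 shape described earlier in this thread, and stubPrism is a hypothetical stand-in so the snippet runs without Prism. The tokensToEmitter parameter is also hypothetical and stands in for prismTokensToEmitter. Whether hljs.getLanguage hands back the same object with the extra property intact is not verified here.

```javascript
// Sketch: fromPrism variant that exposes the Prism instance it was given.
function fromPrism(prism, id, tokensToEmitter) {
  return function (hljs) {
    return {
      // Exposed so callers can find the instance behind this grammar.
      _prismInstance: prism,
      __emitTokens(code, emitter) {
        // Resolve the grammar at call time, since grammars may be lazy.
        const grammar = prism.components.getLanguage(id) || {};
        tokensToEmitter(prism.tokenize(code, grammar), emitter);
      }
    };
  };
}

// Hypothetical stub with the two members the sketch relies on.
const stubPrism = {
  components: {
    getLanguage: (id) => (id === "javascript" ? {} : undefined)
  },
  // Returns the whole input as a single plain-text token.
  tokenize: (code, grammar) => [code]
};

// Minimal converter for the stub: every token is plain text.
const plainTextConverter = (tokens, emitter) => {
  tokens.forEach((token) => emitter.addText(token));
};

// Call the factory directly; hljs is unused by this sketch, so pass null.
const language = fromPrism(stubPrism, "javascript", plainTextConverter)(null);
console.log(language._prismInstance === stubPrism);
```

Someone who wants one shared instance passes the same object to every fromPrism call. One instance per grammar works the same way, so the wrapper does not need to manage instances itself.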
> I'm not sure this would be a thing (or that I see the need?)
Same. I just wanted to show how an API that only takes a component proto would be implemented. I just wasn't sure in which direction you want to take this.
I'm not sure. I'm hopeful someone comes along who's interested in the capability. I'm not really looking to maintain further pieces of HLJS outside of core... so right now we'd need a good reason to have it in core - or it's fair game for anyone who wants to come along and just make a minimal wrapper library and release it. Right now it definitely feels like more of a plugin/add-on.
Long term I'm very curious what support like this will do for the bigger picture.
And of course it already works - I tested it. It just needs to be packaged up nicely with some tiny amount of docs, etc... (and of course the scope <-> class mapping effort)
I'm currently slightly more interested in wrapping CodeMirror's JSX/TSX Lexer since JSX/TSX is (IME) so hard to get right with pure regex and not a full parser. Though that's a size price to be paid for all that power.