highlight.js Support Prism.js grammar parsing but use the Highlight.js HTML/theme pipeline

Related #2212.

I'm not sure this belongs in core yet, and I don't necessarily want a hard or soft Prism dependency. Perhaps Prism could instead be passed into the function itself, so that it's bound late?

Caveats

Using custom parsers is still private API since the Emitter API isn't public yet. Hence the __emitTokens name of the property.
See #3620

What needs to be done

Figure out where this goes, how to package it, etc.
Write a comprehensive prismTypeToScope function
- Ref: https://github.com/highlightjs/highlight.js/issues/2212#issuecomment-1243663341
- Ref: https://prismjs.com/tokens.html
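
One possible starting point for prismTypeToScope is a lookup table with a pass-through fallback. The pairs below are a partial, hand-picked guess at equivalents between Prism's standard token names and Highlight.js scopes, not a vetted mapping, and PRISM_TO_HLJS_SCOPE is a hypothetical name used only for this sketch.

```javascript
// Partial, unvetted mapping from Prism token types to Highlight.js scopes.
// Every pair here is an assumption that should be checked against both
// projects' documentation before being relied on.
const PRISM_TO_HLJS_SCOPE = {
  "keyword": "keyword",
  "builtin": "built_in",
  "class-name": "title.class",
  "function": "title.function",
  "boolean": "literal",
  "number": "number",
  "string": "string",
  "regex": "regexp",
  "operator": "operator",
  "punctuation": "punctuation",
  "comment": "comment",
  "property": "property",
  "variable": "variable",
  "constant": "variable.constant",
  "symbol": "symbol",
  "tag": "tag",
  "attr-name": "attr",
  "inserted": "addition",
  "deleted": "deletion",
  "bold": "strong",
  "italic": "emphasis"
};

function prismTypeToScope(name) {
  // Use an own-property check so names like "constructor" are not
  // resolved through the prototype chain; unknown names pass through.
  return Object.prototype.hasOwnProperty.call(PRISM_TO_HLJS_SCOPE, name)
    ? PRISM_TO_HLJS_SCOPE[name]
    : name;
}
```

Falling back to the Prism name keeps the current behavior for anything the table does not cover, so the table can be filled in incrementally.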

Example Usage

To replace our JavaScript support with Prism's:

import Prism from 'prismjs'
const prismJS = fromPrism(Prism.languages.javascript)
hljs.registerLanguage("javascript", prismJS)

Code

function prismTypeToScope(name) {
  // very poor results, must be improved
  return name;
}

function prismTokensToEmitter(tokens, emitter) {
  tokens.forEach(token => {
    if (typeof(token) === "string") {
      emitter.addText(token)
    } else if (token.type) {
      let scope = prismTypeToScope(token.type)
      if (typeof(token.content) === "string") {
        emitter.addKeyword(token.content, scope)
      } else { // array of tokens
        emitter.openNode(scope)
        prismTokensToEmitter(token.content, emitter)
        emitter.closeNode()
      }
    }
  })
}

function prismParserWrapper(code, emitter) {
  // `this` is the language object that fromPrism binds this function to
  const tokens = Prism.tokenize(code, this._prism)
  prismTokensToEmitter(tokens, emitter)
}

function fromPrism(prismGrammar) {
  return function(hljs) {
    let self = {}

    Object.assign(self, {
      _prism: prismGrammar,
      __emitTokens: prismParserWrapper.bind(self)
    })
    return self
  }
}
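
To see what the traversal does without loading Prism or Highlight.js, here is a minimal sketch that runs it against a recording emitter. createRecordingEmitter and sampleTokens are hypothetical: the emitter only implements the four methods the code above calls (the real emitter is private API and may differ between Highlight.js versions), and the tokens are hand-written in the string / { type, content } shape the traversal expects.

```javascript
// Stand-in for the mapping function; passes names through unchanged.
function prismTypeToScope(name) {
  return name;
}

// Same traversal as above, repeated so this snippet runs on its own.
function prismTokensToEmitter(tokens, emitter) {
  tokens.forEach(token => {
    if (typeof token === "string") {
      emitter.addText(token);
    } else if (token.type) {
      const scope = prismTypeToScope(token.type);
      if (typeof token.content === "string") {
        emitter.addKeyword(token.content, scope);
      } else { // array of nested tokens
        emitter.openNode(scope);
        prismTokensToEmitter(token.content, emitter);
        emitter.closeNode();
      }
    }
  });
}

// Hypothetical emitter that records each call instead of building HTML.
function createRecordingEmitter() {
  const calls = [];
  return {
    calls,
    addText(text) { calls.push(["addText", text]); },
    addKeyword(text, scope) { calls.push(["addKeyword", text, scope]); },
    openNode(scope) { calls.push(["openNode", scope]); },
    closeNode() { calls.push(["closeNode"]); }
  };
}

// Hand-written tokens: a flat token, plain text, and one nested token.
const sampleTokens = [
  { type: "keyword", content: "const" },
  " s = ",
  { type: "template-string", content: [
    { type: "string", content: "`hi`" }
  ] }
];

const emitter = createRecordingEmitter();
prismTokensToEmitter(sampleTokens, emitter);
console.log(emitter.calls);
```

The recorded calls come out in document order: addKeyword for the flat token, addText for the plain string, then openNode, a nested addKeyword, and closeNode for the token whose content is an array.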

Sep 13 '22 15:09 joshgoebel

@RunDevelopment If I wanted to package this as a small stand-alone ESM module, would I just import Prism and let people's bundlers/client-side figure it out or would it be simpler to just pass in the Prism object that the user is responsible to import on their own? I'm spoiled by the fact that Highlight.js has zero real runtime dependencies.

(not caring about CJS for this ATM)

Or perhaps Prism.languages.javascript has some backreference to Prism itself? That'd be useful here.

Thoughts? Actually I don't even know if Prism is ESM client-side yet, perhaps not?

Sep 13 '22 15:09 joshgoebel

Actually I don't even know if Prism is ESM client-side yet

We are going to be. ESM is very much planned as the only module system. We are likely also going to have monolithic files for compatibility, but that's about it.

would I just import Prism

Hopefully not. Prism v2 will be very explicit about instances. One of the problems we had with v1 was that Prism was a global namespace. This made things like testing and typing really difficult. In v2, Prism is a class. There will be a global instance (for compatibility and convenience), but you shouldn't assume that people are going to use it.

So the fromPrism function should probably look like this:

// Take a Prism instance and the id of the language to adapt.
function fromPrism(prism: Prism, id: string) {
  return function(hljs) {
    return {
      __emitTokens(code, emitter) {
        let grammar = prism.components.getLanguage(id)
        if (!grammar) {
          // Decide how to handle missing grammars. I'm just gonna create an empty grammar.
          grammar = {}
        }
        const tokens = prism.tokenize(code, grammar)
        prismTokensToEmitter(tokens, emitter)
      }
    }
  }
}

// or

// Take a component proto and add it to your own Prism instance.
function fromPrism(proto: import("prismjs").ComponentProto) {
  const prism = getHLJSPrismInstance()
  prism.components.add(proto)
  // same as the above
  return fromPrism(prism, proto.id);
}

Also, language grammars are lazily evaluated in v2. They might even be re-evaluated later because of optional dependencies. So no matter what, your API must not take grammar objects. Use either ids or component protos.
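
A minimal sketch of why looking the grammar up by id on every call matters. createStubPrism is a hypothetical stand-in, not Prism's API: it models only components.getLanguage and tokenize as they appear in the snippet above, plus an invented components.set helper to simulate a grammar being evaluated late or re-evaluated. The toy tokenizer just tags the whole input with the grammar's label.

```javascript
// Hypothetical stub of a Prism v2 instance, for illustration only.
function createStubPrism() {
  const grammars = new Map();
  return {
    components: {
      // Invented helper: simulates a grammar being (re-)evaluated.
      set(id, grammar) { grammars.set(id, grammar); },
      getLanguage(id) { return grammars.get(id); }
    },
    // Toy tokenizer: one token tagged with the grammar's label,
    // or the raw string when the grammar is empty.
    tokenize(code, grammar) {
      return grammar.label ? [{ type: grammar.label, content: code }] : [code];
    }
  };
}

function fromPrism(prism, id) {
  return function (hljs) {
    return {
      __emitTokens(code, emitter) {
        // Resolve the grammar at call time; never cache the object.
        const grammar = prism.components.getLanguage(id) || {};
        prism.tokenize(code, grammar).forEach(token => {
          if (typeof token === "string") {
            emitter.addText(token);
          } else {
            emitter.addKeyword(token.content, token.type);
          }
        });
      }
    };
  };
}

const prism = createStubPrism();
const language = fromPrism(prism, "demo")(null); // hljs is unused in this sketch
const seen = [];
const recorder = {
  addText(text) { seen.push("text:" + text); },
  addKeyword(text, scope) { seen.push(scope + ":" + text); }
};

language.__emitTokens("a", recorder);           // grammar not registered yet
prism.components.set("demo", { label: "v1" });
language.__emitTokens("a", recorder);           // picks up the late grammar
prism.components.set("demo", { label: "v2" });  // simulated re-evaluation
language.__emitTokens("a", recorder);
console.log(seen);
```

Because the adapter holds only the instance and the id, the same language object reflects each change; an adapter that captured the grammar object at creation time would keep using whatever it saw first.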

Sep 13 '22 17:09 RunDevelopment

getHLJSPrismInstance()

I'm not sure this would be a thing (or that I see the need?) If someone wanted a single prism instance they should just create one and always use it with fromPrism... if they wanted one Prism per grammar for some reason, they could do that... not sure we should care?

If they wanted to get the prism instance "attached" to a specific grammar we could expose those on the returned grammar object and then could just query it:

hljs.getLanguage("javascript")._prismInstance

Sep 13 '22 17:09 joshgoebel

I'm not sure this would be a thing (or that I see the need?)

Same. I just wanted to show how an API that only takes a component proto would be implemented. I just wasn't sure in which direction you want to take this.

Sep 13 '22 20:09 RunDevelopment

I'm not sure. I'm hopeful someone comes along who's interested in the capability. I'm not really looking to maintain further pieces of HLJS outside of core... so right now we'd need a good reason to have it in core - or it's fair game for anyone who wants to come along and just make a minimal wrapper library and release it. Right now it definitely feels like more of a plugin/add-on.

Long term I'm very curious what support like this will do for the bigger picture.

And of course it already works - I tested it. It just needs to be packaged up nicely with some tiny amount of docs, etc... (and of course the scope <-> class mapping effort)

I'm currently slightly more interested in wrapping CodeMirror's JSX/TSX lexer, since JSX/TSX is (IME) so hard to get right with pure regex rather than a full parser. Though there's a size price to be paid for all that power.

Sep 13 '22 20:09 joshgoebel
