rehype-pretty-code icon indicating copy to clipboard operation
rehype-pretty-code copied to clipboard

Support arbitrary delimiter in wordHighlighter

Open chikamichi opened this issue 1 year ago • 2 comments

Hi!

Context

I needed to match a pattern containing a / character, therefore / being used as the hard-coded delimiter proved limiting.

Proposal

Would you consider a PR building upon this core change (I used patch-package to get things going quickly):

diff --git a/node_modules/rehype-pretty-code/dist/rehype-pretty-code.cjs b/node_modules/rehype-pretty-code/dist/rehype-pretty-code.cjs
index 54c1c36..26d8a36 100644
--- a/node_modules/rehype-pretty-code/dist/rehype-pretty-code.cjs
+++ b/node_modules/rehype-pretty-code/dist/rehype-pretty-code.cjs
@@ -12897,12 +12897,12 @@ function rehypePrettyCode(options = {}) {
         const wordIdsMap = new Map();
 
         const wordMatches = meta
-          ? [...meta.matchAll(/\/(.*?)\/(\S*)/g)]
+          ? [...meta.matchAll(/(?<delimiter>["\/])(?<word>.*?)\k<delimiter>(?<wordIdAndOrRange>\S*)/g)]
           : undefined;
         if (Array.isArray(wordMatches)) {
           wordMatches.forEach((name, index) => {
-            const word = wordMatches[index][1];
-            const wordIdAndOrRange = wordMatches[index][2];
+            const { word, wordIdAndOrRange } = name.groups;
             words.push(word);
 
             const [range, id] = wordIdAndOrRange.split('#');

Remarks

  • Here’s a test sample on Regex101 (I used PCRE as I feel like it’s closer to V8’s Irregexp engine than EcmaScript’s, but it works regardless the engine).
  • I leveraged named capture groups to increase readibility, which have been supported since Node 10. As rehype-pretty-code requires "node": "^12.16.0 || >=13.2.0" I feel like it’s fine using them.

Questions

  • There still is a hard-coded list of allowed delimiters, hence that will never cover all use-cases (eg. what if my pattern contains both " and /?). I feel like we could provide a sane, exported default list and also allow developers to provide their own as an escape hatch (and because we’d export the default list, they could build upon it for maximum compatibility/future-proofness).
    • Another option would be to support pattern-matching within the pattern-matching. It seems you considered it for the doc reads Note regex itself is not yet supported.
  • Hacking on the code, I felt like the naming for some variables is subpar. words/word in particular, as word may actually contain multiple words (any pattern the developer wants to match within the code block, really). I didn’t alter that naming but I’d propose doing it in the PR.

What do you think?

chikamichi avatar Feb 26 '23 01:02 chikamichi

There still is a hard-coded list of allowed delimiters, hence that will never cover all use-cases (eg. what if my pattern contains both " and /?).

Is it not possible to use \ to add escaping like strings when they contain ' or "?

Hacking on the code, I felt like the naming for some variables is subpar. words/word in particular, as word may actually contain multiple words (any pattern the developer wants to match within the code block, really). I didn’t alter that naming but I’d propose doing it in the PR.

I agree with this, it was named like that as there were limitations to how it worked but it's been able to capture any string for a year now, there's an issue open: https://github.com/atomiks/rehype-pretty-code/issues/64

atomiks avatar Feb 26 '23 03:02 atomiks

There still is a hard-coded list of allowed delimiters, hence that will never cover all use-cases (eg. what if my pattern contains both " and /?).

Is it not possible to use \ to add escaping like strings when they contain ' or "?

With the current implementation I guess not, the meta string is dumb-parsed. Support for escaping sequences would require treating each pattern (the infamous word variable) as a regexp itself, not as a mere string. Maybe that's the right call.

Hacking on the code, I felt like the naming for some variables is subpar. words/word in particular, as word may actually contain multiple words (any pattern the developer wants to match within the code block, really). I didn’t alter that naming but I’d propose doing it in the PR.

I agree with this, it was named like that as there were limitations to how it worked but it's been able to capture any string for a year now, there's an issue open: #64

Thanks for pointing that out.

chikamichi avatar Feb 26 '23 06:02 chikamichi