
Question: Position of a token in the source string

Open nidoro opened this issue 2 years ago • 34 comments

I'm working on a project (a syntax highlighter for an editor) which requires me to have access to the position of a token within the source string. After scanning through the lexer and parser documentation I didn't find a way to do so. Ideally, for my use case, the tokens returned by the lex(...) function would contain the character position (line number and column number) of the start and end of the token (or the token's raw size, which I think is already available).

Is there already a way to know the position of each token? If not, consider this a feature proposal :) I'm sure it is an easy thing to add.

nidoro avatar Jul 12 '21 19:07 nidoro

You are correct, we do not currently log the string positions of the tokens.

You may be able to get something to work with the walkTokens feature by tracking the running sum of the token "raw" lengths and adding a property to each token with the current total. Things would get more complex once you start getting into sub-tokens, but it should be possible.
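
A minimal sketch of that idea (an editorial illustration, not code from this thread). It only handles top-level block tokens, and it assumes the raw strings of consecutive top-level tokens tile the input exactly, which later comments show is not always true:

    // Sketch: compute a character offset for each top-level block token
    // by summing the lengths of the preceding tokens' raw strings.
    const { marked } = require('marked');

    function lexWithOffsets(src) {
      const tokens = marked.lexer(src);
      let offset = 0;
      for (const token of tokens) {
        token.position = offset;      // character offset into src
        offset += token.raw.length;   // assumes raw strings tile src exactly
      }
      return tokens;
    }

    console.log(lexWithOffsets('# heading\n\nparagraph').map(t => [t.type, t.position]));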

calculuschild avatar Jul 12 '21 19:07 calculuschild

I'm sure it is an easy thing to add

@nidoro We always appreciate PRs 😁👍

UziTech avatar Jul 12 '21 22:07 UziTech

I just started using the library, and my knowledge of its inner workings is too limited for me to make a pull request. But I made some changes that seem to be working. I'll explain what I did and would really appreciate your feedback to make sure I'm doing things correctly. I did some testing and things work 99% of the time, but I'm still missing something.

I've modified the lexer so that it returns the position of each token in the source. So each token returned by the lex(...) function has two new members: start: {line, column, index} and end: {line, column, index}. For my use case I only need the line and column, but I went ahead and included the index in case other users need it. Also, the interval [start, end] is inclusive, meaning the end is part of the token. My changes can be summarized in four steps:

  1. Three Lexer functions have been modified to accept an at parameter, which indicates where we are in the source file (at: {line, column, index}).
blockTokens(src, tokens, top, at)
inline(tokens, at)
inlineTokens(src, tokens, at, inLink, inRawBlock)
  2. The lex(...) function now looks like this:
function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      this.blockTokens(src, this.tokens, true, at); 
      at = {line: 0, column: 0, index: 0}; 
      this.inline(this.tokens, at);
      return this.tokens;
}
  3. I've implemented three helper functions in the Lexer:
function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
}
 
// Advances the 'at' iterator by 'count' characters.
function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
}

// Eats the token that starts 'src', meaning it sets the token
// start and end positions, advances the 'at' iterator to skip
// the token and returns the remaining string.
function eatToken(src, token, at) {
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
}
  4. Finally, it is just a matter of searching and replacing some function calls. The occurrences of src = src.substring(token.raw.length); have been replaced by src = this.eatToken(src, token, at), and the calls to blockTokens(...), inline(...) and inlineTokens(...) now include the at parameter.

I think the at parameter passed at these call sites sometimes needs to be a copy rather than a reference, but I'm not sure when. You can see where I passed a copy rather than a reference below. I've made the changes directly to /lib/marked.js, which I figured was the quickest and dirtiest way to test it.

Click to see Lexer changes (I've only pasted the "Block Lexer" section of the file, which contains all the changes)
  /**
   * Block Lexer
   */


  var Lexer_1 = /*#__PURE__*/function () {
    function Lexer(options) {
      this.tokens = [];
      this.tokens.links = Object.create(null);
      this.options = options || defaults$3;
      this.options.tokenizer = this.options.tokenizer || new Tokenizer$1();
      this.tokenizer = this.options.tokenizer;
      this.tokenizer.options = this.options;
      var rules = {
        block: block.normal,
        inline: inline.normal
      };

      if (this.options.pedantic) {
        rules.block = block.pedantic;
        rules.inline = inline.pedantic;
      } else if (this.options.gfm) {
        rules.block = block.gfm;

        if (this.options.breaks) {
          rules.inline = inline.breaks;
        } else {
          rules.inline = inline.gfm;
        }
      }

      this.tokenizer.rules = rules;
    }
    /**
     * Expose Rules
     */


    /**
     * Static Lex Method
     */
    Lexer.lex = function lex(src, options) {
      var lexer = new Lexer(options);
      return lexer.lex(src);
    }
    /**
     * Static Lex Inline Method
     */
    ;

    Lexer.lexInline = function lexInline(src, options) {
      var lexer = new Lexer(options);
      return lexer.inlineTokens(src);
    }
    /**
     * Preprocessing
     */
    ;

    var _proto = Lexer.prototype;

    _proto.lex = function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      
      this.blockTokens(src, this.tokens, true, at);
      
      at = {line: 0, column: 0, index: 0};
      
      this.inline(this.tokens, at);
      return this.tokens;
    }
    /**
     * Lexing
     */
    ;
    
    _proto.copyAt = function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
    }
    
    _proto.advance = function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
    }
    
    _proto.eatToken = function eatToken(src, token, at) {
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
    }

    _proto.blockTokens = function blockTokens(src, tokens, top, at) {
      var _this = this;

      if (tokens === void 0) {
        tokens = [];
      }

      if (top === void 0) {
        top = true;
      }
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (this.options.pedantic) {
        src = src.replace(/^ +$/gm, '');
      }

      var token, i, l, lastToken, cutSrc, lastParagraphClipped;

      while (src) {
        if (this.options.extensions && this.options.extensions.block && this.options.extensions.block.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this, src, tokens)) {
            src = _this.eatToken(src, token, at); // 'this' is not the Lexer inside this callback
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // newline


        if (token = this.tokenizer.space(src)) {
          src = this.eatToken(src, token, at);

          if (token.type) {
            tokens.push(token);
          }

          continue;
        } // code


        if (token = this.tokenizer.code(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1]; // An indented code block cannot interrupt a paragraph.

          if (lastToken && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // fences


        if (token = this.tokenizer.fences(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // heading


        if (token = this.tokenizer.heading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // table no leading pipe (gfm)


        if (token = this.tokenizer.nptable(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // hr


        if (token = this.tokenizer.hr(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // blockquote


        if (token = this.tokenizer.blockquote(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.blockTokens(token.text, [], top, this.copyAt(at));
          tokens.push(token);
          continue;
        } // list


        if (token = this.tokenizer.list(src)) {
          src = this.eatToken(src, token, at);
          l = token.items.length;

          for (i = 0; i < l; i++) {
            token.items[i].tokens = this.blockTokens(token.items[i].text, [], false, this.copyAt(at));
          }

          tokens.push(token);
          continue;
        } // html


        if (token = this.tokenizer.html(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // def


        if (top && (token = this.tokenizer.def(src))) {
          src = this.eatToken(src, token, at);

          if (!this.tokens.links[token.tag]) {
            this.tokens.links[token.tag] = {
              href: token.href,
              title: token.title
            };
          }

          continue;
        } // table (gfm)


        if (token = this.tokenizer.table(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // lheading


        if (token = this.tokenizer.lheading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // top-level paragraph
        // prevent paragraph consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startBlock) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this.options.extensions.startBlock.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (top && (token = this.tokenizer.paragraph(cutSrc))) {
          lastToken = tokens[tokens.length - 1];

          if (lastParagraphClipped && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          lastParagraphClipped = cutSrc.length !== src.length;
          src = this.eatToken(src, token, at);
          continue;
        } // text


        if (token = this.tokenizer.text(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _proto.inline = function inline(tokens, at) {
      var i, j, k, l2, row, token;
      var l = tokens.length;

      for (i = 0; i < l; i++) {
        token = tokens[i];

        switch (token.type) {
          case 'paragraph':
          case 'text':
          case 'heading':
            {
              token.tokens = [];
              this.inlineTokens(token.text, token.tokens, {line: token.start.line, column: token.start.column, index: token.start.index});
              break;
            }

          case 'table':
            {
              token.tokens = {
                header: [],
                cells: []
              }; // header

              l2 = token.header.length;

              for (j = 0; j < l2; j++) {
                token.tokens.header[j] = [];
                this.inlineTokens(token.header[j], token.tokens.header[j], at);
              } // cells


              l2 = token.cells.length;

              for (j = 0; j < l2; j++) {
                row = token.cells[j];
                token.tokens.cells[j] = [];

                for (k = 0; k < row.length; k++) {
                  token.tokens.cells[j][k] = [];
                  this.inlineTokens(row[k], token.tokens.cells[j][k], at);
                }
              }

              break;
            }

          case 'blockquote':
            {
              this.inline(token.tokens, at);
              break;
            }

          case 'list':
            {
              l2 = token.items.length;

              for (j = 0; j < l2; j++) {
                this.inline(token.items[j].tokens, at);
              }

              break;
            }
        }
      }

      return tokens;
    }
    /**
     * Lexing/Compiling
     */
    ;

    _proto.inlineTokens = function inlineTokens(src, tokens, at, inLink, inRawBlock) {
      var _this2 = this;
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (tokens === void 0) {
        tokens = [];
      }

      if (inLink === void 0) {
        inLink = false;
      }

      if (inRawBlock === void 0) {
        inRawBlock = false;
      }

      var token, lastToken, cutSrc; // String with links masked to avoid interference with em and strong

      var maskedSrc = src;
      var match;
      var keepPrevChar, prevChar; // Mask out reflinks

      if (this.tokens.links) {
        var links = Object.keys(this.tokens.links);

        if (links.length > 0) {
          while ((match = this.tokenizer.rules.inline.reflinkSearch.exec(maskedSrc)) != null) {
            if (links.includes(match[0].slice(match[0].lastIndexOf('[') + 1, -1))) {
              maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.reflinkSearch.lastIndex);
            }
          }
        }
      } // Mask out other blocks


      while ((match = this.tokenizer.rules.inline.blockSkip.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.blockSkip.lastIndex);
      } // Mask out escaped em & strong delimiters


      while ((match = this.tokenizer.rules.inline.escapedEmSt.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(this.tokenizer.rules.inline.escapedEmSt.lastIndex);
      }

      while (src) {
        if (!keepPrevChar) {
          prevChar = '';
        }

        keepPrevChar = false; // extensions

        if (this.options.extensions && this.options.extensions.inline && this.options.extensions.inline.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this2, src, tokens)) {
            src = _this2.eatToken(src, token, at);
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // escape


        if (token = this.tokenizer.escape(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // tag


        if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
          src = this.eatToken(src, token, at);
          inLink = token.inLink;
          inRawBlock = token.inRawBlock;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // link


        if (token = this.tokenizer.link(src)) {
          src = this.eatToken(src, token, at);

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), true, inRawBlock);
          }

          tokens.push(token);
          continue;
        } // reflink, nolink


        if (token = this.tokenizer.reflink(src, this.tokens.links)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), true, inRawBlock);
            tokens.push(token);
          } else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // em & strong


        if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // code


        if (token = this.tokenizer.codespan(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // br


        if (token = this.tokenizer.br(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // del (gfm)


        if (token = this.tokenizer.del(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(at), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // autolink


        if (token = this.tokenizer.autolink(src, mangle)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // url (gfm)


        if (!inLink && (token = this.tokenizer.url(src, mangle))) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // text
        // prevent inlineText consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startInline) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this2.options.extensions.startInline.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
          src = this.eatToken(src, token, at);

          if (token.raw.slice(-1) !== '_') {
            // Track prevChar before string of ____ started
            prevChar = token.raw.slice(-1);
          }

          keepPrevChar = true;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _createClass(Lexer, null, [{
      key: "rules",
      get: function get() {
        return {
          block: block,
          inline: inline
        };
      }
    }]);

    return Lexer;
  }();

Like I said, everything seems to be working 99% of the time, but I've noticed an incorrect result for the following markdown source, and I suspect there are other cases that would generate incorrect results:

> quote
> > > quote
# test

paragraph

I appreciate any help on this. Thank you for the library!

nidoro avatar Jul 13 '21 14:07 nidoro

I think this is going to be much harder (nearly impossible) because of the line src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');. If the user uses tabs we won't be able to tell whether four spaces are supposed to be one character or four.

UziTech avatar Jul 13 '21 22:07 UziTech

Also, this looks like it is going to slow marked down a lot, checking every character for \n.

UziTech avatar Jul 13 '21 22:07 UziTech

You raise valid points, but depending on the use case they may or may not matter much. For instance, in my use case we only feed marked tab-free input. And as for the performance hit, this is a trade-off I'm willing to make, and possibly other users are too. If it gets too out of hand (which will only happen for large files), I can try running it asynchronously.

Anyway, I do understand that this is not a highly demanded feature, but I do think it is an improvement to the library. Maybe you can make it optional, so the default behavior stays the same but a returnTokenPosition boolean option activates this.

I've fixed some bugs in my previous code. It is still not perfect, but it is better. The only remaining problem (hopefully) is with nested styles, like emphasis inside lists. But I think it is just a matter of time to get it to 100%.

Click to show code
  /**
   * Block Lexer
   */


  var Lexer_1 = /*#__PURE__*/function () {
    function Lexer(options) {
      this.tokens = [];
      this.tokens.links = Object.create(null);
      this.options = options || defaults$3;
      this.options.tokenizer = this.options.tokenizer || new Tokenizer$1();
      this.tokenizer = this.options.tokenizer;
      this.tokenizer.options = this.options;
      var rules = {
        block: block.normal,
        inline: inline.normal
      };

      if (this.options.pedantic) {
        rules.block = block.pedantic;
        rules.inline = inline.pedantic;
      } else if (this.options.gfm) {
        rules.block = block.gfm;

        if (this.options.breaks) {
          rules.inline = inline.breaks;
        } else {
          rules.inline = inline.gfm;
        }
      }

      this.tokenizer.rules = rules;
    }
    /**
     * Expose Rules
     */


    /**
     * Static Lex Method
     */
    Lexer.lex = function lex(src, options) {
      var lexer = new Lexer(options);
      return lexer.lex(src);
    }
    /**
     * Static Lex Inline Method
     */
    ;

    Lexer.lexInline = function lexInline(src, options) {
      var lexer = new Lexer(options);
      return lexer.inlineTokens(src);
    }
    /**
     * Preprocessing
     */
    ;

    var _proto = Lexer.prototype;

    _proto.lex = function lex(src) {
      src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');
      
      let at = {line: 0, column: 0, index: 0};
      
      this.blockTokens(src, this.tokens, true, at);
      
      at = {line: 0, column: 0, index: 0};
      
      this.inline(this.tokens, at);
      return this.tokens;
    }
    /**
     * Lexing
     */
    ;
    
    _proto.copyAt = function copyAt(at) {
      return {line: at.line, column: at.column, index: at.index};
    }
    
    _proto.advance = function advance(src, at, count) {
      for (let i = 0; i < count; ++i) {
        let c = src[i];
        if (c == '\n') {
          ++at.line;
          at.column = 0;
        } else {
          ++at.column;
        }
        ++at.index;
      }
    }
    
    _proto.eatToken = function eatToken(src, token, at) {
      // Offset of the token's text within its raw source; assumes
      // token.text occurs verbatim in src.
      let textStartOffset = src.indexOf(token.text);
      token.textStart = this.copyAt(at);
      this.advance(src, token.textStart, textStartOffset);
      
      token.start = this.copyAt(at);
      this.advance(src, at, token.raw.length-1);
      token.end   = this.copyAt(at);
      this.advance(src[token.raw.length-1], at, 1);
      return src.substring(token.raw.length);
    }

    _proto.blockTokens = function blockTokens(src, tokens, top, at) {
      var _this = this;

      if (tokens === void 0) {
        tokens = [];
      }

      if (top === void 0) {
        top = true;
      }
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (this.options.pedantic) {
        src = src.replace(/^ +$/gm, '');
      }

      var token, i, l, lastToken, cutSrc, lastParagraphClipped;

      while (src) {
        if (this.options.extensions && this.options.extensions.block && this.options.extensions.block.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this, src, tokens)) {
            src = _this.eatToken(src, token, at); // 'this' is not the Lexer inside this callback
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // newline


        if (token = this.tokenizer.space(src)) {
          src = this.eatToken(src, token, at);

          if (token.type) {
            tokens.push(token);
          }

          continue;
        } // code


        if (token = this.tokenizer.code(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1]; // An indented code block cannot interrupt a paragraph.

          if (lastToken && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // fences


        if (token = this.tokenizer.fences(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // heading


        if (token = this.tokenizer.heading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // table no leading pipe (gfm)


        if (token = this.tokenizer.nptable(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // hr


        if (token = this.tokenizer.hr(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // blockquote


        if (token = this.tokenizer.blockquote(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.blockTokens(token.text, [], top, this.copyAt(token.textStart));
          tokens.push(token);
          continue;
        } // list


        if (token = this.tokenizer.list(src)) {
          src = this.eatToken(src, token, at);
          l = token.items.length;

          for (i = 0; i < l; i++) {
            token.items[i].tokens = this.blockTokens(token.items[i].text, [], false, this.copyAt(token.textStart));
          }

          tokens.push(token);
          continue;
        } // html


        if (token = this.tokenizer.html(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // def


        if (top && (token = this.tokenizer.def(src))) {
          src = this.eatToken(src, token, at);

          if (!this.tokens.links[token.tag]) {
            this.tokens.links[token.tag] = {
              href: token.href,
              title: token.title
            };
          }

          continue;
        } // table (gfm)


        if (token = this.tokenizer.table(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // lheading


        if (token = this.tokenizer.lheading(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // top-level paragraph
        // prevent paragraph consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startBlock) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this.options.extensions.startBlock.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (top && (token = this.tokenizer.paragraph(cutSrc))) {
          lastToken = tokens[tokens.length - 1];

          if (lastParagraphClipped && lastToken.type === 'paragraph') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          lastParagraphClipped = cutSrc.length !== src.length;
          src = this.eatToken(src, token, at);
          continue;
        } // text


        if (token = this.tokenizer.text(src)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += '\n' + token.raw;
            lastToken.text += '\n' + token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _proto.inline = function inline(tokens, at) {
      var i, j, k, l2, row, token;
      var l = tokens.length;

      for (i = 0; i < l; i++) {
        token = tokens[i];

        switch (token.type) {
          case 'paragraph':
          case 'text':
          case 'heading':
            {
              token.tokens = [];
              this.inlineTokens(token.text, token.tokens, this.copyAt(token.textStart));
              break;
            }

          case 'table':
            {
              token.tokens = {
                header: [],
                cells: []
              }; // header

              l2 = token.header.length;

              for (j = 0; j < l2; j++) {
                token.tokens.header[j] = [];
                this.inlineTokens(token.header[j], token.tokens.header[j], at);
              } // cells


              l2 = token.cells.length;

              for (j = 0; j < l2; j++) {
                row = token.cells[j];
                token.tokens.cells[j] = [];

                for (k = 0; k < row.length; k++) {
                  token.tokens.cells[j][k] = [];
                  this.inlineTokens(row[k], token.tokens.cells[j][k], at);
                }
              }

              break;
            }

          case 'blockquote':
            {
              this.inline(token.tokens, at);
              break;
            }

          case 'list':
            {
              l2 = token.items.length;

              for (j = 0; j < l2; j++) {
                this.inline(token.items[j].tokens, at);
              }

              break;
            }
        }
      }

      return tokens;
    }
    /**
     * Lexing/Compiling
     */
    ;

    _proto.inlineTokens = function inlineTokens(src, tokens, at, inLink, inRawBlock) {
      var _this2 = this;
      
      if (at === void 0) {
        at = {line: 0, column: 0, index: 0};
      }

      if (tokens === void 0) {
        tokens = [];
      }

      if (inLink === void 0) {
        inLink = false;
      }

      if (inRawBlock === void 0) {
        inRawBlock = false;
      }

      var token, lastToken, cutSrc; // String with links masked to avoid interference with em and strong

      var maskedSrc = src;
      var match;
      var keepPrevChar, prevChar; // Mask out reflinks

      if (this.tokens.links) {
        var links = Object.keys(this.tokens.links);

        if (links.length > 0) {
          while ((match = this.tokenizer.rules.inline.reflinkSearch.exec(maskedSrc)) != null) {
            if (links.includes(match[0].slice(match[0].lastIndexOf('[') + 1, -1))) {
              maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.reflinkSearch.lastIndex);
            }
          }
        }
      } // Mask out other blocks


      while ((match = this.tokenizer.rules.inline.blockSkip.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '[' + repeatString('a', match[0].length - 2) + ']' + maskedSrc.slice(this.tokenizer.rules.inline.blockSkip.lastIndex);
      } // Mask out escaped em & strong delimiters


      while ((match = this.tokenizer.rules.inline.escapedEmSt.exec(maskedSrc)) != null) {
        maskedSrc = maskedSrc.slice(0, match.index) + '++' + maskedSrc.slice(this.tokenizer.rules.inline.escapedEmSt.lastIndex);
      }

      while (src) {
        if (!keepPrevChar) {
          prevChar = '';
        }

        keepPrevChar = false; // extensions

        if (this.options.extensions && this.options.extensions.inline && this.options.extensions.inline.some(function (extTokenizer) {
          if (token = extTokenizer.call(_this2, src, tokens)) {
            src = _this2.eatToken(src, token, at);
            tokens.push(token);
            return true;
          }

          return false;
        })) {
          continue;
        } // escape


        if (token = this.tokenizer.escape(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // tag


        if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
          src = this.eatToken(src, token, at);
          inLink = token.inLink;
          inRawBlock = token.inRawBlock;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // link


        if (token = this.tokenizer.link(src)) {
          src = this.eatToken(src, token, at);

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), true, inRawBlock);
          }

          tokens.push(token);
          continue;
        } // reflink, nolink


        if (token = this.tokenizer.reflink(src, this.tokens.links)) {
          src = this.eatToken(src, token, at);
          lastToken = tokens[tokens.length - 1];

          if (token.type === 'link') {
            token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), true, inRawBlock);
            tokens.push(token);
          } else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        } // em & strong


        if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // code


        if (token = this.tokenizer.codespan(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // br


        if (token = this.tokenizer.br(src)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // del (gfm)


        if (token = this.tokenizer.del(src)) {
          src = this.eatToken(src, token, at);
          token.tokens = this.inlineTokens(token.text, [], this.copyAt(token.textStart), inLink, inRawBlock);
          tokens.push(token);
          continue;
        } // autolink


        if (token = this.tokenizer.autolink(src, mangle)) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // url (gfm)


        if (!inLink && (token = this.tokenizer.url(src, mangle))) {
          src = this.eatToken(src, token, at);
          tokens.push(token);
          continue;
        } // text
        // prevent inlineText consuming extensions by clipping 'src' to extension start


        cutSrc = src;

        if (this.options.extensions && this.options.extensions.startInline) {
          (function () {
            var startIndex = Infinity;
            var tempSrc = src.slice(1);
            var tempStart = void 0;

            _this2.options.extensions.startInline.forEach(function (getStartIndex) {
              tempStart = getStartIndex.call(this, tempSrc);

              if (typeof tempStart === 'number' && tempStart >= 0) {
                startIndex = Math.min(startIndex, tempStart);
              }
            });

            if (startIndex < Infinity && startIndex >= 0) {
              cutSrc = src.substring(0, startIndex + 1);
            }
          })();
        }

        if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
          src = this.eatToken(src, token, at);

          if (token.raw.slice(-1) !== '_') {
            // Track prevChar before string of ____ started
            prevChar = token.raw.slice(-1);
          }

          keepPrevChar = true;
          lastToken = tokens[tokens.length - 1];

          if (lastToken && lastToken.type === 'text') {
            lastToken.raw += token.raw;
            lastToken.text += token.text;
          } else {
            tokens.push(token);
          }

          continue;
        }

        if (src) {
          var errMsg = 'Infinite loop on byte: ' + src.charCodeAt(0);

          if (this.options.silent) {
            console.error(errMsg);
            break;
          } else {
            throw new Error(errMsg);
          }
        }
      }

      return tokens;
    };

    _createClass(Lexer, null, [{
      key: "rules",
      get: function get() {
        return {
          block: block,
          inline: inline
        };
      }
    }]);

    return Lexer;
  }();

nidoro avatar Jul 14 '21 01:07 nidoro

I think this is going to be much harder (nearly impossible) because of the line src = src.replace(/\r\n|\r/g, '\n').replace(/\t/g, '    ');. If the user uses tabs we won't be able to tell whether four spaces are supposed to be one character or four.

Another way to make it work would be to let the user specify how many spaces a tab corresponds to.
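
For illustration, a sketch of that idea (the tabSize parameter here is hypothetical, not a real marked option):

    // Hypothetical preprocessing with a user-supplied tab width, so that
    // positions in the normalized string can be mapped back to the input.
    function preprocess(src, tabSize = 4) {
      return src.replace(/\r\n|\r/g, '\n').replace(/\t/g, ' '.repeat(tabSize));
    }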

nidoro avatar Jul 14 '21 02:07 nidoro

marked does save the raw src in the token, so I think, like @calculuschild said, it might be easiest to use walkTokens to add the line and column information. Then it can be an extension people can use if they want this information.

UziTech avatar Jul 14 '21 02:07 UziTech

Correct me if I'm wrong, but I think the raw member gets modified during the lexing process. I've seen lines in the source code like lastToken.raw += '\n' + token.raw;, which makes me think the raw member is not really equivalent to the user input, and that if I tried to calculate what line each token is at based on this raw string, I would get incorrect results.

nidoro avatar Jul 14 '21 02:07 nidoro

I think the raw member gets modified during the lexing process.

This is simply merging two adjacent tokens. Occasionally we have to break a paragraph in half to check whether a block of code or something else begins at that point. If it turns out that the second token is just the rest of a paragraph, we merge them back together. That's all that is happening; it should end up equivalent to the user input.

calculuschild avatar Jul 14 '21 02:07 calculuschild

In fact this is another good reason to do this in walkTokens. Some tokens are not completely formed when they are first added to the array, but waiting until walkTokens ensures that the raw values are properly merged and accurate.

calculuschild avatar Jul 14 '21 02:07 calculuschild

I see. I guess there is no advantage in changing the Lexer directly, then. Thanks for the clarification. I'll probably still use the solution I'm working on, as it is nearly done, but if I ever feel the need to use the walkTokens solution I'll share it here. Feel free to close the issue.

Again, thank you for the library!

nidoro avatar Jul 14 '21 02:07 nidoro

Hello again,

I abandoned the idea of modifying the lexer. Now I'm trying to use the raw tokens returned by marked.lexer(...) to calculate the position of each token. I was assuming that the concatenation of the raw members of the first-level tokens would give me back the original input. Unfortunately, that's not the case. The following call...

marked.lexer("> blockquote\n\nparagraph");

returns these tokens:

[
  { type: "blockquote", raw: "> blockquote\n", … }
  { type: "paragraph", raw: "paragraph", … }
]

If I tried to calculate the starting line of the paragraph (or of any following token) using the previous token as a reference, the line would be 2 (counting from 1). But in the original input the paragraph is at line 3, like so:

1 | > blockquote
2 | 
3 | paragraph

I'm working around this behavior by adding a line when I encounter a blockquote, but I don't know if that's reliable. So I have two questions:

  1. Is this a bug in marked? I think the raw member would be more intuitive if it were possible to reconstruct the original input using only that.
  2. Is skipping a line after a blockquote reliable? And are there other cases in which the same issue will happen?

nidoro avatar Jul 21 '21 01:07 nidoro

Welp, there is definitely no way raw can be used to reconstruct the input. The following two calls:

marked.lexer("> blockquote\n# heading");
marked.lexer("> blockquote\n\n# heading");

return the exact same tokens:

[
  { type: "blockquote", raw: "> blockquote\n", text: "blockquote\n", … }
  { type: "heading", raw: "# heading", depth: 1, … }
]

So it is impossible to tell whether the user entered one newline or two after a blockquote.

Is there a fix or workaround for this?

nidoro avatar Jul 21 '21 02:07 nidoro

raw should be the complete string that was consumed by the token. I can see how some leading or trailing newlines might be miscalculated in an off-by-one error somewhere that just happened not to affect the final HTML output.

So yes, it is a bug if the raw does not actually match the text that was consumed.

calculuschild avatar Jul 21 '21 04:07 calculuschild

Is there a fix or workaround for this?

Not yet, but if you want to create a PR we would be very appreciative :+1:

It looks like the space token doesn't always get saved to the token list (not sure why)

https://github.com/markedjs/marked/blob/e7b04a70ee125eb7da0a9202a6ae9c254c1967b5/src/Lexer.js#L143-L149

UziTech avatar Jul 21 '21 04:07 UziTech

I have a similar need for a project. It's a notes app that uses Marked to render, where some interactivity in the HTML (like checking a checkbox) has a direct effect on the source text (adding the x between [ ]). For this kind of thing it would be very helpful for Marked to expose the position of the rendered element in the source text.

I'll have a look at the walkTokens approach sometime soon. I'm only interested in the token's offset, but it should be easy to go from there to line/column values by counting the newlines up to the offset, as sketched below.
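
A small sketch of that conversion (editorial illustration):

    // Convert a character offset into 0-based line/column values by
    // counting newlines in the source up to that offset.
    function offsetToLineColumn(src, offset) {
      let line = 0;
      let lineStart = 0;
      for (let i = 0; i < offset; i++) {
        if (src[i] === '\n') {
          line++;
          lineStart = i + 1;
        }
      }
      return { line, column: offset - lineStart };
    }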

bartnv avatar Sep 02 '21 11:09 bartnv

So I've tried the walkTokens approach, but it went south pretty quickly. My approach was to keep a running sum of the length of the raw field in each token. It sort of works for block-level elements as long as the source text doesn't contain tabs or Windows line endings, but even then I had to account for quirks in the parser (for instance, the last newline of a paragraph is never shown in any raw field if the paragraph is immediately followed by another block-level element).

For inline items it all falls down. You'd need to know their offset from their containing block element, but there's no clean way to figure out how much of the raw input was consumed by the block-level element. For instance, an H1 heading can validly start with "# " or " # ". That influences the offset of its inline elements, but you'd need to re-parse the raw string to figure that out. I don't want to account for all current and future possibilities there, so I think this is a dead end.

For reference, this is the walkTokens function I was testing with (WHICH GIVES INVALID RESULTS, you have been warned):

  let walkTokens = function(token) {
    if (!token.seen) {
      token.offset = app.tokenoffset;
      app.tokenoffset += token.raw.length;
      if (token.type == 'paragraph') app.tokenoffset += 1;
    }
    if (token.tokens) { // Mark inline elements as seen as to not double-count them
      for (let item of token.tokens) item.seen = true;
    }
  };

The app object is my global state. app.tokenoffset needs to be initialized to 0 before each render.

bartnv avatar Oct 05 '21 22:10 bartnv

Looks like raw should be fixed in v4.0.9 #2341

UziTech avatar Jan 06 '22 15:01 UziTech

walkTokens full example for token position

Following @calculuschild's suggestion in https://github.com/markedjs/marked/issues/2134#issuecomment-878542797:

You may be able to get something to work with the walkTokens feature by tracking the sum of the token "raw" lengths and adding a property to each token with the current total. Things would get more complex once you start getting into sub-tokens though but it should be possible.

example:

    let walkTokens = (token) => {
        let subs = token.tokens || token.items;
        if (subs) {
            let start = (token._start || 0);
            let subpos = 0;
            subs.forEach(sub => {
                // Locate each sub-token's raw inside the parent's raw,
                // continuing the search after the previous sub-token so
                // repeated text isn't matched twice.
                let substart = token.raw.indexOf(sub.raw, subpos);
                let sublen = sub.raw.length;
                sub._start = substart + start;
                sub._end = sub._start + sublen;
                subpos = substart + sublen;
            });
        }
    }

Note: simply summing the "raw" lengths may not work well. For example, with "- [ ] Task1", the 'text' token "Task1" does not account for the length of the "- [ ] " prefix. So instead of summing raw lengths, search (indexOf) for each sub-token's raw within its parent at each level.

Full test case using mocha:

let {marked} = require("marked");
const assert = require("assert"); // Node's built-in assert, used below

// ...

describe("marked walkTokens", () => {
    let walkTokens = (token) => {
        let subs = token.tokens || token.items;
        if(subs){
            let start = (token._start || 0);
            let subpos = 0;
            subs.forEach(sub => {
                let substart = token.raw.indexOf(sub.raw, subpos);  
                let sublen = sub.raw.length;
                sub._start = substart + start;
                sub._end = sub._start + sublen;
                subpos = substart + sublen;
            });
        }
    }

    let testWalk = (md) => {
        let tokens = marked.lexer(md);
        let vroot = [{
            raw: md,
            tokens,
        }];
        marked.walkTokens(vroot, walkTokens);
        return tokens;
    }


    it(`case`, function(){
        let md = 
`## Title
- [ ] Task1 [#25](https://url)
  - [ ] Task2
- Item3
- Item4

## OK
![image](https://url)
`;

        let tokens = testWalk(md);


        let tokenTask = null;
        marked.walkTokens(tokens, (token)=>{
            if(token.type == "list_item" &&  
               token.text == "Task2"){
                tokenTask = token;
            }
        });

        if(tokenTask){
            let {_start, _end} = tokenTask;

            let before = md.substring(0, _start);
            let after = md.substring(_end);

            let content = before + tokenTask.raw + after;

            assert(content == md);
        }

    });
});

Tokens output: tokens with _start and _end positions in the source markdown string.

Tokens JSON detail:
[
  {
    "raw": "## Title\n- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n- Item3\n- Item4\n\n## OK\n![image](https://url)\n",
    "tokens": [
      {
        "type": "heading",
        "raw": "## Title\n",
        "depth": 2,
        "text": "Title",
        "tokens": [
          {
            "type": "text",
            "raw": "Title",
            "text": "Title",
            "_start": 3,
            "_end": 8
          }
        ],
        "_start": 0,
        "_end": 9
      },
      {
        "type": "list",
        "raw": "- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n- Item3\n- Item4",
        "ordered": false,
        "start": "",
        "loose": false,
        "items": [
          {
            "type": "list_item",
            "raw": "- [ ] Task1 [#25](https://url)\n  - [ ] Task2\n",
            "task": true,
            "checked": false,
            "loose": false,
            "text": "Task1 [#25](https://url)\n- [ ] Task2",
            "tokens": [
              {
                "type": "text",
                "raw": "Task1 [#25](https://url)\n",
                "text": "Task1 [#25](https://url)",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Task1 ",
                    "text": "Task1 ",
                    "_start": 15,
                    "_end": 21
                  },
                  {
                    "type": "link",
                    "raw": "[#25](https://url)",
                    "href": "https://url",
                    "title": null,
                    "text": "#25",
                    "tokens": [
                      {
                        "type": "text",
                        "raw": "#25",
                        "text": "#25",
                        "_start": 22,
                        "_end": 25
                      }
                    ],
                    "_start": 21,
                    "_end": 39
                  }
                ],
                "_start": 15,
                "_end": 40
              },
              {
                "type": "list",
                "raw": "- [ ] Task2",
                "ordered": false,
                "start": "",
                "loose": false,
                "items": [
                  {
                    "type": "list_item",
                    "raw": "- [ ] Task2",
                    "task": true,
                    "checked": false,
                    "loose": false,
                    "text": "Task2",
                    "tokens": [
                      {
                        "type": "text",
                        "raw": "Task2",
                        "text": "Task2",
                        "tokens": [
                          {
                            "type": "text",
                            "raw": "Task2",
                            "text": "Task2",
                            "_start": 48,
                            "_end": 53
                          }
                        ],
                        "_start": 48,
                        "_end": 53
                      }
                    ],
                    "_start": 42,
                    "_end": 53
                  }
                ],
                "_start": 42,
                "_end": 53
              }
            ],
            "_start": 9,
            "_end": 54
          },
          {
            "type": "list_item",
            "raw": "- Item3\n",
            "task": false,
            "checked": false,
            "loose": false,
            "text": "Item3",
            "tokens": [
              {
                "type": "text",
                "raw": "Item3",
                "text": "Item3",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Item3",
                    "text": "Item3",
                    "_start": 56,
                    "_end": 61
                  }
                ],
                "_start": 56,
                "_end": 61
              }
            ],
            "_start": 54,
            "_end": 62
          },
          {
            "type": "list_item",
            "raw": "- Item4",
            "task": false,
            "checked": false,
            "loose": false,
            "text": "Item4",
            "tokens": [
              {
                "type": "text",
                "raw": "Item4",
                "text": "Item4",
                "tokens": [
                  {
                    "type": "text",
                    "raw": "Item4",
                    "text": "Item4",
                    "_start": 64,
                    "_end": 69
                  }
                ],
                "_start": 64,
                "_end": 69
              }
            ],
            "_start": 62,
            "_end": 69
          }
        ],
        "_start": 9,
        "_end": 69
      },
      {
        "type": "space",
        "raw": "\n\n",
        "_start": 69,
        "_end": 71
      },
      {
        "type": "heading",
        "raw": "## OK\n",
        "depth": 2,
        "text": "OK",
        "tokens": [
          {
            "type": "text",
            "raw": "OK",
            "text": "OK",
            "_start": 74,
            "_end": 76
          }
        ],
        "_start": 71,
        "_end": 77
      },
      {
        "type": "paragraph",
        "raw": "![image](https://url)\n",
        "text": "![image](https://url)",
        "tokens": [
          {
            "type": "image",
            "raw": "![image](https://url)",
            "href": "https://url",
            "title": null,
            "text": "image",
            "_start": 77,
            "_end": 98
          }
        ],
        "_start": 77,
        "_end": 99
      }
    ]
  }
]

derekhu avatar Feb 20 '22 02:02 derekhu

Now my question is: how do we render extra information from a token into the HTML, like the _start and _end shown above? The renderer doesn't pass the token to the rendering functions, so no extra token data can be rendered to the DOM.

For an interactive application, we can't get token info from the DOM element the user acts on, e.g. clicking a - [ ] task checkbox to toggle - [ ] to - [x], etc.

derekhu avatar Feb 20 '22 05:02 derekhu

To get a checkbox change event you can just use JavaScript

document.querySelector("input[type=checkbox]").addEventListener("change", () => {...})

UziTech avatar Feb 20 '22 06:02 UziTech

To get a checkbox change event you can just use JavaScript

document.querySelector("input[type=checkbox]").addEventListener("change", () => {...})

The event is simple. But then, how do we know which task item should be updated?

for example:

- [ ] Task1
  - [ ] task1.1
- [ ] Task1
- [ ] Task3

When the user clicks the checkbox of the second Task1, which text should we update and toggle? By searching for the text of the <li>? Considering more complex cases, we need hints in the DOM. If the token were passed to the renderer, I could render the token positions _start and _end into the DOM <li>; then, in the click handler, we could get the text by position, toggle it, and write it back to the main content.
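
One possible workaround (an editorial sketch, not from this thread): match the clicked checkbox to its task list_item token by index, since walkTokens visits tokens in document order and task checkboxes are rendered in the same order. Here md and tokens are assumed to be the source string and its lexed tokens, with _start computed as above:

    // Collect task list_item tokens in document order.
    const taskTokens = [];
    marked.walkTokens(tokens, (token) => {
      if (token.type === 'list_item' && token.task) taskTokens.push(token);
    });

    // The Nth rendered checkbox corresponds to the Nth task token.
    document.querySelectorAll('input[type=checkbox]').forEach((box, i) => {
      box.addEventListener('change', () => {
        const { _start } = taskTokens[i];
        // Toggle "[ ]" / "[x]" at the task's position in the source.
        const bracket = md.indexOf('[', _start);
        md = md.slice(0, bracket + 1) + (box.checked ? 'x' : ' ') + md.slice(bracket + 2);
      });
    });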

derekhu avatar Feb 20 '22 06:02 derekhu

marked is only meant for converting markdown to html. For anything else you will need other tools.

UziTech avatar Feb 20 '22 07:02 UziTech

marked is only meant for converting markdown to html. For anything else you will need other tools.

Well, I think marked.js is great. More interactive things can be done through extensions.

As for the title of this issue, "Position of a token in the source string": I have solved it through marked.js's great extension mechanism, using walkTokens, raw, and the good structure of token and tokens. That is its greatest strength compared to other libraries.

However, that only covers "converting markdown to html". The HTML generated from the tokens could offer more extensibility.

So, just like @bartnv mentioned, he is building a notes app that uses Marked to render; interactive features can be added in an extensible way.

I am doing a similar thing, and marked.js works great for me. If it provided more extensibility between the token tree and the HTML rendering, it would be even better.

Detail:

https://marked.js.org/using_pro#block-level-renderer-methods


Thanks for the reply.

derekhu avatar Feb 20 '22 07:02 derekhu

I'm looking for this feature too. I think there is certainly some demand for it, for example passing the parsed markdown tokens (with line and character info) to a text editor, e.g. VSCode, so one can use it to develop an extension. Afaik there is no library that is fully capable of doing this.

I will try to take a look at the code and develop the feature when I get some free time. But I'd love to know if you have a fork, in case you completed the patching but cannot simply open a pull request against the repository.

yuis-ice avatar Feb 22 '22 10:02 yuis-ice

I have been maintaining a patch for line numbers: https://github.com/9001/copyparty/blob/hovudstraum/scripts/deps-docker/marked-ln.patch however, it has many problems:

  • there is an off-by-one at the start of every ~~table~~ list
  • i don't think it will be possible to add higher accuracy than line numbers
  • very hacky :p

9001 avatar Feb 22 '22 11:02 9001

It seems like, to get this working, we would need something like the token type (block or inline) in the walkTokens function.

UziTech avatar Feb 22 '22 19:02 UziTech

I've revisited this with version 4.0.12. The off-by-one errors in token.raw are indeed gone, thanks for that @UziTech. The only thing I would need from core marked is for each block-level token in walkTokens to carry the offset of its first child (inline) token within its raw string. So for a heading that would be the number of spaces and '#' characters at its start. For a list_item it would be the leading spaces, '*' or '-' characters and a possible checkbox. Etc.

I've looked briefly at how this could be accomplished. It would require setting the 'd' flag on the block-level regexes to get the 'indices' property on the match result, which gives the offsets of the submatches within the match. The tokenizer could then add this offset to the token object. If you'd be willing to consider this, I can prepare a PR.
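
For reference, a small illustration of the 'd' (hasIndices) regex flag (the heading pattern here is simplified for the example, not marked's actual rule):

    // With the 'd' flag, exec() results carry an 'indices' array giving
    // [start, end) offsets of each capture group within the input.
    const heading = /^ {0,3}(#{1,6}) +(.*)/d;
    const m = heading.exec('  ## Hello');
    console.log(m.indices[2]); // [5, 10] -> offsets of the inline text "Hello"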

bartnv avatar Feb 25 '22 19:02 bartnv

@bartnv that would be great if you could create a PR. The one thing I would want to watch out for is bringing down the speed of marked. You can run npm run bench to run marked against commonmark.js and markdown-it. That runs the CommonMark specs against each 1000 times; currently marked is a bit behind the others because the spec contains mostly edge cases.

UziTech avatar Feb 25 '22 21:02 UziTech