marked Lexer handling newlines incorrectly in some cases


Open l3dotdev opened this issue 1 year ago • 1 comments

Marked version: 11.1.1

Describe the bug

When using the lexer, it seems to leave newlines at the end of some tokens instead of tokenizing them separately.

To Reproduce

Input (hr):

console.log(lexer.lex("---------------------------------\n\nhi"))

Output (hr):

[
    {
        "type": "hr",
        "raw": "---------------------------------\n\n"
    },
    {
        "type": "paragraph",
        "raw": "hi",
        "text": "hi",
        "tokens": [
            {
                "type": "text",
                "raw": "hi",
                "text": "hi"
            }
        ]
    }
]

Input (blockquote):

console.log(lexer.lex("> blockquote\n\nhi"))

Output (blockquote):

[
    {
        "type": "blockquote",
        "raw": "> blockquote\n\n",
        "tokens": [
            {
                "type": "paragraph",
                "raw": "blockquote",
                "text": "blockquote",
                "tokens": [
                    {
                        "type": "text",
                        "raw": "blockquote",
                        "text": "blockquote"
                    }
                ]
            }
        ],
        "text": "blockquote"
    },
    {
        "type": "paragraph",
        "raw": "hi",
        "text": "hi",
        "tokens": [
            {
                "type": "text",
                "raw": "hi",
                "text": "hi"
            }
        ]
    }
]

In both examples, the two newlines are not emitted as their own token; they stay at the end of the preceding token's raw value. This is with gfm: true and breaks: true.
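One way to see where the newlines end up is to concatenate the top-level raw values and compare the result with the input. The snippet below is a minimal sketch: the tokens are hardcoded from the hr output reported above so it runs without marked, and joinRaw is a hypothetical helper, not part of the library.

```javascript
// Top-level tokens copied from the hr output reported above (nested tokens omitted).
const hrLine = "---------------------------------";
const input = hrLine + "\n\nhi";
const tokens = [
  { type: "hr", raw: hrLine + "\n\n" },
  { type: "paragraph", raw: "hi", text: "hi" },
];

// joinRaw is a hypothetical helper: it concatenates the raw field
// of each top-level token in order.
function joinRaw(list) {
  return list.map((t) => t.raw).join("");
}

// The newlines are still present, attached to the hr token's raw.
console.log(joinRaw(tokens) === input); // true
```

For this input, no characters are dropped; the question in this issue is only which token they are attributed to.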

Expected behavior

For hr input:

[
    {
        "type": "hr",
        "raw": "---------------------------------"
    },
    {
        "type": "space",
        "raw": "\n\n"
    },
    {
        "type": "paragraph",
        "raw": "hi",
        "text": "hi",
        "tokens": [
            {
                "type": "text",
                "raw": "hi",
                "text": "hi"
            }
        ]
    }
]

For blockquote input:

[
    {
        "type": "blockquote",
        "raw": "> blockquote",
        "tokens": [
            {
                "type": "paragraph",
                "raw": "blockquote",
                "text": "blockquote",
                "tokens": [
                    {
                        "type": "text",
                        "raw": "blockquote",
                        "text": "blockquote"
                    }
                ]
            }
        ],
        "text": "blockquote"
    },
    {
        "type": "br",
        "raw": "\n"
    },
    {
        "type": "paragraph",
        "raw": "hi",
        "text": "hi",
        "tokens": [
            {
                "type": "text",
                "raw": "hi",
                "text": "hi"
            }
        ]
    }
]

Jan 18 '24 22:01 l3dotdev

The space token is used in places where it is needed. For example if two paragraphs are next to each other they become one paragraph token unless there is a blank line (space token) between them.

If you want to create a PR to add space tokens after each block token that would be fine, but I think it will be a breaking change.
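Until such a change exists, the same effect could be approximated in userland by post-processing the lexer output. The following is only a sketch under stated assumptions: splitTrailingNewlines is a hypothetical helper, not a marked API; it only looks at top-level tokens, and it treats every run of trailing newlines as a space token, which may not match what marked itself would emit.

```javascript
// splitTrailingNewlines is a hypothetical post-processing helper, not part of marked.
// It moves newlines at the end of a top-level token's raw into a separate "space" token.
function splitTrailingNewlines(tokens) {
  const out = [];
  for (const token of tokens) {
    const match = /\n+$/.exec(token.raw);
    // Keep space tokens, tokens without trailing newlines,
    // and tokens made only of newlines unchanged.
    if (token.type === "space" || !match || match.index === 0) {
      out.push(token);
      continue;
    }
    // Copy the token with the trailing newlines trimmed from raw,
    // then emit the trimmed newlines as their own token.
    out.push({ ...token, raw: token.raw.slice(0, match.index) });
    out.push({ type: "space", raw: match[0] });
  }
  return out;
}

// Usage with the hr tokens reported in this issue (nested tokens omitted).
const hrLine = "---------------------------------";
const result = splitTrailingNewlines([
  { type: "hr", raw: hrLine + "\n\n" },
  { type: "paragraph", raw: "hi", text: "hi" },
]);
console.log(result.map((t) => t.type).join(",")); // hr,space,paragraph
```

Because this runs after lexing, it does not change how marked parses the input, so it avoids the breaking change mentioned above at the cost of an extra pass over the tokens.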

Jan 19 '24 06:01 UziTech
