mdast-util-to-markdown icon indicating copy to clipboard operation
mdast-util-to-markdown copied to clipboard

Emojis can cause broken result

Open craftzdog opened this issue 1 year ago β€’ 12 comments

Initial checklist

Affected package

[email protected]

Steps to reproduce

const mdast = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {
          type: 'text',
          value: 'πŸ’‘',
          position: {
            start: {
              line: 8,
              column: 10,
              offset: 113
            },
            end: {
              line: 8,
              column: 12,
              offset: 115
            }
          }
        },
        {
          type: 'strong',
          children: [
            {
              type: 'text',
              value: ' Some tex',
              position: {
                start: {
                  line: 8,
                  column: 20,
                  offset: 123
                },
                end: {
                  line: 8,
                  column: 29,
                  offset: 132
                }
              }
            }
          ],
          position: {
            start: {
              line: 8,
              column: 12,
              offset: 115
            },
            end: {
              line: 8,
              column: 38,
              offset: 141
            }
          }
        },
        {
          type: 'text',
          value: 't',
          position: {
            start: {
              line: 8,
              column: 38,
              offset: 141
            },
            end: {
              line: 8,
              column: 39,
              offset: 142
            }
          }
        }
      ],
      position: {
        start: {
          line: 8,
          column: 7,
          offset: 110
        },
        end: {
          line: 8,
          column: 43,
          offset: 146
        }
      }
    }
  ],
  position: {
    start: {
      line: 1,
      column: 1,
      offset: 0
    },
    end: {
      line: 12,
      column: 1,
      offset: 176
    }
  }
}
const md = toMarkdown(mdast)
console.log('md:', md)

Actual behavior

��** Some tex**t

Expected behavior

πŸ’‘** Some tex**t

Runtime

[email protected]

Package manager

[email protected]

Operating system

macOS Sequoia 15.3.1

Build and bundle tools

No response

craftzdog avatar Mar 11 '25 02:03 craftzdog

In container-phrasing.js, it uses slice() many times, but it can break emojis. A workaround would be to use Array.from, something like:

> Array.from('πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦')
[
  'πŸ‘©', '‍',   'πŸ‘©',
  '‍',   'πŸ‘§', '‍',
  'πŸ‘¦'
]

> 'πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦'.split('')
[
  '\ud83d', '\udc69',
  '‍',       '\ud83d',
  '\udc69', '‍',
  '\ud83d', '\udc67',
  '‍',       '\ud83d',
  '\udc66'
]

> 'πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦'.slice(-1)
'\udc66'
> Array.from('πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦').slice(-1).toString()
'πŸ‘¦'
>

craftzdog avatar Mar 11 '25 02:03 craftzdog

At least your expected behavior is wrong also.

πŸ’‘** Some tex**t

πŸ’‘** Some tex**t

JounQin avatar Jul 04 '25 07:07 JounQin

The alternative is 💡** Some tex**t -> πŸ’‘** Some tex**t I believe. Not the β€œexpected behavior” listed above.

wooorm avatar Jul 04 '25 08:07 wooorm

The work to be done is related to https://github.com/syntax-tree/mdast-util-to-markdown/commit/97fb818123169d996f2afc79a4611bbd81d8f2e1, and indeed has to do with what β€œcharacters” are.

wooorm avatar Jul 04 '25 08:07 wooorm

Something like https://www.npmjs.com/package/unicode-substring could be used? Or we just made our own alternative.

JounQin avatar Jul 04 '25 08:07 JounQin

Maybe!

There is currently work happening in CommonMark to improve this (particularly about CJK, but I think it touches on surrogates like here too). So this is a bit undefined behavior right now and that spec + the new tests will probably affect this here. So, that may or may not mean something for encodeInfo use.

Then, there is encodeCharacterReference. Something like unicode-substring is probably needed there, though the code is very simple, so I think it is likely that those some algorithms would be inlined here!

wooorm avatar Jul 04 '25 18:07 wooorm

@craftzdog is this something you want to look at implementing?

wooorm avatar Jul 18 '25 16:07 wooorm

Hi! Thanks for looking into this. I got a bug report from my app user, where certain html data like this can break the app's behavior:

<!DOCTYPE html>
<html>
<head>
    <title>Reproduce inkdrop issue</title>
</head>
<body>
    <div>
        <p>
            πŸ’‘<strong> Some tex</strong>t
        </p>
    </div>
</body>
</html>

That's why I tested the mdast example above. Could you tell me what the expected behavior should be in this case? Maybe, πŸ’‘ **Some tex**t?

craftzdog avatar Jul 22 '25 06:07 craftzdog

The alternative is &#x1F4A1;** Some tex**t -> πŸ’‘** Some tex**t I believe. Not the β€œexpected behavior” listed above.

@craftzdog The expected behavior is already provided.

JounQin avatar Jul 22 '25 06:07 JounQin

ah got it. I see it produces fo&#x6F;**&#x20;Some tex**t from foo<strong> Some tex</strong>t. so, based on the current behavior, I guess &#x1F4A1;**&#x20;Some tex**t would be the expected behavior.

craftzdog avatar Jul 22 '25 06:07 craftzdog

it produces fo&#x6F;**&#x20;Some tex**t from foo\<strong\> Some tex\</strong\>t

Oh, sorry, my bad, I mean πŸ’‘** Some tex**t should be converted into &#x1F4A1;** Some tex**t, while both of them can not represent the original mdast.

There're two issues here, and I was thinking about the encoding issue only at that moment.

JounQin avatar Jul 22 '25 06:07 JounQin

I think encoding space between ** and S to &#x20; is necessary:

Image

Regarding the emoji encoding, it'd be nice to avoid the conversion for better readability in Markdown. I agree with your suggestion on using a library such as unicode-substring.

craftzdog avatar Jul 22 '25 07:07 craftzdog