mdast-util-to-markdown Emojis can cause broken result

Initial checklist

[x] I read the support docs
[x] I read the contributing guide
[x] I agree to follow the code of conduct
[x] I searched issues and discussions and couldn’t find anything (or linked relevant results below)

Affected package

Steps to reproduce

const mdast = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {
          type: 'text',
          value: '💡',
          position: {
            start: {
              line: 8,
              column: 10,
              offset: 113
            },
            end: {
              line: 8,
              column: 12,
              offset: 115
            }
          }
        },
        {
          type: 'strong',
          children: [
            {
              type: 'text',
              value: ' Some tex',
              position: {
                start: {
                  line: 8,
                  column: 20,
                  offset: 123
                },
                end: {
                  line: 8,
                  column: 29,
                  offset: 132
                }
              }
            }
          ],
          position: {
            start: {
              line: 8,
              column: 12,
              offset: 115
            },
            end: {
              line: 8,
              column: 38,
              offset: 141
            }
          }
        },
        {
          type: 'text',
          value: 't',
          position: {
            start: {
              line: 8,
              column: 38,
              offset: 141
            },
            end: {
              line: 8,
              column: 39,
              offset: 142
            }
          }
        }
      ],
      position: {
        start: {
          line: 8,
          column: 7,
          offset: 110
        },
        end: {
          line: 8,
          column: 43,
          offset: 146
        }
      }
    }
  ],
  position: {
    start: {
      line: 1,
      column: 1,
      offset: 0
    },
    end: {
      line: 12,
      column: 1,
      offset: 176
    }
  }
}
const md = toMarkdown(mdast)
console.log('md:', md)

Actual behavior

�&#xDCA1;**&#x20;Some tex**t

Expected behavior

💡** Some tex**t

Runtime

[email protected]

Package manager

[email protected]

Operating system

macOS Sequoia 15.3.1

Build and bundle tools

No response

Mar 11 '25 02:03 craftzdog

In container-phrasing.js, it uses slice() many times, but it can break emojis. A workaround would be to use Array.from, something like:

> Array.from('👩‍👩‍👧‍👦')
[
  '👩', '‍',   '👩',
  '‍',   '👧', '‍',
  '👦'
]

> '👩‍👩‍👧‍👦'.split('')
[
  '\ud83d', '\udc69',
  '‍',       '\ud83d',
  '\udc69', '‍',
  '\ud83d', '\udc67',
  '‍',       '\ud83d',
  '\udc66'
]

> '👩‍👩‍👧‍👦'.slice(-1)
'\udc66'
> Array.from('👩‍👩‍👧‍👦').slice(-1).toString()
'👦'
>

Mar 11 '25 02:03 craftzdog

At least your expected behavior is wrong also.

💡** Some tex**t

💡** Some tex**t

Jul 04 '25 07:07 JounQin

The alternative is 💡** Some tex**t -> 💡** Some tex**t I believe. Not the “expected behavior” listed above.

Jul 04 '25 08:07 wooorm

The work to be done is related to https://github.com/syntax-tree/mdast-util-to-markdown/commit/97fb818123169d996f2afc79a4611bbd81d8f2e1, and indeed has to do with what “characters” are.

Jul 04 '25 08:07 wooorm

Something like https://www.npmjs.com/package/unicode-substring could be used? Or we just made our own alternative.

Jul 04 '25 08:07 JounQin

Maybe!

There is currently work happening in CommonMark to improve this (particularly about CJK, but I think it touches on surrogates like here too). So this is a bit undefined behavior right now and that spec + the new tests will probably affect this here. So, that may or may not mean something for encodeInfo use.

Then, there is encodeCharacterReference. Something like unicode-substring is probably needed there, though the code is very simple, so I think it is likely that those some algorithms would be inlined here!

Jul 04 '25 18:07 wooorm

@craftzdog is this something you want to look at implementing?

Jul 18 '25 16:07 wooorm

Hi! Thanks for looking into this. I got a bug report from my app user, where certain html data like this can break the app's behavior:

<!DOCTYPE html>
<html>
<head>
    <title>Reproduce inkdrop issue</title>
</head>
<body>
    <div>
        <p>
            💡<strong> Some tex</strong>t
        </p>
    </div>
</body>
</html>

That's why I tested the mdast example above. Could you tell me what the expected behavior should be in this case? Maybe, 💡 **Some tex**t?

Jul 22 '25 06:07 craftzdog

The alternative is 💡** Some tex**t -> 💡** Some tex**t I believe. Not the “expected behavior” listed above.

@craftzdog The expected behavior is already provided.

Jul 22 '25 06:07 JounQin

ah got it. I see it produces foo** Some tex**t from foo<strong> Some tex</strong>t. so, based on the current behavior, I guess 💡** Some tex**t would be the expected behavior.

Jul 22 '25 06:07 craftzdog

it produces foo** Some tex**t from foo\<strong\> Some tex\</strong\>t

Oh, sorry, my bad, I mean 💡** Some tex**t should be converted into 💡** Some tex**t, while both of them can not represent the original mdast.

There're two issues here, and I was thinking about the encoding issue only at that moment.

Jul 22 '25 06:07 JounQin

I think encoding space between ** and S to   is necessary:

Regarding the emoji encoding, it'd be nice to avoid the conversion for better readability in Markdown. I agree with your suggestion on using a library such as unicode-substring.

Jul 22 '25 07:07 craftzdog