Emojis can cause a broken result
Initial checklist
- [x] I read the support docs
- [x] I read the contributing guide
- [x] I agree to follow the code of conduct
- [x] I searched issues and discussions and couldn't find anything (or linked relevant results below)
Affected package
Steps to reproduce
import {toMarkdown} from 'mdast-util-to-markdown'

const mdast = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {
          // Emoji directly before the strong node: a surrogate pair in UTF-16.
          type: 'text',
          value: '💡',
          position: {
            start: {line: 8, column: 10, offset: 113},
            end: {line: 8, column: 12, offset: 115}
          }
        },
        {
          type: 'strong',
          children: [
            {
              // Note the leading space inside the strong node.
              type: 'text',
              value: ' Some tex',
              position: {
                start: {line: 8, column: 20, offset: 123},
                end: {line: 8, column: 29, offset: 132}
              }
            }
          ],
          position: {
            start: {line: 8, column: 12, offset: 115},
            end: {line: 8, column: 38, offset: 141}
          }
        },
        {
          type: 'text',
          value: 't',
          position: {
            start: {line: 8, column: 38, offset: 141},
            end: {line: 8, column: 39, offset: 142}
          }
        }
      ],
      position: {
        start: {line: 8, column: 7, offset: 110},
        end: {line: 8, column: 43, offset: 146}
      }
    }
  ],
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 12, column: 1, offset: 176}
  }
}

const md = toMarkdown(mdast)
console.log('md:', md)
Actual behavior
��** Some tex**t
Expected behavior
💡** Some tex**t
Runtime
Package manager
Operating system
macOS Sequoia 15.3.1
Build and bundle tools
No response
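container-phrasing.js calls slice() in several places, and slice() works on UTF-16 code units, so it can cut an emoji in the middle of a surrogate pair.
A workaround would be to use Array.from, which iterates by code point, something like: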
> Array.from('👩‍👩‍👧‍👦')
[
  '👩', '‍', '👩',
  '‍', '👧', '‍',
  '👦'
]
> '👩‍👩‍👧‍👦'.split('')
[
  '\ud83d', '\udc69',
  '‍',      '\ud83d',
  '\udc69', '‍',
  '\ud83d', '\udc67',
  '‍',      '\ud83d',
  '\udc66'
]
> '👩‍👩‍👧‍👦'.slice(-1)
'\udc66'
> Array.from('👩‍👩‍👧‍👦').slice(-1).toString()
'👦'
>
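The REPL session above allocates a full array just to look at one character. As a rough sketch of an alternative, the helpers below peek at the first or last code point directly by checking for a surrogate pair. The names `firstCodePoint` and `lastCodePoint` are hypothetical and are not part of mdast-util-to-markdown; like Array.from, they keep surrogate pairs together but still treat each part of a ZWJ sequence as a separate character.

```javascript
// Hypothetical helpers (not from mdast-util-to-markdown): read the first or
// last code point of a string without splitting a surrogate pair.

function firstCodePoint(value) {
  if (value.length === 0) return ''
  // codePointAt combines a leading and trailing surrogate into one code point.
  return String.fromCodePoint(value.codePointAt(0))
}

function lastCodePoint(value) {
  if (value.length === 0) return ''
  const last = value.charCodeAt(value.length - 1)
  // A trailing (low) surrogate preceded by a leading (high) surrogate
  // means the final character occupies two UTF-16 code units.
  if (last >= 0xdc00 && last <= 0xdfff && value.length > 1) {
    const previous = value.charCodeAt(value.length - 2)
    if (previous >= 0xd800 && previous <= 0xdbff) return value.slice(-2)
  }
  return value.slice(-1)
}

console.log(lastCodePoint('text 💡') === '💡') // true
console.log('text 💡'.slice(-1) === '💡') // false: only the low surrogate
```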
Your expected behavior is also wrong, though:
💡** Some tex**t
The alternative is &#x1F4A1;** Some tex**t -> 💡** Some tex**t I believe. Not the "expected behavior" listed above.
The work to be done is related to https://github.com/syntax-tree/mdast-util-to-markdown/commit/97fb818123169d996f2afc79a4611bbd81d8f2e1, and indeed has to do with what "characters" are.
Something like https://www.npmjs.com/package/unicode-substring could be used? Or we could write our own alternative.
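For a sense of scale, a home-grown alternative could be as small as the sketch below. `sliceCodePoints` is a hypothetical name, and this is not how unicode-substring is implemented; it only shows the idea of indexing by code point instead of by UTF-16 code unit.

```javascript
// Hypothetical helper: like String.prototype.slice, but `start` and `end`
// count code points rather than UTF-16 code units.
function sliceCodePoints(value, start, end) {
  // Array.from uses the string iterator, which yields whole code points.
  return Array.from(value).slice(start, end).join('')
}

console.log(sliceCodePoints('a💡b', 1, 2) === '💡') // true
console.log('a💡b'.slice(1, 2) === '\ud83d') // true: a lone high surrogate
```

This allocates an array per call, so hot paths might prefer checking surrogate ranges directly.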
Maybe!
There is currently work happening in CommonMark to improve this (particularly about CJK, but I think it touches on surrogates like here too). So this is somewhat undefined behavior right now, and that spec plus the new tests will probably affect what happens here.
So, that may or may not mean something for encodeInfo use.
Then, there is encodeCharacterReference. Something like unicode-substring is probably needed there, though the code is very simple, so I think it is likely that those same algorithms would be inlined here!
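To illustrate why the choice of "character" matters when building a character reference, here is a minimal sketch. `encodeFirstCodePoint` and `encodeFirstCodeUnit` are hypothetical names written for this comparison; they are not the library's encodeCharacterReference, and the hexadecimal output format is an assumption.

```javascript
// Hypothetical sketch: encode the first character of `value` as a
// hexadecimal character reference, by code point and by code unit.

function encodeFirstCodePoint(value) {
  if (value.length === 0) throw new Error('Expected a non-empty string')
  // codePointAt reads a full surrogate pair as one code point.
  return '&#x' + value.codePointAt(0).toString(16).toUpperCase() + ';'
}

function encodeFirstCodeUnit(value) {
  if (value.length === 0) throw new Error('Expected a non-empty string')
  // charCodeAt reads only one UTF-16 code unit, so an emoji yields a
  // reference to a lone surrogate, which is not a valid character.
  return '&#x' + value.charCodeAt(0).toString(16).toUpperCase() + ';'
}

console.log(encodeFirstCodePoint('💡')) // &#x1F4A1;
console.log(encodeFirstCodeUnit('💡')) // &#xD83D;
```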
@craftzdog is this something you want to look at implementing?
Hi! Thanks for looking into this. I got a bug report from a user of my app, where certain HTML data like this can break the app's behavior:
<!DOCTYPE html>
<html>
  <head>
    <title>Reproduce inkdrop issue</title>
  </head>
  <body>
    <div>
      <p>
        💡<strong> Some tex</strong>t
      </p>
    </div>
  </body>
</html>
That's why I tested the mdast example above.
Could you tell me what the expected behavior should be in this case?
Maybe 💡 **Some tex**t?
The alternative is &#x1F4A1;** Some tex**t -> 💡** Some tex**t I believe. Not the "expected behavior" listed above.
@craftzdog The expected behavior is already provided.
Ah, got it. I see it produces foo** Some tex**t from foo<strong> Some tex</strong>t.
So, based on the current behavior, I guess &#x1F4A1;** Some tex**t would be the expected behavior.
it produces foo** Some tex**t from foo<strong> Some tex</strong>t
Oh, sorry, my bad. I mean 💡** Some tex**t should be converted into &#x1F4A1;** Some tex**t, although neither of them can represent the original mdast.
There are two issues here, and at that moment I was only thinking about the encoding issue.
I think encoding the space between ** and S as &#x20; is necessary:
Regarding the emoji encoding, it'd be nice to avoid the conversion for better readability in Markdown.
I agree with your suggestion on using a library such as unicode-substring.