stdlib
stdlib copied to clipboard
Algorithm to reverse a string should account for extended grapheme clusters
Checklist
Please ensure the following tasks are completed before filing a bug report.
- [x] Read and understood the Code of Conduct.
- [x] Searched for existing issues and pull requests.
Description
Description of the issue.
Encountered an error when attempting to use @stdlib/string/reverse
to reverse a string which includes extended grapheme clusters. While the implementation does account for Unicode surrogate pairs and combining marks, extended grapheme clusters do not appear supported.
Addressing this issue may require using a Unicode segmentation algorithm, as discussed in the Unicode standard. One can find reference implementations in Rust and elsewhere.
Related Issues
Does this issue have any related issues?
Related issues #295.
Questions
Any questions for reviewers?
No.
Other
Any other information relevant to this issue? This may include screenshots, references, stack traces, sample output, and/or implementation notes.
N/A
Demo
If relevant, provide a link to a live demo.
N/A
Reproduction
What steps are required to reproduce the unexpected output?
In order to reproduce this bug, do the following:
- attempt to reverse a string which includes extended grapheme clusters.
Expected Results
What are the expected results?
The reversed string to account for extended grapheme clusters.
Actual Results
What are the actual results?
N/A
Environments
What environments are affected (e.g.,
Node v0.4.x
,Chrome
,IE 11
)? If Node.js, include thenpm
version, operating system, and any other potentially relevant platform information.
The following environments are affected:
- all
Try something like this:
function reverse(str) {
var result = "";
var smx = str.length;
var sdx = 0;
/*
Can't start at the end because each segment depends on the
previous one. So build up a list of lengths first instead.
*/
var counts = [];
for (;;) {
var skip = str.codePointAt(sdx) <= 0xffff ? 1 : 2;
counts.push(skip);
sdx += skip;
if (sdx >= smx) break;
}
var cmx = counts.length;
while (cmx--) {
var skip = counts[cmx];
/*
Now we know how much to hop backward for the next grapheme.
Write the first segment, then the second (if present).
*/
var hop = skip;
while (hop) result += str[smx - hop--];
/*
Move on to the next glyph. (Bounds-checking won't be needed
here since the counts array holds an accurate list of widths.)
*/
smx -= skip;
}
return result;
}
@gardhr Thanks for the suggestion! Feel free to submit a PR along with tests. :)
Not sure how well your proposal accounts for all graphemes, as it has been a while since I last looked into it.
Sorry, I just don't have the time at the moment. It has worked well for me over the years, I would be rather surprised if it choked on anything at all. If it does however, just give me a set of failing samples and I'll be glad to fix it.
Wish I could do more to help. Maybe once I get passed some of this backlog?
var print = console.log;
var text =
"外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲\u{2F804}";
var reversed = reverse(text);
print(text);
print();
print(reversed);
print();
print("Test:", reverse(reversed) == text ? "Pass" : "FAIL!");
Output:
外国語の学習と教授Language Learning and TeachingИзучение и обучение иностранных языков語文教學・语文教学Enseñanza y estudio de idiomasИзучаване и Преподаване на Чужди Езициქართული ენის შესწავლა და სწავლება'læŋɡwidʒ 'lɘr:niŋ ænd 'ti:tʃiŋLus kawm thaib qhiaNgôn Ngữ, Sự học,말배우기와 가르치기Nauka języków obcychΓλωσσική Εκμὰθηση και Διδασκαλίαเรียนและสอนภาษา語文教學 😃, ქართული 😭, Εκμὰθηση 😈 иностранных𠀋𠀾𠁎𠁨☃★♲你
你♲★☃𠁨𠁎𠀾𠀋хыннартсони 😈 ησηθὰμκΕ ,😭 ილუთრაქ ,😃 學教文語าษาภนอสะลแนยีรเαίλακσαδιΔ ιακ ησηθὰμκΕ ήκισσωλΓhcycbo wókyzęj akuaN기치르가 와기우배말,cọh ựS ,ữgN nôgNaihq biaht mwak suLŋiʃt:it' dnæ ŋin:rɘl' ʒdiwɡŋæl'აბელვაწს ად ალვაწსეშ სინე ილუთრაქицизЕ иджуЧ ан енавадоперП и енавачузИsamoidi ed oidutse y aznañesnE学教文语・學教文語вокызя хыннартсони еинечубо и еинечузИgnihcaeT dna gninraeL egaugnaL授教と習学の語国外
Test: Pass
@gardhr Thanks for the demo!
Yep, you're welcome.
This was resolved in https://github.com/stdlib-js/stdlib/pull/1082.