stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

[RFC]: Hamming distance between two strings

Open Planeshifter opened this issue 1 year ago • 11 comments

Description

This RFC proposes adding a function to calculate the Hamming distance between two strings.

The function should have the following signature (a: string, b: string): number.

The function should take two strings as arguments and return the Hamming distance between them. The Hamming distance is defined as the number of characters that have to be changed to convert one string to the other. Since it only allows substitutions, it can only be used to compare strings of the same length.

Additionally, in order to account for code points and grapheme clusters, we should add separate packages for dealing with each, as the underlying algorithms are likely to differ. We then can provide a more general API which unifies the underlying algorithms. Accordingly, we should create the following packages:

  • [x] @stdlib/string/base/distances/hamming: compares UTF-16 code units.
    • PR: https://github.com/stdlib-js/stdlib/pull/1166
  • [ ] @stdlib/string/base/distances/hamming-code-points: compares Unicode code points.
  • [ ] @stdlib/string/base/distances/hamming-grapheme-clusters: compares grapheme clusters (i.e., visual characters)

Once the above are completed, we can add

  • [ ] @stdlib/string/distances/hamming: unifies the above "base" packages and provides an option for specifying the computation "mode" (i.e., code_units, code_points, or grapheme_clusters, with grapheme_clusters being the default).

Related Issues

Related issues https://github.com/stdlib-js/stdlib/issues/151.

Questions

No.

Other

No.

Checklist

  • [X] I have read and understood the Code of Conduct.
  • [X] Searched for existing issues and pull requests.
  • [X] The issue name begins with RFC:.

Planeshifter avatar Feb 01 '23 22:02 Planeshifter

@Planeshifter i would like to work on this issue. would you guide me more how to work on this?

Ahtaxam avatar Apr 30 '23 03:04 Ahtaxam

@Ahtaxam Are you still interested in working on this?

kgryte avatar Jul 15 '23 00:07 kgryte

Hi, I would like to work on this Can I be assgined this

marsian83 avatar Feb 29 '24 17:02 marsian83

@marsian83 Sure. Would you be open to working on @stdlib/string/base/distances/hamming-code-points?

kgryte avatar Feb 29 '24 20:02 kgryte

Note that the UTF-16 code unit implementation has already been added. The code points implementation should iterate over Unicode code points (e.g., most Unicode characters and some emoji). You can find other code points implementation in the string/base/* namespace.

kgryte avatar Feb 29 '24 20:02 kgryte

Note that the UTF-16 code unit implementation has already been added. The code points implementation should iterate over Unicode code points (e.g., most Unicode characters and some emoji). You can find other code points implementation in the string/base/* namespace.

Sure, I'll see how these work. Thanks for the reference

marsian83 avatar Mar 01 '24 04:03 marsian83

@kgryte Hello ! Can I work on the grapheme clusters part ?

mayankkamboj47 avatar Mar 02 '24 11:03 mayankkamboj47

@mayankkamboj47 Sure. You'll probably want to rely on next-grapheme-cluster-break in order to derive string indices for the purpose of comparing substrings.

kgryte avatar Mar 02 '24 11:03 kgryte

@kgryte @Planeshifter If noone is working on @stdlib/string/base/distances/hamming-code-points, can i work on it?

Pratik772846 avatar Mar 05 '24 15:03 Pratik772846

@kgryte I created a pull request for Hamming distance between Unicode characters (#1948 ). Can you please review it. Looking forward for your reply.

JayPokale avatar Mar 19 '24 19:03 JayPokale

Hello there, can someone please review PR #1948 on @stdlib/string/base/distances/hamming-code-points.

JayPokale avatar Mar 27 '24 03:03 JayPokale