Implement Optimal String Alignment (OSA) Distance Algorithm
Summary
This pull request introduces an implementation of the Optimal String Alignment (OSA) Distance algorithm. This string metric is used to measure the difference between two sequences (typically strings) by calculating the minimum number of operations required to transform one string into another. The operations considered include insertion, deletion, substitution, and the transposition of two adjacent characters.
What is the Optimal String Alignment (OSA) Distance?
The Optimal String Alignment (OSA) distance, also referred to as the restricted Damerau-Levenshtein distance, is a variation of the classic Levenshtein distance with an additional operation—transposition of adjacent characters. This metric is particularly useful in scenarios where such transpositions are common, such as typographical errors or spelling mistakes.
Key Operations:
- Insertion: Add a character to the string.
- Deletion: Remove a character from the string.
- Substitution: Replace one character with another.
- Transposition: Swap two adjacent characters.
Difference from Damerau-Levenshtein Distance:
While the OSA distance allows for transpositions like the general Damerau-Levenshtein distance, it differs in that it restricts the transposition to be a single operation, ensuring that the same characters are not involved in multiple operations in the same position. This makes OSA more suitable for applications where such operations are expected to be simple, like correcting minor spelling errors.
How It Works
The algorithm uses dynamic programming to compute the distance. The main idea is to build a matrix where each cell (i, j) represents the OSA distance between the first i characters of string s1 and the first j characters of string s2.
The algorithm proceeds as follows:
-
Initialization:
- The matrix is initialized with base cases where one string is empty.
-
Filling the Matrix:
- For each character in
s1ands2, calculate the cost of insertion, deletion, substitution, and transposition. - The value at each matrix cell is the minimum of these costs.
- For each character in
-
Final Output:
- The OSA distance is found in the cell corresponding to the comparison of the entire lengths of
s1ands2.
- The OSA distance is found in the cell corresponding to the comparison of the entire lengths of
Example
Consider two strings: "example" and "exmaple".
The OSA distance between these two strings is 1 because you can transform "exmaple" into "example" by a single transposition of the characters 'm' and 'a'.
Another example:
int distance = OptimalStringAlignmentDistance("kitten", "sitting");
// Output: 3
Here, the distance is 3 due to the following operations:
- Substitution of 'k' with 's'.
- Substitution of 'e' with 'i'.
- Insertion of 'g' at the end.
Motivation
Why Use OSA Distance?
The OSA distance is particularly advantageous in applications where adjacent character transpositions are common. This is typically the case in the following scenarios:
- Typographical Error Detection: When users accidentally swap two adjacent characters while typing.
- Spelling Correction: In applications like search engines and text editors where suggestions for misspelled words are needed.
- Natural Language Processing (NLP): In tasks like fuzzy string matching where small errors in word forms are common and should be efficiently handled.
Time Complexity
The time complexity of the OSA distance algorithm is O(n * m), where n is the length of the first string and m is the length of the second string. This makes it efficient for moderate-length strings but may become computationally expensive for very long strings. However, this complexity is comparable to other similar algorithms, such as Levenshtein and Damerau-Levenshtein, making OSA a practical choice for many real-world applications.
- [x] I have performed a self-review of my code
- [x] My code follows the style guidelines of this project
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] Comments in areas I changed are up to date
- [x] I have added comments to hard-to-understand areas of my code
- [x] I have made corresponding changes to the README.md
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 95.04%. Comparing base (
6b37d04) to head (c4e0e1a). Report is 1 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #464 +/- ##
==========================================
+ Coverage 94.93% 95.04% +0.10%
==========================================
Files 240 241 +1
Lines 10156 10213 +57
Branches 1441 1450 +9
==========================================
+ Hits 9642 9707 +65
+ Misses 395 389 -6
+ Partials 119 117 -2
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I’m a bit confused about the Codecov report here. Even though this PR shows 100% coverage for the changes made, there are two indirect changes that are pulling down the overall coverage, which is blocking the PR.
I think this might be related to the fact that I refreshed my branch with the latest master, but I’m not entirely sure. What do you think about ignoring the Codecov warning for now, lowering the overall coverage, and opening an issue to address the coverage for the two classes that are not hitting the mark? This seems like the best route, though there’s a chance the fix might not happen.
Alternatively, I could add some unit tests to cover those two classes in this PR, but that would expand the scope quite a bit.
Let me know your thoughts!
I have opened an issue, and will try to fix the coverage in HashMap and TimSort, so let's keep this PR open for now, until I have a PR that closes #465