ceja
ceja copied to clipboard
PySpark phonetic and string matching algorithms
ceja
PySpark phonetic, stemming, and string matching algorithms. Use the power of PySpark to run these algos on massive datasets!
Installation and basic usage
Run pip install ceja
to install the library.
Import the functions with import ceja
. After importing the code you can run functions like ceja.nysiis
, ceja.jaro_winkler_similarity
, etc.
Public interface summary
- Phonetic algorithms
- nysiis
- metaphone
- match_rating_codex
- Stemming
- porter_stem
- String similarity
- damerau_levenshtein_distance
- hamming_distance
- jaro_similarity
- jaro_winkler_similarity
- match_rating_comparison
Phonetic algorithms
NYSIIS
data = [
("jellyfish",),
("li",),
("luisa",),
(None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_nysiis", ceja.nysiis(col("word")))
actual_df.show()
+---------+-----------+
| word|word_nysiis|
+---------+-----------+
|jellyfish| JALYF|
| li| L|
| luisa| LAS|
| null| null|
+---------+-----------+
Metaphone
data = [
("jellyfish",),
("li",),
("luisa",),
("Klumpz",),
("Clumps",),
(None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_metaphone", ceja.metaphone(col("word")))
actual_df.show()
+---------+--------------+
| word|word_metaphone|
+---------+--------------+
|jellyfish| JLFX|
| li| L|
| luisa| LS|
| Klumpz| KLMPS|
| Clumps| KLMPS|
| null| null|
+---------+--------------+
Match rating codex
data = [
("jellyfish",),
("li",),
("luisa",),
(None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_match_rating_codex", ceja.match_rating_codex(col("word")))
actual_df.show()
+---------+-----------------------+
| word|word_match_rating_codex|
+---------+-----------------------+
|jellyfish| JLYFSH|
| li| L|
| luisa| LS|
| null| null|
+---------+-----------------------+
Stemming algorithms
Porter stem
data = [
("chocolates",),
("chocolatey",),
("choco",),
(None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_porter_stem", ceja.porter_stem(col("word")))
actual_df.show()
+----------+----------------+
| word|word_porter_stem|
+----------+----------------+
|chocolates| chocol|
|chocolatey| chocolatei|
| choco| choco|
| null| null|
+----------+----------------+
Similarity algorithms
Damerau Levenshtein Distance
data = [
("jellyfish", "smellyfish"),
("li", "lee"),
("luisa", "bruna"),
(None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("damerau_levenshtein_distance", ceja.damerau_levenshtein_distance(col("word1"), col("word2")))
actual_df.show()
+---------+----------+----------------------------+
| word1| word2|damerau_levenshtein_distance|
+---------+----------+----------------------------+
|jellyfish|smellyfish| 2|
| li| lee| 2|
| luisa| bruna| 4|
| null| null| null|
+---------+----------+----------------------------+
Hamming distance
data = [
("jellyfish", "smellyfish"),
("li", "lee"),
("luisa", "bruna"),
(None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("hamming_distance", ceja.hamming_distance(col("word1"), col("word2")))
print("\nHamming distance")
actual_df.show()
+---------+----------+----------------+
| word1| word2|hamming_distance|
+---------+----------+----------------+
|jellyfish|smellyfish| 9|
| li| lee| 2|
| luisa| bruna| 4|
| null| null| null|
+---------+----------+----------------+
Jaro similarity
data = [
("jellyfish", "smellyfish"),
("li", "lee"),
("luisa", "bruna"),
("hi", "colombia"),
(None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_similarity", ceja.jaro_similarity(col("word1"), col("word2")))
actual_df.show()
+---------+----------+---------------+
| word1| word2|jaro_similarity|
+---------+----------+---------------+
|jellyfish|smellyfish| 0.8962963|
| li| lee| 0.6111111|
| luisa| bruna| 0.6|
| hi| colombia| 0.0|
| null| null| null|
+---------+----------+---------------+
Jaro Winkler similarity
data = [
("jellyfish", "smellyfish"),
("li", "lee"),
("luisa", "bruna"),
(None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(col("word1"), col("word2")))
actual_df.show()
+---------+----------+-----------------------+
| word1| word2|jaro_winkler_similarity|
+---------+----------+-----------------------+
|jellyfish|smellyfish| 0.8962963|
| li| lee| 0.6111111|
| luisa| bruna| 0.6|
| null| null| null|
+---------+----------+-----------------------+
Match rating comparison
data = [
("mat", "matt"),
("there", "their"),
("luisa", "bruna"),
(None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("match_rating_comparison", ceja.match_rating_comparison(col("word1"), col("word2")))
actual_df.show()
+-----+-----+-----------------------+
|word1|word2|match_rating_comparison|
+-----+-----+-----------------------+
| mat| matt| true|
|there|their| true|
|luisa|bruna| false|
| null| null| null|
+-----+-----+-----------------------+
Contributing
Contributions are welcome and encouraged. Feel free to open issues or send pull requests.
If you make a lot of good contributions, you'll be granted push access to the repo.
The best contributions to make would be implementing these functions as Spark native functions.