Unicode Normalize with Python
Problem description
In pandas, there is a method (Series.str.normalize) that can normalize text data using the NFC, NFD, NFKC, and NFKD forms. I have looked in the Python documentation for Polars, but it seems the Polars team hasn't implemented this yet.
This feature would be nice to have in an upcoming Polars release!
Thanks!
The following Rust crate could eventually be used to implement it: https://docs.rs/unicode-normalization/latest/unicode_normalization/
It is easy to implement yourself with an apply (which is basically what pandas does):
import polars as pl
import unicodedata
In [72]: a = pl.Series("a", ["\u00C70123456", "\u00C70123456", "\u00C70123456"])
In [73]: a
Out[73]:
shape: (3,)
Series: 'a' [str]
[
"Ç0123456"
"Ç0123456"
"Ç0123456"
]
In [74]: b = a.apply(lambda x: unicodedata.normalize("NFD", x))
In [75]: b
Out[75]:
shape: (3,)
Series: 'a' [str]
[
"Ç0123456"
"Ç0123456"
"Ç0123456"
]
# Length of first string in the series, before and after normalization.
In [76]: len(a[0])
Out[76]: 8
In [77]: len(b[0])
Out[77]: 9
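The two strings print identically because NFD splits "Ç" into "C" plus a combining cedilla, which is why the length goes from 8 to 9. As a minimal sketch using only the standard library, the same string can be compared under all four forms mentioned in the issue. The helper name `normalized_lengths` is made up for illustration and is not part of Polars or pandas.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalized_lengths(text: str) -> dict[str, int]:
    """Return the number of code points in `text` under each normalization form."""
    return {form: len(unicodedata.normalize(form, text)) for form in FORMS}

# Same string as in the session above: U+00C7 followed by seven ASCII digits.
sample = "\u00C70123456"
lengths = normalized_lengths(sample)
for form, n in lengths.items():
    print(form, n)
```

The composed forms (NFC, NFKC) keep "Ç" as a single code point, while the decomposed forms (NFD, NFKD) store it as two.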
Sorry for the late reply. I wonder whether the apply function is any faster than pandas itself? Thanks.
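No benchmark is given in this thread, and since the apply approach calls unicodedata.normalize once per element (basically what pandas does, as noted above), a large speed difference should not be assumed. One way to cut the number of Python calls, sketched here with only the standard library, is to normalize each distinct string once and reuse the result. The helper name `normalize_unique` is hypothetical, and whether this helps depends on how many repeated values the data contains.

```python
import unicodedata

def normalize_unique(values, form="NFC"):
    """Normalize each distinct string once, then map the results back in order."""
    cache = {}
    out = []
    for v in values:
        if v not in cache:
            # Only the first occurrence of a value pays for the normalize call.
            cache[v] = unicodedata.normalize(form, v)
        out.append(cache[v])
    return out

# Same data as the Series in the example above: three identical strings.
data = ["\u00C70123456"] * 3
result = normalize_unique(data, "NFD")
```

With three identical inputs, unicodedata.normalize runs once instead of three times; the returned list could then be passed to pl.Series to build the normalized column.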