polars Unicode Normalize with Python

Open Hyprnx opened this issue 2 years ago • 3 comments

Problem description

In Pandas, there is a method that can normalize text data to the NFC, NFD, NFKC, and NFKD forms. I have looked through the Python documentation of Polars, but it seems the Polars team hasn't implemented this yet.

This feature would be nice to have in an upcoming Polars update!

Thanks!

Dec 13 '22 11:12 Hyprnx

The following Rust crate could eventually be used to implement it: https://docs.rs/unicode-normalization/latest/unicode_normalization/

Dec 16 '22 13:12 ghuls

It is easy to implement yourself with an apply (basically what pandas does):

import polars as pl
import unicodedata

In [72]: a = pl.Series("a", ["\u00C70123456", "\u00C70123456", "\u00C70123456"])

In [73]: a
Out[73]: 
shape: (3,)
Series: 'a' [str]
[
	"Ç0123456"
	"Ç0123456"
	"Ç0123456"
]

In [74]: b = a.apply(lambda x: unicodedata.normalize("NFD", x))

In [75]: b
Out[75]: 
shape: (3,)
Series: 'a' [str]
[
	"Ç0123456"
	"Ç0123456"
	"Ç0123456"
]


# Length of first string in the series, before and after normalization.
In [76]: len(a[0])
Out[76]: 8

In [77]: len(b[0])
Out[77]: 9
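
The two series print identically but differ in length because NFD splits the precomposed "Ç" into a base letter plus a combining mark. A minimal standard-library sketch of that step, independent of Polars (the variable names are illustrative only):

```python
import unicodedata

# "Ç" as a single precomposed code point (U+00C7), followed by seven digits.
composed = "\u00C70123456"
decomposed = unicodedata.normalize("NFD", composed)

# NFD replaces U+00C7 with "C" (U+0043) plus COMBINING CEDILLA (U+0327),
# so the decomposed string is one code point longer.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed[:2]]
print(len(composed), len(decomposed), code_points)

# NFC recomposes the pair, so the round trip restores the original string.
round_trip = unicodedata.normalize("NFC", decomposed)
print(round_trip == composed)
```

Both forms render as the same glyph, which is why the printed series look the same.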

Dec 16 '22 14:12 ghuls

Sorry for the late reply. I wonder whether the apply function is any faster than doing this in pandas itself? Thanks
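
On the speed question: the apply shown above calls a Python function once per element, so the cost is probably dominated by Python rather than by the dataframe library. One possible mitigation, sketched here under the assumption that the column contains many repeated strings, is to cache results so each distinct value is normalized only once. The helper name `normalize_nfd` is hypothetical and nothing here has been benchmarked:

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize_nfd(text: str) -> str:
    """Normalize one string to NFD, reusing the result for repeated inputs."""
    return unicodedata.normalize("NFD", text)


# Same data as the example series above: three identical strings.
values = ["\u00C70123456"] * 3
normalized = [normalize_nfd(v) for v in values]

# Only the first call does real work; the other two are cache hits.
info = normalize_nfd.cache_info()
print(info.misses, info.hits)
```

A function like this could be passed to apply in place of the lambda. Whether it helps in practice depends on how many duplicates the data has.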

Dec 29 '22 12:12 Hyprnx
