
Unicode Normalize with Python

Status: Open. Hyprnx opened this issue 2 years ago • 3 comments

Problem description

In Pandas, there is a method that can normalize text data to the NFC, NFD, NFKC and NFKD forms. I have looked in the Python documentation of Polars, but it seems the Polars team hasn't implemented this yet.

This feature would be nice to have in an upcoming Polars release!

Thanks!

Hyprnx avatar Dec 13 '22 11:12 Hyprnx

The following Rust crate could eventually be used to implement it: https://docs.rs/unicode-normalization/latest/unicode_normalization/

ghuls avatar Dec 16 '22 13:12 ghuls

It is easy to implement yourself with an apply (which is basically what pandas does):

import polars as pl
import unicodedata

In [72]: a = pl.Series("a", ["\u00C70123456", "\u00C70123456", "\u00C70123456"])

In [73]: a
Out[73]: 
shape: (3,)
Series: 'a' [str]
[
	"Ç0123456"
	"Ç0123456"
	"Ç0123456"
]

In [74]: b = a.apply(lambda x: unicodedata.normalize("NFD", x))

In [75]: b
Out[75]: 
shape: (3,)
Series: 'a' [str]
[
	"Ç0123456"
	"Ç0123456"
	"Ç0123456"
]


# Length of first string in the series, before and after normalization.
In [76]: len(a[0])
Out[76]: 8

In [77]: len(b[0])
Out[77]: 9
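
The length change above comes from NFD splitting the precomposed "Ç" (U+00C7) into "C" plus a combining cedilla (U+0327). A minimal sketch using only the standard library compares all four forms on the same string. The `normalize_all` helper is a hypothetical name for illustration, not part of Polars or pandas.

```python
import unicodedata

# The four normalization forms named in the issue.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(values, form="NFC"):
    """Hypothetical helper: normalize each string, passing None through unchanged."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    return [None if v is None else unicodedata.normalize(form, v) for v in values]


# Same sample string as in the session above: precomposed "Ç" followed by digits.
sample = "\u00C70123456"

# Length of the sample after each normalization form.
lengths = {form: len(unicodedata.normalize(form, sample)) for form in FORMS}
print(lengths)  # {'NFC': 8, 'NFD': 9, 'NFKC': 8, 'NFKD': 9}

# The compatibility forms additionally fold characters such as the "ﬁ" ligature.
ligature_folded = unicodedata.normalize("NFKC", "\uFB01")
print(ligature_folded)  # fi
```

The same `unicodedata.normalize` call is what the lambda passed to `apply` runs for each element. As far as I know, newer Polars releases renamed `apply` to `map_elements`, so the method name may need adjusting depending on the installed version.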

ghuls avatar Dec 16 '22 14:12 ghuls

Sorry for the late reply. I wonder whether the apply approach is any faster than doing the same thing in pandas itself? Thanks

Hyprnx avatar Dec 29 '22 12:12 Hyprnx