bistring
Transliterate
I was hoping you might advise me on how to incorporate transliteration into a text transformation pipeline.
Let's say I want to use a third-party library such as unidecode (from unidecode import unidecode).
I could create a bistring with new_bistr = bistr(text.modified, unidecode(text.modified)), but I would lose all the previous operations.
Is there a way to fold in a modified string that is calculated outside bistring's capabilities?
In general, no. You could use something like bistr.infer(text, unidecode(text)) to have it guess the alignment.
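To make "guessing" concrete, here is a rough, standard-library-only illustration of recovering an alignment from nothing but the two strings. It uses difflib rather than bistring's own inference, so it is not what bistr.infer does internally, and ascii_fold is a hypothetical stand-in for unidecode.

```python
import difflib
import unicodedata


def ascii_fold(s: str) -> str:
    """Hypothetical stand-in for unidecode: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def guess_alignment(original: str, modified: str):
    """Guess ((orig_start, orig_end), (mod_start, mod_end)) pairs from the strings alone."""
    matcher = difflib.SequenceMatcher(a=original, b=modified, autojunk=False)
    return [((i1, i2), (j1, j2)) for _tag, i1, i2, j1, j2 in matcher.get_opcodes()]


original = "caf\u00e9 au lait"
modified = ascii_fold(original)
pairs = guess_alignment(original, modified)
for orig_span, mod_span in pairs:
    print(orig_span, "->", mod_span)
```

Because the alignment is reconstructed after the fact, it can be wrong when the same characters repeat or when the transformation reorders text, which is why tracking the replacements as they happen is preferable when possible.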
In your case, you could do a little better since the transliteration process probably operates character-by-character. Something like:
from bistring import BistrBuilder, CharacterTokenizer
from unidecode import unidecode

tokenizer = CharacterTokenizer('und')  # or 'en-US', etc.
builder = BistrBuilder(text)
for token in tokenizer.tokenize(text):
    # Replace each character with its transliteration, keeping the alignment
    builder.replace(token.end - token.start, unidecode(token.modified))
text = builder.build()
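For readers without bistring installed, the same idea can be sketched with the standard library alone: transliterate one character at a time and record which output span each input character produced. This is only an illustration under assumptions: convert and the SPECIAL table are hypothetical stand-ins for unidecode, and the loop walks code points rather than the grapheme clusters that CharacterTokenizer would yield.

```python
import unicodedata
from typing import Callable, List, Tuple

Span = Tuple[int, int]

# Hypothetical one-entry table for characters that NFKD does not decompose.
SPECIAL = {"\u00df": "ss"}


def ascii_fold(ch: str) -> str:
    """Drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def convert(ch: str) -> str:
    """Hypothetical per-character transliterator (stand-in for unidecode)."""
    return SPECIAL[ch] if ch in SPECIAL else ascii_fold(ch)


def transliterate_aligned(
    text: str, fn: Callable[[str], str]
) -> Tuple[str, List[Tuple[Span, Span]]]:
    """Apply fn per character and keep (original span, modified span) pairs."""
    pieces: List[str] = []
    alignment: List[Tuple[Span, Span]] = []
    out_pos = 0
    for i, ch in enumerate(text):
        repl = fn(ch)
        pieces.append(repl)
        alignment.append(((i, i + 1), (out_pos, out_pos + len(repl))))
        out_pos += len(repl)
    return "".join(pieces), alignment


modified, alignment = transliterate_aligned("Stra\u00dfe caf\u00e9", convert)
print(modified)      # Strasse cafe
print(alignment[4])  # ((4, 5), (4, 6)): one input character became two
```

Since every replacement is recorded as it is made, the alignment stays exact even when a character expands to several, which is what the builder loop gets you inside bistring.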
By the way, it's on my backlog to implement support for ICU's Transliterator API which is more powerful than unidecode and similar things.
So since https://github.com/ovalhub/pyicu/issues/107 was implemented, I've tested out an implementation that wraps a bistr in a Replaceable for ICU. It works well for simple transliterations like Latin-ASCII, but for complicated ones like Greek-Latin, ICU does some strange things that I'm not sure how to cope with nicely:
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffff')
('Ὀδυσσεύς' ⇋ 'Ὀδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς\uffffO')
('Ὀδυσσεύς' ⇋ 'OὈδυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύςO')
('Ὀδυσσεύς' ⇋ 'O̓δυσσεύς')
('Ὀδυσσεύς' ⇋ 'Oδυσσεύς')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςO')
('Ὀδυσσεύς' ⇋ 'OδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'OdδυσσεύςOd')
('Ὀδυσσεύς' ⇋ 'Odδυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςd')
('Ὀδυσσεύς' ⇋ 'Odυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύςdy')
('Ὀδυσσεύς' ⇋ 'Odyυσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύς')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςy')
('Ὀδυσσεύς' ⇋ 'Odyσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύςys')
('Ὀδυσσεύς' ⇋ 'Odysσσεύς')
('Ὀδυσσεύς' ⇋ 'Odysσεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύς')
('Ὀδυσσεύς' ⇋ 'Odyssεύςs')
('Ὀδυσσεύς' ⇋ 'Odyssεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύςse')
('Ὀδυσσεύς' ⇋ 'Odysseεύς')
('Ὀδυσσεύς' ⇋ 'Odysseύς')
('Ὀδυσσεύς' ⇋ 'Odysseύςe')
('Ὀδυσσεύς' ⇋ 'Odysseύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύςeu')
('Ὀδυσσεύς' ⇋ 'Odysseuύς')
('Ὀδυσσεύς' ⇋ 'Odysseúς')
('Ὀδυσσεύς' ⇋ 'Odysseúς́')
('Ὀδυσσεύς' ⇋ 'Odysseúς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς́s')
('Ὀδυσσεύς' ⇋ 'Odysseúsς')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')
('Ὀδυσσεύς' ⇋ 'Odysseús')
Thank you for the great info and tips. Agreed that transliteration doesn't always make sense to do, e.g., your example.
I realize now why I didn't think to do it the way you mentioned. I had it in my mind that bistr keeps track of each operation's output instead of always overwriting modified, i.e., modified is a list so one could roll back to a certain state. I had built this into my own version of this. The use case was that I could see which operation caused the string transformation train to derail.
Ah I see, but that would be polystring, not bistring :). More seriously, I am considering adding a data type that would retain an entire history of transformations, rather than just the initial and final states. The Emacs region-specific undo buffer stuff seems to have that, for example, but I'm not sure what encoding they use. I imagine it's a persistent stack of ropes or something.
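As a very rough sketch of what such a history-keeping type might look like, the class below records the output of every named step so a pipeline can be inspected or rolled back. Everything here (StringHistory, Step, apply, rollback) is hypothetical and not part of bistring, and it stores plain strings only. A real version would presumably keep an alignment per step as well.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str
    output: str


class StringHistory:
    """Hypothetical container that remembers the output of every transformation."""

    def __init__(self, original: str) -> None:
        self.original = original
        self.steps: List[Step] = []

    @property
    def current(self) -> str:
        """The latest output, or the original if nothing has been applied."""
        return self.steps[-1].output if self.steps else self.original

    def apply(self, name: str, fn: Callable[[str], str]) -> "StringHistory":
        """Run fn on the current text and record the result under name."""
        self.steps.append(Step(name, fn(self.current)))
        return self

    def rollback(self, n: int) -> str:
        """Discard everything after the first n steps and return the text."""
        del self.steps[n:]
        return self.current


history = StringHistory("  Hello, World!  ")
history.apply("strip", str.strip)
history.apply("lower", str.lower)
history.apply("drop_commas", lambda s: s.replace(",", ""))

for step in history.steps:
    print(f"{step.name}: {step.output!r}")

print(history.rollback(1))  # Hello, World!
```

Printing the steps in order shows which operation first produced an unexpected string, which is the debugging use case described above.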