andaluh-py
Problem with dotless i
Transcribing a word such as "Cacık", which contains a special character (here the dotless i, "ı"), raises an exception.
I reviewed the code where the exception is raised:
    def replace_const_end_with_case(match):
        repl_rules = {
            'a': 'â', 'A': 'Â', 'á': 'â', 'Á': 'Â',
            'e': 'ê', 'E': 'Ê', 'é': 'ê', 'É': 'Ê',
            'i': 'î', 'I': 'Î', 'í': 'î', 'Í': 'Î',
            'o': 'ô', 'O': 'Ô', 'ó': 'ô', 'Ó': 'Ô',
            'u': 'û', 'U': 'Û', 'ú': 'û', 'Ú': 'Û'
        }
        word = match.group(0)
        prefix = match.group(1)
        suffix_vowel = match.group(2)
        suffix_const = match.group(3)
        else_cond = any(
            s in prefix
            for s in ('á', 'é', 'í', 'ó', 'ú', 'Á', 'É', 'Í', 'Ó', 'Ú'))
        if word.lower() in list(WORDEND_CONST_RULES_EXCEPT.keys()):
            return keep_case(word, WORDEND_CONST_RULES_EXCEPT[word.lower()])
        elif else_cond:
            return prefix + repl_rules[suffix_vowel]
        else:
            if suffix_const.isupper():
                return prefix + repl_rules[suffix_vowel] + 'H'
            else:
                return prefix + repl_rules[suffix_vowel] + 'h'  # <--- EXACTLY HERE
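To make the failure concrete, here is a minimal sketch of what most likely happens at the marked line. It assumes the exception is a `KeyError` (the traceback is not included above) and that the regex captured "ı" as `suffix_vowel`. The `lookup` helper is hypothetical and only exists to show the missing key without crashing.

```python
# Only the 'i' row of repl_rules from the function above is needed here.
repl_rules = {'i': 'î', 'I': 'Î', 'í': 'î', 'Í': 'Î'}

# 'ı' is U+0131 LATIN SMALL LETTER DOTLESS I, as in "Cacık".
# Assumption: this is the value captured into suffix_vowel.
suffix_vowel = '\u0131'


def lookup(vowel):
    """Hypothetical helper: return the replacement, or None if no rule exists."""
    try:
        return repl_rules[vowel]
    except KeyError:
        # This is the error path that the original code does not handle.
        return None


print(lookup('i'))           # the plain i has a rule
print(lookup(suffix_vowel))  # the dotless i has none
```

The dotless i is a different code point from "i" and "í", so the dictionary lookup has no entry for it.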
I see two ways to solve this:
- Add more entries to the repl_rules map
- Filter out words that contain characters that are not normally used in Spanish.
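Both options can be sketched roughly as follows. This is not the library's code: `EXTRA_RULES`, `circumflex`, `SPANISH_LETTERS` and `is_spanish_word` are hypothetical names, and mapping the dotless and dotted Turkish i to "î"/"Î" is only an assumption about the desired output.

```python
# Same table as repl_rules in replace_const_end_with_case.
REPL_RULES = {
    'a': 'â', 'A': 'Â', 'á': 'â', 'Á': 'Â',
    'e': 'ê', 'E': 'Ê', 'é': 'ê', 'É': 'Ê',
    'i': 'î', 'I': 'Î', 'í': 'î', 'Í': 'Î',
    'o': 'ô', 'O': 'Ô', 'ó': 'ô', 'Ó': 'Ô',
    'u': 'û', 'U': 'Û', 'ú': 'û', 'Ú': 'Û',
}

# Option 1: add more entries.
# Assumption: dotless i (U+0131) and dotted capital I (U+0130) behave like i/I.
EXTRA_RULES = {'\u0131': 'î', '\u0130': 'Î'}
ALL_RULES = {**REPL_RULES, **EXTRA_RULES}


def circumflex(vowel):
    """Return the circumflexed vowel, or the input unchanged if no rule exists."""
    return ALL_RULES.get(vowel, vowel)


# Option 2: filter out words with characters not normally used in Spanish.
SPANISH_LETTERS = set('abcdefghijklmnñopqrstuvwxyzáéíóúü')


def is_spanish_word(word):
    """True if every character of the lowercased word is a Spanish letter."""
    return all(ch in SPANISH_LETTERS for ch in word.lower())


print(circumflex('\u0131'))          # handled by the extra entry
print(is_spanish_word('Cacık'))      # rejected: contains dotless i
print(is_spanish_word('Andalucía'))  # accepted
```

Using `.get(vowel, vowel)` also means an unknown vowel falls through unchanged instead of raising, which may be a useful safety net whichever option is chosen.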
I'll check other kinds of characters to learn more about this problem.