Improve parsing of names that include diacritics
As we discussed in #95 (parts of which I copied into this issue), we should improve the parsing of names that include diacritics (such as ľščťžýáíéúäôňďěŕĺöüűő).
As discussed there, Lingua::EN::NameParse (which you use for parsing names) does not currently support names with diacritics. However, its POD documentation includes the following notes:
Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-English languages.
Names with accented characters (acute, circumflex etc) will not be parsed correctly. A work around is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.
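To illustrate the trade-off described in that second note, here is a minimal sketch in Python. It is not the module's actual Perl grammar: the two patterns and the `matches` helper are hypothetical stand-ins for a rule that matches a single name word, once with an ASCII-only character class and once with `\w`.

```python
import re

# Hypothetical stand-ins for a grammar rule that matches one name word.
ASCII_WORD = re.compile(r"[A-Za-z]+")  # analogous to the [a-z] character class
UNICODE_WORD = re.compile(r"\w+")      # analogous to the \w workaround


def matches(pattern, word):
    """Return True if the whole word is accepted by the pattern."""
    return pattern.fullmatch(word) is not None


for word in ("Maria", "Mária", "M4ria"):
    print(word, matches(ASCII_WORD, word), matches(UNICODE_WORD, word))
```

The ASCII-only pattern rejects Mária, while the `\w` pattern accepts it. The `\w` pattern also accepts M4ria, because `\w` covers digits and underscores as well as letters, which is one way the workaround could lower accuracy for plain ASCII input.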
So I think using that workaround would be good enough for now, but it would be nice (if possible) to restore the names' original spelling after parsing, that is:
- remove the diacritics (Mária → Maria),
- parse the names as usual,
- replace the parsed names with their original form (Maria → Mária).
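The three steps above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: `placeholder_parse` is a hypothetical stand-in for the real parser (it just treats the last word as the surname), and the component names `given_name` and `surname` are made up for the example. Only the strip and restore steps are the point here.

```python
import unicodedata


def strip_diacritics(text):
    """Step 1: decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def placeholder_parse(ascii_name):
    """Step 2 (hypothetical stand-in for the real name parser)."""
    parts = ascii_name.split()
    return {"given_name": " ".join(parts[:-1]), "surname": parts[-1]}


def restore_diacritics(components, original):
    """Step 3: map each parsed word back to its original spelling."""
    # Key: stripped, lower-cased word. Value: the word as originally written.
    lookup = {strip_diacritics(w).lower(): w for w in original.split()}
    return {
        key: " ".join(lookup.get(w.lower(), w) for w in value.split())
        for key, value in components.items()
    }


def parse_with_diacritics(name):
    name = unicodedata.normalize("NFC", name)
    parsed = placeholder_parse(strip_diacritics(name))
    return restore_diacritics(parsed, name)


print(parse_with_diacritics("Mária Kováčová"))
```

One limitation of this lookup-based restore: if two words in the same name differ only in their diacritics (for example Maria and Mária), they collide on the same key, so a real implementation would probably need to track word positions instead.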
However, it would be much better to implement Lingua::SK::NameParse, as suggested in the Future directions note. I’d like to contact Kim Ryan (the developer of Lingua::EN::NameParse) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer, so I could mainly assist with the linguistic/algorithmic part. Are you willing to help with coding this parser, or are you busy enough with other stuff? :)