gedcom icon indicating copy to clipboard operation
gedcom copied to clipboard

Improve parsing of names that include diacritics

Open tukusejssirs opened this issue 4 years ago • 3 comments

As we talked in #95 (from which I simply copied portions to this issue), we should improve the parsing of names that include diacritics (like ľščťžýáíéúäôňďěŕĺöüűő).

As we talked there, Lingua::EN::NameParse (which you use for parsing names) currently does not support parsing names with diacritics. However, Lingua::EN::NameParse has the following notes in its perlpod docs:

FUTURE DIRECTIONS

Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-english languages.

BUGS

Names with accented characters (acute, circumfelx etc) will not be parsed correctly. A work around is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.

So, I think for now it would be good enough to use that workaround, but it would be nice (if it is possible) to re-replace the names with their original spelling after parsing, that is:

  • remove the diacritics (MáriaMaria),
  • parse the names as usual,
  • replace the parsed names with their original form (MariaMária).

However, it would be much better to implement Lingua::SK::NameParse as it is written in the _Future directions. I’d like to contact Kim Ryan (the dev of Lingua::En::NameParse) if he is interested. Although I can code in Perl a bit, I am not a pro programmer. I could mainly assist in the liguist/algorithm part. Are you willing to help with the coding of this parser? Or you are busy enough with other stuff? :)

tukusejssirs avatar Oct 19 '19 13:10 tukusejssirs