Add localization (and reduce regex duplication)

Open TBG-FR opened this issue 4 years ago • 0 comments

Hello again,

While working and searching in the Parser/EmailParser.php file, I thought about something:

Instead of having things like

private $quoteHeadersRegex = array(
    '/^.{0,5}(On(?:(?!\bOn\b|\bwrote(\s|\xc2\xa0)?:).){0,1000}wrote(\s|\xc2\xa0)?:)$/ms', // On DATE, NAME <EMAIL> wrote:
    '/^.{0,5}(Le\b(?:(?!\bLe\b|\bécrit(\s|\xc2\xa0)?:).){0,1000}écrit(\s|\xc2\xa0)?:)$/ms', // Le DATE, NAME <EMAIL> a écrit :
    '/^.{0,5}(El(?:(?!\bEl\b|\bescribió\s?:).){0,1000}escribió\s?:)$/ms', // El DATE, NAME <EMAIL> escribió:
    '/^.{0,5}(El(?:(?!\bEl\b|\bha escrit\s?:).){0,1000}ha escrit\s?:)$/ms', // El DATE, NAME <EMAIL> ha escrit:
    '/^.{0,5}(Il(?:(?!\bIl\b|\bscritto(\s|\xc2\xa0)?:).){0,1000}scritto(\s|\xc2\xa0)?:)$/ms', // Il DATE, NAME <EMAIL> ha scritto:
    [...]
    '/^\s*(From\s?:.+\s?(\[|<).+(\]|>))/mu', // "From: NAME <EMAIL>" OR "From : NAME <EMAIL>" OR "From : NAME<EMAIL>"(With support whitespace before start and before <)
    '/^\s*(发件人\s?:.+\s?(\[|<).+(\]|>))/mu', // "发件人: NAME <EMAIL>" OR "发件人 : NAME <EMAIL>" OR "发件人 : NAME<EMAIL>"(With support whitespace before start and before <)
    '/^\s*(De\s?:.+\s?(\[|<).+(\]|>))/mu', // "De: NAME <EMAIL>" OR "De : NAME <EMAIL>" OR "De : NAME<EMAIL>"  (With support whitespace before start and before <)
    '/^\s*(Van\s?:.+\s?(\[|<).+(\]|>))/mu', // "Van: NAME <EMAIL>" OR "Van : NAME <EMAIL>" OR "Van : NAME<EMAIL>"  (With support whitespace before start and before <)
    '/^\s*(Da\s?:.+\s?(\[|<).+(\]|>))/mu', // "Da: NAME <EMAIL>" OR "Da : NAME <EMAIL>" OR "Da : NAME<EMAIL>"  (With support whitespace before start and before <)
    [...]
);

couldn't we have only one variabilized line for each "type" of reply like that (of course it's only a draft):

private $quoteHeadersRegex = array(
    '/^.{0,5}($on(?:(?!\b$on\b|\b$wrote(\s|\xc2\xa0)?:).){0,1000}$wrote(\s|\xc2\xa0)?:)$/ms', // On DATE, NAME <EMAIL> wrote:
    [...]
    '/^\s*($from\s?:.+\s?(\[|<).+(\]|>))/mu', // "From: NAME <EMAIL>" OR "From : NAME <EMAIL>" OR "From : NAME<EMAIL>"(With support whitespace before start and before <)
    [...]
);

Then we would run these Regex checks using a list of language files, so for example $wrote would be checked with "wrote", then "a écrit", then "escribió", ...

Here are the advantages I see in that modification:

Adding a new language or variation is easier
You don't have to duplicate X times the same Regex, modifying one or two words each time
You're less likely to make a mistake in a Regex

That was my two cents, thanks for reading 😉

Aug 04 '21 09:08 TBG-FR