
Several problems with AsvToolboxSplitterAlgorithm's handleLastSplit() method

Open reckart opened this issue 10 years ago • 0 comments

What version of the product are you using? On what operating system?
The version currently browsable on Google Code. Linux.

Issue 1.
The boolean isInvStartsWith defined on line 418 never has any effect and can be removed.
The 'isInvStartsWith' case cannot occur, because we always pass the last segment of the
compound as aSplit to the method, so either aSplit.startsWith(rest) or
aSplit.startsWith(restGrund) holds.

What steps will reproduce the problem?
1. comment out isInvStartsWith
2. test on a bunch of examples
3. no difference in behavior

What is the expected output? What do you see instead?

Output is fine, this is a sanity issue. If that variable has a role, I misunderstood
the code completely.

Issue 2.
In line 416, isEqual should be:
boolean isEqual = /*aSplit.equals(restGrund) ||*/ aSplit.equals(rest);
i.e. the 'equals restGrund' case should not be considered.

As the code stands, the last part of the compound is never lemmatized. If that is
desired, this is a non-issue, but I find it counterintuitive (typically the last part
of the noun is what gets inflected).
Sometimes the equality check also prevents reducing an inner part (see the 2nd example below).

What steps will reproduce the problem?
1. comment out the part above
2. test on a bunch of examples
3. the behavior differs: the last part of the compound now gets lemmatized. I think this
is desirable, and the inflection can be dropped entirely (it is a standard inflection,
not a linking morpheme that should be annotated).

What is the expected output? What do you see instead?
    INPUT                            DESIRED (in my opinion)                 OBSERVED
1.  Bankdienstleistungen             Bank+dienst+leistung                    Bank+dienst+leistungen
2.  Fußbodenschleifmaschinenverleih  Fuß+boden+schleif+maschine+(n)+verleih  Fuß+boden+schleif+maschinen+verleih
3.  Halsschmerzen                    Hals+schmerz                            Hals+schmerzen
4.  Klimaschutzzielen                Klima+schutz+ziel                       Klima+schutz+zielen
5.  Kopfschmerzen                    Kopf+schmerz                            Kopf+schmerzen


Issue 3.
If Issue 2 is approved and the change is made, it surfaces a bug in line 436 (which was
not an issue before, while the last part was not lemmatized).

That line assumes that the reduced (lemma) form is never longer than the inflected
form. This is not always true, see below.

What steps will reproduce the problem?
1. Implement the change suggested in Issue 2, i.e. remove equals(restGrund) check.
2. test with "Betriebsmodi"
3. substring throws a StringIndexOutOfBoundsException

What is the expected output? What do you see instead?
Betriebsmodi    Betrieb+(s)+modus
Instead: an exception is thrown.

Fix: add a check around line 436:
    // there is something at the end, this is not true for irregular cases where
    // inflected form gets shortened: "modus" --> "modi" (plural)
    if (rest.length() > restGrund.length()) {
        retvec.add("(" + rest.substring(restGrund.length()) + ")");
    }
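
To illustrate why the guard matters, here is a minimal standalone sketch. It is not the actual handleLastSplit() code: the class SuffixGuardSketch and the method trailingInflection are hypothetical names made up for this example, and only rest, restGrund and retvec are taken from the snippet above. It shows that the unguarded substring call fails when the lemma is longer than the inflected form, while the guarded variant simply adds nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class, not part of AsvToolboxSplitterAlgorithm.
final class SuffixGuardSketch {

    // Returns the trailing inflection of 'rest' wrapped in parentheses, or an
    // empty Optional when the lemma 'restGrund' is not shorter than 'rest'
    // (e.g. the irregular plural "modi" with the lemma "modus").
    static Optional<String> trailingInflection(String rest, String restGrund) {
        if (rest.length() > restGrund.length()) {
            return Optional.of("(" + rest.substring(restGrund.length()) + ")");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Guarded: nothing is appended for the shortened form.
        List<String> retvec = new ArrayList<>();
        retvec.add("modus");
        trailingInflection("modi", "modus").ifPresent(retvec::add);
        System.out.println(retvec); // prints [modus]

        // Unguarded: the begin index 5 exceeds the length of "modi" (4).
        try {
            "modi".substring("modus".length());
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring failed: " + e.getMessage());
        }
    }
}
```

Note that this sketch also returns nothing when both forms have the same length, which matches the strict '>' comparison in the proposed check.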

Original issue reported on code.google.com by [email protected] on 2014-12-22 11:03:26

reckart, May 12 '15 22:05