dkpro-core
Several problems with AsvToolboxSplitterAlgorithm's handleLastSplit() method
What version of the product are you using? On what operating system?
The version currently browsable on Google Code. Linux.
Issue 1.
The boolean isInvStartsWith defined on line 418 never takes effect and can be removed.
The 'isInvStartsWith' case cannot occur, because we always pass the last segment of the
compound as aSplit to the method, so either aSplit.startsWith(rest) or
aSplit.startsWith(restGrund) holds.
What steps will reproduce the problem?
1. comment out isInvStartsWith
2. test on a bunch of examples
3. no difference in behavior
What is the expected output? What do you see instead?
Output is fine, this is a sanity issue. If that variable has a role, I misunderstood
the code completely.
Issue 2.
In line 416, isEqual should be:
boolean isEqual = /*aSplit.equals(restGrund) ||*/ aSplit.equals(rest);
i.e. it should not consider the 'equals restGrund' case.
As the code stands, the last part of the compound is never lemmatized. If that is
desired, this is a non-issue, but I find it counterintuitive (typically the last part of
the noun is what gets inflected).
Sometimes the equality check also prevents reducing an inner part (see the 2nd example below).
What steps will reproduce the problem?
1. comment out the part above
2. test on a bunch of examples
3. The behavior differs in that the last part of the compound gets lemmatized. I think
this is desirable, and the inflection can be dropped entirely (it is a standard
inflection, not a linking morpheme that should be annotated).
What is the expected output? What do you see instead?
   INPUT                            DESIRED (in my opinion)                 OBSERVED
1. Bankdienstleistungen             Bank+dienst+leistung                    Bank+dienst+leistungen
2. Fußbodenschleifmaschinenverleih  Fuß+boden+schleif+maschine+(n)+verleih  Fuß+boden+schleif+maschinen+verleih
3. Halsschmerzen                    Hals+schmerz                            Hals+schmerzen
4. Klimaschutzzielen                Klima+schutz+ziel                       Klima+schutz+zielen
5. Kopfschmerzen                    Kopf+schmerz                            Kopf+schmerzen
Issue 3.
If Issue 2 is approved and changed, it surfaces a bug in line 436 (which was not
an issue before, when the last part did not get lemmatized).
Namely, this line assumes that the reduced (lemma) form is never longer than the
inflected form. This is not always true, see below.
What steps will reproduce the problem?
1. Implement the change suggested in Issue 2, i.e. remove equals(restGrund) check.
2. test with "Betriebsmodi"
3. substring throws a StringIndexOutOfBoundsException
What is the expected output? What do you see instead?
Betriebsmodi Betrieb+(s)+modus
Instead: exception thrown.
Fix: add a check around line 436:
    // there is something at the end; this is not true for irregular cases where
    // the inflected form gets shortened: "modus" --> "modi" (plural)
    if (rest.length() > restGrund.length()) {
        retvec.add("(" + rest.substring(restGrund.length()) + ")");
    }
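
To make the failure mode easier to check in isolation, here is a minimal standalone sketch of the guard. It is not the actual handleLastSplit() code: the class name, the two helper methods, and the use of a List for retvec are assumptions made for illustration, as is adding restGrund to the result before the suffix. Only the substring call and the length check are taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of AsvToolboxSplitterAlgorithm.
class LastSplitSuffixDemo {

    // Unguarded variant: uses the substring call as quoted in this issue.
    // Throws when the base form (restGrund) is longer than the surface form (rest).
    static List<String> appendSuffixUnguarded(String rest, String restGrund) {
        List<String> retvec = new ArrayList<>();
        retvec.add(restGrund); // assumption: the base form is emitted first
        retvec.add("(" + rest.substring(restGrund.length()) + ")");
        return retvec;
    }

    // Guarded variant: only emits a suffix if the surface form is strictly longer.
    static List<String> appendSuffixGuarded(String rest, String restGrund) {
        List<String> retvec = new ArrayList<>();
        retvec.add(restGrund); // assumption: the base form is emitted first
        if (rest.length() > restGrund.length()) {
            retvec.add("(" + rest.substring(restGrund.length()) + ")");
        }
        return retvec;
    }

    public static void main(String[] args) {
        // Regular case: the inflected form is longer than the base form.
        System.out.println(appendSuffixGuarded("maschinen", "maschine")); // [maschine, (n)]

        // Irregular case: the inflected form is shorter than the base form.
        System.out.println(appendSuffixGuarded("modi", "modus")); // [modus]

        try {
            appendSuffixUnguarded("modi", "modus");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded call failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Because the check is a strict `>`, it also avoids emitting an empty "()" when rest and restGrund have the same length.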
Original issue reported on code.google.com by [email protected] on 2014-12-22 11:03:26