Feature/optional lowercase

Open MRudolph opened this issue 10 years ago • 1 comments

This pull request extends the command line tool for training and evaluation (iitb.Segment.Segment) by making downcasing of tokens optional.

Downcasing is a destructive step, because it happens before the features are generated. Some languages (e.g. German) rely on capitalisation to distinguish words, so case can be a valuable signal that should not be removed.

To avoid breaking existing setups, there are new methods that handle the optional downcasing. Downcasing stays on by default, but can be switched off by adding "lowercase=false" to the configuration.
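As a rough illustration of the intended behaviour (not the code in this pull request), the option could be read and applied along these lines. The class `LowercaseOption` and the methods `lowercaseEnabled` and `normalize` are hypothetical names used only for this sketch; the only detail taken from the description above is the `lowercase` key and its default of on.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: not part of iitb.Segment.Segment.
final class LowercaseOption {

    // Reads the "lowercase" key; a missing key means downcasing stays on,
    // so existing configurations keep their current behaviour.
    static boolean lowercaseEnabled(Properties options) {
        return Boolean.parseBoolean(options.getProperty("lowercase", "true"));
    }

    // Downcases the token only when the option is enabled.
    static String normalize(String token, boolean lowercase) {
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        Properties options = new Properties();
        // Default configuration: prints "haus"
        System.out.println(normalize("Haus", lowercaseEnabled(options)));

        options.setProperty("lowercase", "false");
        // With lowercase=false: prints "Haus"
        System.out.println(normalize("Haus", lowercaseEnabled(options)));
    }
}
```

Defaulting to `true` when the key is absent is what keeps older configuration files working unchanged.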

Tests are included and succeed (they are modified copies of the original tests).

Running the applications with the sample dataset also seems to work fine.

MRudolph, Oct 13 '14 14:10

@MRudolph Sorry for the late reply. This is a big change, so I need to do more testing and may need more time to review the code.

witgo, Oct 21 '14 14:10