CRF
Feature/optional lowercase
This pull request extends the command line tool for training and evaluation (iitb.Segment.Segment) by making downcasing of tokens optional.
Downcasing seems to be a destructive action, because it is done before the features are generated. Some languages (e.g. German) depend on capitalisation for distinguishing words, so this is valuable information which should not be removed.
To avoid breaking existing setups, there are new methods which handle the optional downcasing. It is on by default, but can be switched off by adding "lowercase=false" to the configuration.
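As a rough illustration of the switch described above (a sketch only, not the code in this pull request): the class `OptionalLowercaseSketch` and the helpers `lowercaseEnabled` and `normalize` are hypothetical names, and reading the option through `java.util.Properties` is an assumption about how the configuration is accessed. Only the key "lowercase" and its default of being on come from the description.

```java
import java.util.Locale;
import java.util.Properties;

public class OptionalLowercaseSketch {

    // Hypothetical helper, not part of iitb.Segment.Segment.
    // A missing "lowercase" key falls back to "true", so existing
    // configurations keep the old downcasing behaviour.
    static boolean lowercaseEnabled(Properties options) {
        return Boolean.parseBoolean(options.getProperty("lowercase", "true"));
    }

    // Downcase the token only when the option is enabled; otherwise
    // return it unchanged so capitalisation is available to the features.
    static String normalize(String token, Properties options) {
        return lowercaseEnabled(options) ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();   // no "lowercase" key set
        Properties keepCase = new Properties();
        keepCase.setProperty("lowercase", "false");

        System.out.println(normalize("Essen", defaults)); // prints "essen"
        System.out.println(normalize("Essen", keepCase)); // prints "Essen"
    }
}
```

With the default, the German noun "Essen" (food) and the verb "essen" (to eat) collapse into the same token; with "lowercase=false" they stay distinct.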
Tests are included and pass (they are modified copies of the original tests).
Running the applications with the sample dataset also seems to work fine.
@MRudolph Sorry for the late reply. This is a big change; I need to do more testing and may need more time to review the code.