SoMaJo
support custom abbreviation
Currently, the abbreviations can only be adjusted by editing the files in the data directory directly. To make SoMaJo usable for domain-specific texts that have their own abbreviations, the constructor has been extended so that custom abbreviations can be passed in without forking SoMaJo.
Thanks, this is something that has been requested a couple of times!
Before I merge it into develop, could you please address the following minor issues?
- Add a space before the commas
- Change the default value of `custom_abbreviations` to `None` (to avoid mutable default arguments)
- Check the indentation level in `TesttCustomAbbreviation`
- Fix the typo in `TesttCustomAbbreviation`
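
For reference, the `None` default requested above could look roughly like the following sketch. `TokenizerSketch` is a hypothetical stand-in and does not reproduce SoMaJo's actual constructor signature; only the parameter name `custom_abbreviations` comes from this PR.

```python
class TokenizerSketch:
    """Hypothetical stand-in for the tokenizer class (not SoMaJo's real API)."""

    def __init__(self, language, custom_abbreviations=None):
        self.language = language
        # Build a fresh list per instance. With a default of [] the same list
        # object would be shared by every instance created without the argument.
        if custom_abbreviations is None:
            self.custom_abbreviations = []
        else:
            self.custom_abbreviations = list(custom_abbreviations)


first = TokenizerSketch("de_CMC")
first.custom_abbreviations.append("z.Hd.")

# The second instance starts with its own empty list.
second = TokenizerSketch("de_CMC")
```

Copying the argument with `list(...)` also means that later changes to the caller's list do not leak into the tokenizer.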
TODOs (intended as reminders to myself) until it can be merged into master and released:
- Update the docstrings
- When merging the custom abbreviations with the default list, check for duplicates and sort all abbreviations by length (it’s probably best to pass the custom abbreviations to `utils.read_abbreviation_file()` as an additional argument and initialize the `abbreviations` set with them, respecting `to_lower`)
- Add an argument `custom_single_token_abbreviations` for abbreviations that should not be split (corresponding to the `single_token_abbreviations_*.txt` files)
- Add the functionality to the command-line interface, e.g. via options that let the user provide custom abbreviation files
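
A minimal sketch of the merge step from the TODO list, under stated assumptions: `merge_abbreviations` is a hypothetical helper, not the existing `utils.read_abbreviation_file()`, and sorting longest-first is a guess at the intended direction (it is the usual choice when the abbreviations end up in a regex alternation).

```python
def merge_abbreviations(default_abbreviations, custom_abbreviations=None, to_lower=False):
    """Merge default and custom abbreviations (hypothetical helper).

    Duplicates are removed via a set; the result is sorted by length,
    longest first (assumed direction), with alphabetical tie-breaking.
    """
    abbreviations = set()
    for abbreviation in list(default_abbreviations) + list(custom_abbreviations or []):
        abbreviation = abbreviation.strip()
        if not abbreviation:
            continue  # skip empty entries such as blank lines
        abbreviations.add(abbreviation.lower() if to_lower else abbreviation)
    return sorted(abbreviations, key=lambda a: (-len(a), a))


merged = merge_abbreviations(
    ["z.B.", "Dr.", "usw."],      # stand-in for the default list
    ["Dr.", "i.d.R.", "Abb."],    # custom, domain-specific additions
)
```

Applying `to_lower` before adding to the set means that entries differing only in case collapse into one, which seems to be what "respecting `to_lower`" implies.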