support custom abbreviation

Open krambox opened this issue 1 year ago • 1 comments

Currently, the abbreviations can only be adjusted by editing the files in the data directory. To make SoMaJo usable for domain-specific texts that have their own abbreviations, the constructor has been extended so that custom abbreviations can be passed in without forking SoMaJo.

krambox avatar May 07 '24 09:05 krambox

Thanks, this is something that has been requested a couple of times!

Before I merge it into develop, could you please address the following minor issues?

  • Add a space after the commas
  • Change the default value of custom_abbreviations to None (to avoid mutable default arguments)
  • Check the indentation level in TesttCustomAbbreviation
  • Fix the typo in TesttCustomAbbreviation
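
For readers unfamiliar with the mutable-default issue mentioned above, here is a minimal sketch of the usual `None` pattern. The class `AbbreviationHolder` and the function `shared_default` are hypothetical stand-ins for illustration, not SoMaJo's actual code; only the parameter name `custom_abbreviations` comes from this thread.

```python
class AbbreviationHolder:
    """Hypothetical stand-in for a tokenizer constructor; not SoMaJo's actual class."""

    def __init__(self, custom_abbreviations=None):
        # A default of [] would be created once and shared by every instance,
        # so None is used as the sentinel and a fresh list is built per call.
        if custom_abbreviations is None:
            custom_abbreviations = []
        # Copy so that later changes to the caller's list do not leak in.
        self.custom_abbreviations = list(custom_abbreviations)


def shared_default(items=[]):
    # Anti-pattern shown for contrast: the same list object is reused across calls.
    items.append("x")
    return items
```

With the `None` default, appending to one instance's list leaves other instances untouched, whereas repeated calls to `shared_default()` keep growing the same list.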

TODOs (intended as reminders to myself) until it can be merged into master and released:

  • Update the docstrings
  • When merging the custom abbreviations with the default list, check for duplicates and sort all abbreviations by length (it’s probably best to pass the custom abbreviations to utils.read_abbreviation_file() as an additional argument and initialize the abbreviations set with them, respecting to_lower)
  • Add an argument custom_single_token_abbreviations for abbreviations that should not be split (corresponding to the single_token_abbreviations_*.txt files)
  • Add the functionality to the command-line interface, e.g. via options that let the user provide custom abbreviation files
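
The merging step in the second TODO could look roughly like the sketch below. It is only an illustration under assumptions: `merge_abbreviations` is a hypothetical helper, not the signature of `utils.read_abbreviation_file()`, and sorting longest-first (with an alphabetical tie-break) is an assumed direction, since the thread only says "by length".

```python
def merge_abbreviations(default_abbreviations, custom_abbreviations=None, to_lower=False):
    """Hypothetical helper: merge two abbreviation lists, drop duplicates, sort by length."""
    abbreviations = set()
    for abbr in list(default_abbreviations) + list(custom_abbreviations or []):
        abbr = abbr.strip()
        if not abbr:
            continue  # skip empty entries
        # Normalizing before adding to the set makes deduplication respect to_lower.
        abbreviations.add(abbr.lower() if to_lower else abbr)
    # Assumption: longest first; alphabetical tie-break keeps the order deterministic.
    return sorted(abbreviations, key=lambda a: (-len(a), a))


merged = merge_abbreviations(["z.B.", "Dr."], ["Dr.", "u.a.m."])
```

Using a set for the intermediate result removes duplicates between the default and custom lists without an explicit membership check.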

tsproisl avatar May 14 '24 09:05 tsproisl