generalized-language-modeling-toolkit
How to treat reserved symbols in Training and Querying files
The currently reserved symbols are _ (absolute skip), % (continuation skip), and / (token-pos-separator).
IIRC the program fails if any of these are contained in training or querying files.
How do we cope with this issue?
Commit 9e4c6a7e740eaa55183431a5748fe31e445054b4 scans the corpus for reserved symbols and refuses to run if it contains any.
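For illustration, here is a minimal Python sketch of that kind of check. It is not the code from the commit; the names `RESERVED_SYMBOLS`, `find_reserved`, and `check_corpus` are hypothetical, and it assumes the corpus is available as an iterable of text lines.

```python
# Hypothetical sketch of a "scan and refuse" check, not the toolkit's actual code.
# The symbols and their roles are the ones listed in this issue.
RESERVED_SYMBOLS = {
    "_": "absolute skip",
    "%": "continuation skip",
    "/": "token-pos-separator",
}


def find_reserved(lines):
    """Return a list of (line_number, column, symbol), both 1-based."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch in RESERVED_SYMBOLS:
                hits.append((line_no, col, ch))
    return hits


def check_corpus(lines):
    """Raise ValueError on the first reserved symbol, otherwise return None."""
    hits = find_reserved(lines)
    if hits:
        line_no, col, ch = hits[0]
        raise ValueError(
            f"reserved symbol {ch!r} ({RESERVED_SYMBOLS[ch]}) "
            f"at line {line_no}, column {col}"
        )
```

Reporting the line and column of the first offending symbol would at least let the user locate and fix the input by hand.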
However, in the long run I would like to have some form of escaping for the input, so that the reserved symbols are transparent to the user.
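One possible shape for such escaping, as a rough Python sketch only: replace each reserved symbol with a two-character sequence introduced by a backslash, and reverse the mapping on output. The escape character, the codes `u`, `p`, `s`, and the function names are all assumptions made for illustration; the sketch also assumes that the backslash is not itself reserved by the toolkit.

```python
# Hypothetical escaping scheme, not something the toolkit currently implements.
ESCAPE_CHAR = "\\"

# Reserved symbol (or the escape character itself) -> escaped form.
ESCAPES = {
    "\\": "\\\\",  # a literal backslash is doubled
    "_": "\\u",    # absolute skip
    "%": "\\p",    # continuation skip
    "/": "\\s",    # token-pos-separator
}

# Character following the escape character -> original symbol.
UNESCAPES = {"\\": "\\", "u": "_", "p": "%", "s": "/"}


def escape_token(token):
    """Replace reserved symbols so the result contains none of them."""
    return "".join(ESCAPES.get(ch, ch) for ch in token)


def unescape_token(token):
    """Invert escape_token; raise ValueError on a malformed escape."""
    out = []
    chars = iter(token)
    for ch in chars:
        if ch == ESCAPE_CHAR:
            code = next(chars, None)  # consume the character after the backslash
            if code not in UNESCAPES:
                raise ValueError(f"invalid escape sequence in {token!r}")
            out.append(UNESCAPES[code])
        else:
            out.append(ch)
    return "".join(out)
```

With a scheme like this, input would be escaped while reading training and querying files and unescaped when printing results, so users would never see the escaped form. Doubling the escape character keeps the mapping reversible even when the input already contains backslashes.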