Thomas Proisl

26 comments by Thomas Proisl

Thank you for reporting this. It’s difficult to reproduce the problem. Could you provide an example input file for which the error occurs?

Hi, unfortunately I’m not familiar with Rust or with SRX rules. From the link you posted, it seems like it should be possible to express SoMaJo’s sentence splitting rules in...

The sentence splitter operates on tokenized input, so splitting sentences without first tokenizing the text is not possible. However, there are two ways to extract untokenized sentences from SoMaJo's output....
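
One way to get untokenized sentences back from tokenized output can be sketched as follows. This is only an illustration, not necessarily either of the two ways meant above: the `Token` class is a hypothetical stand-in, assumed to carry `text`, `space_after` and `original_spelling` attributes like the tokens SoMaJo returns, and `detokenize` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Hypothetical stand-in for a tokenizer's token object (assumption).
    text: str
    space_after: bool = True
    original_spelling: Optional[str] = None


def detokenize(sentence: List[Token]) -> str:
    """Rebuild the untokenized sentence string from a list of tokens."""
    parts = []
    for token in sentence:
        # Prefer the original spelling if the tokenizer normalized the token.
        if token.original_spelling is not None:
            parts.append(token.original_spelling)
        else:
            parts.append(token.text)
        # Re-insert whitespace only where the input had it.
        if token.space_after:
            parts.append(" ")
    return "".join(parts).strip()


sentence = [
    Token("This"),
    Token("works", space_after=False),
    Token("."),
]
print(detokenize(sentence))
```

The text is still tokenized first; the sentence boundaries found on the tokens are simply mapped back to strings afterwards.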

Sorry for the delayed response. Abbreviations are defined in `src/somajo/data`:
- `abbreviations_(de|en).txt`: Abbreviations that are not matched by `(?:[[:alpha:]]\.){2,}`, i.e. that are not sequences of single letters followed by single dots....
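
To make the pattern concrete, here is a minimal sketch of which strings it covers. Python's standard `re` module does not support the POSIX class `[[:alpha:]]`, so the sketch substitutes `[^\W\d_]` as an approximation; the helper name `is_single_letter_sequence` is made up for illustration and is not part of SoMaJo.

```python
import re

# Approximation of (?:[[:alpha:]]\.){2,} for the standard re module:
# [^\W\d_] matches a single Unicode letter (assumption: close enough to [[:alpha:]]).
SINGLE_LETTER_SEQUENCE = re.compile(r"(?:[^\W\d_]\.){2,}")


def is_single_letter_sequence(candidate: str) -> bool:
    """True if the whole string is two or more 'letter + dot' pairs."""
    return SINGLE_LETTER_SEQUENCE.fullmatch(candidate) is not None


for candidate in ["z.B.", "e.g.", "etc.", "bzw.", "a."]:
    print(candidate, is_single_letter_sequence(candidate))
```

Under this reading, abbreviations such as `z.B.` are already covered by the pattern, while ones such as `etc.` or `bzw.` are the kind that would need an explicit entry in a list.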

Thanks, this is something that has been requested a couple of times! Before I merge it into develop, could you please address the following minor issues?
- Add a space...

Yeah, this would break things like "Es sind noch ungefähr 5km." At first glance, this looks like it might be tricky.