parse-latin

Using custom prefix exceptions

Open gorango opened this issue 4 years ago • 2 comments

Certain texts can contain abbreviations that are not captured by the regex tests.

I'm curious if this might be worth adding as an option on the constructor - instead of having to rely on extending the plugins (the way ParseEnglish does):

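// Abridged from parse-english: `modifyChildren` comes from unist-util-modify-children,
// and `mergeEnglishPrefixExceptions` is parse-english's own plugin.
class ParseEnglish extends ParseLatin {}

// Prepend the exception plugin so it runs before the inherited paragraph plugins.
ParseEnglish.prototype.tokenizeParagraphPlugins = [
  modifyChildren(mergeEnglishPrefixExceptions)
].concat(ParseEnglish.prototype.tokenizeParagraphPlugins)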

I was originally going to open an issue on the parse-english repo but I thought it might be worth exploring the option here for a few reasons:

  • ParseLatin can be reliably used for most European languages, with abbreviations being the main barrier to accurate sentence tokenization.
  • Exceptions can be supplied as a string array - providing a simpler interface for i18n support out of the box.
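To make the string-array idea concrete, here is a minimal sketch of how such a list could be compiled into a word test. The helper name `createAbbreviationTest` and the sample abbreviations are hypothetical; nothing like this exists in parse-latin today, and a real option would still need to be wired into the paragraph plugins.

```javascript
// Hypothetical helper, NOT part of parse-latin's API.
// Turns a list of abbreviations (without the trailing period) into a
// case-insensitive test that matches a whole word.
function createAbbreviationTest(abbreviations) {
  // Escape regex metacharacters so entries such as "e.g" match literally.
  const escaped = abbreviations.map((d) =>
    d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  )
  const expression = new RegExp('^(?:' + escaped.join('|') + ')$', 'i')

  return (word) => expression.test(word)
}

// Sample values chosen for illustration only.
const isAbbreviation = createAbbreviationTest(['fig', 'dept', 'govt'])

console.log(isAbbreviation('Fig')) // true
console.log(isAbbreviation('figure')) // false
```

Anchoring the expression with `^` and `$` keeps longer words such as "figure" from matching, which is probably what a caller passing plain strings would expect.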

I'm not sure if the juice is worth the squeeze for a feature like this (especially since a solution exists). But it may be worthwhile to expose the mergePrefixExceptions function and to document this particular use case for others in the future.

gorango Oct 19 '21 16:10

Hey there!

  • What are the exceptions?
  • If the goal is to support different languages, are exceptions enough? Would more extensions be needed and possible? Couldn’t a parse-french be made?

wooorm Oct 21 '21 11:10

For example, academic papers might use "Fig." when referring to tables, charts, etc. Corporate content often contains abbreviations like "dept." or "govt.". And other edge cases can come up, which might not be part of a formal spec.
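As a rough illustration of what a custom exception plugin would have to do, the sketch below merges a sentence into the previous one when the previous one ends in a listed abbreviation followed by a period. It is a simplified stand-in, not parse-latin's code: sentences are plain arrays of string tokens instead of nlcst nodes, each sentence is assumed to keep its leading white space, and `mergeCustomPrefixExceptions` is a hypothetical name.

```javascript
// Assumption: a paragraph is an array of sentences, and a sentence is an
// array of string tokens. Real nlcst nodes also carry types, children, and
// positions, all of which this sketch ignores.
const exceptions = new Set(['fig', 'dept', 'govt'])

function mergeCustomPrefixExceptions(sentences) {
  const result = []

  for (const sentence of sentences) {
    const previous = result[result.length - 1]
    const period = previous && previous[previous.length - 1]
    const word = previous && previous[previous.length - 2]

    if (
      period === '.' &&
      typeof word === 'string' &&
      exceptions.has(word.toLowerCase())
    ) {
      // The previous sentence ended on a listed abbreviation, so treat the
      // break as a false positive and fold this sentence into it.
      previous.push(...sentence)
    } else {
      // Copy the tokens so the input paragraph is left untouched.
      result.push([...sentence])
    }
  }

  return result
}

const paragraph = [
  ['See', ' ', 'Fig', '.'],
  [' ', 'A', ' ', 'for', ' ', 'details', '.'],
  [' ', 'It', ' ', 'helps', '.']
]

const merged = mergeCustomPrefixExceptions(paragraph)

console.log(merged.length) // 2
console.log(merged.map((d) => d.join('')))
// [ 'See Fig. A for details.', ' It helps.' ]
```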

I intended to add some of these exceptions and tests in a PR in parse-english but I quickly found a solution using plugins. I also wasn't sure if "fig" should be handled in parse-latin or -english... but I'd be happy to make the PR(s) if you think it's worthwhile.

Ultimately, I agree that language-specific parsers would be the most semantic way to handle i18n - since exceptions are only a part of the equation (I may have overestimated their significance).

gorango Oct 21 '21 15:10

Thanks for your patience. I am open to such an API, preferably, as discussed, a clean, shared API that works across the different projects. I am closing this though, as I think it’s a nice-to-have that I personally am not currently interested in working on! But let me know if you (or someone else?) are interested in working on this!

wooorm Nov 20 '22 19:11