Tom Morris
~~It's probably worth replacing the Guava com.google.common.base.CharMatcher calls used for whitespace stripping as well to minimize external dependencies.~~ Actually, that's a bad idea because Java's definition of "whitespace" doesn't match...
I agree with @ostephens's analysis. I had forgotten that Java's definition of whitespace doesn't match the Unicode Consortium's definition, making `strip()` unsuitable for OpenRefine's use. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#trim() https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#strip() https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#isWhitespace(int)
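To illustrate the mismatch, here is a minimal sketch using only the JDK (Java 11+). The class name and the `unicodeStrip` helper are hypothetical and are not OpenRefine code; the regex relies on the `White_Space` binary property that `java.util.regex.Pattern` supports. It uses NO-BREAK SPACE (U+00A0), which Unicode treats as whitespace but `Character.isWhitespace` explicitly excludes.

```java
import java.util.regex.Pattern;

public class WhitespaceCheck {
    // Leading or trailing runs of code points that carry the Unicode White_Space property.
    private static final Pattern UNICODE_EDGES =
        Pattern.compile("\\A\\p{IsWhite_Space}+|\\p{IsWhite_Space}+\\z");

    // Hypothetical helper: strip both ends using the Unicode definition of whitespace.
    static String unicodeStrip(String s) {
        return UNICODE_EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String padded = "\u00A0value\u00A0"; // padded with NO-BREAK SPACE on both sides

        // Character.isWhitespace excludes the non-breaking spaces.
        System.out.println(Character.isWhitespace(0x00A0)); // false

        // strip() is defined in terms of Character.isWhitespace, so the padding survives.
        System.out.println(padded.strip().length());        // 7

        // The Unicode-property version removes it.
        System.out.println(unicodeStrip(padded).length());  // 5
    }
}
```

`trim()` fares no better on this input, since it only removes code points up to U+0020.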
Tokenizers for search and NLP tend to produce results very different from what a data wrangler would want if they were trying to preserve the original form. The current state...
Here's an example from the Stanford NLP page:
> $ cat >sample.txt
> "Oh, no," she's saying, "our $400 blender can't handle something this hard!"
> $ java edu.stanford.nlp.process.PTBTokenizer sample.txt...
This is largely a duplicate of #557 (first opened in 2012), though it covers a superset of that issue.
Folks seem to have settled on UAX29 as the preferred tokenization algorithm for both NLP & search, which makes sense since the Unicode Consortium should know what they're doing in...
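For a feel of what boundary-based word segmentation looks like, here is a rough sketch using the JDK's `java.text.BreakIterator`. It is only a stand-in: how closely the JDK's rules track UAX29 is an assumption here, and ICU4J's `BreakIterator` would be the closer fit for actual UAX29 behavior. The class and the `words` helper are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {
    // Split text at word boundaries and keep only segments that start with a letter or digit,
    // dropping the punctuation and whitespace segments the iterator also reports.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world")); // [Hello, world]
    }
}
```

Unlike a tokenizer tuned for NLP, the iterator reports every boundary, so the original text can be rebuilt by concatenating all segments instead of only the filtered ones.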
Good spelunking! It looks like the regression wasn't caught during the code review, but seems easy to fix.
Why are developers still being allowed to introduce code which isn't internationalized? Wouldn't it be a lot simpler to make it part of the coding standards that everything must be...
@ralvessa We already had an "Add 'Finished-Reading' date" on the enhancement list, but I've extended it to include started reading as well. I suspect we might have this date stored...
I have several reservations about this:
* The Kaggle datasets are already indexed
* These "data sets" are web scrapes, not novel data
* With a CC-NC license they're arguably...