cogcomp-nlp icon indicating copy to clipboard operation
cogcomp-nlp copied to clipboard

String.join includes an extra copy of separator at the end of the string.

Open mayhewsw opened this issue 7 years ago • 3 comments

https://github.com/CogComp/cogcomp-nlp/blob/ce0f3a03264d293cd751d7a416bbadd858066496/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringUtils.java#L60

There is a .trim() call, but this only trims whitespace. If the separator is non-whitespace (e.g. underscore, dash, period, bar, etc.), then trim doesn't work.

Also, Apache Commons has a StringUtils that works very nicely. Perhaps we want to remove this in favor of that.

mayhewsw avatar Dec 13 '17 20:12 mayhewsw

I agree that we should replace cogcomp's stringutils with the apache commons library. Probably a fair number of uses throughout the package, though.

mssammon avatar Sep 06 '18 03:09 mssammon

Currently there are some feature generators in Edison that calls the StringUtils.join() to generate feature, which includes an extra copy of dash in the feature name. Example below. https://github.com/CogComp/cogcomp-nlp/blob/7d9dad3fedc16ac59feb278815e27dc195d1367e/edison/src/main/java/edu/illinois/cs/cogcomp/edison/features/helpers/FeatureNGramUtility.java#L55

Switching to Apache Commons will change the feature names, and so break models that relies on Edison (PrepSRL, I think). What's the best thing to do here? @mssammon

schen149 avatar Sep 10 '18 20:09 schen149

probably, requires retraining. Create a branch "requires_retrain" and make the change on this branch. Once implemented, create a new issue for prepsrl retraining and refer to it in the branch notes. presumably, it is sufficient to retrain prepsrl models and deploy, then update the dependencies on the branch; at that point, it can be merged. (You are not obligated to take that new prepsrl issue, but can if you want.)

cogcomp-dev avatar Sep 10 '18 20:09 cogcomp-dev