
Add hashtag segmentation with hashformers

Open · ruanchaves opened this pull request 3 years ago · 9 comments

Closes #23.

Usage:

from pysentimiento.preprocessing import preprocess_tweet
from pysentimiento.segmenter import create_segmenter

# Handles hashtags
segmenter = create_segmenter(lang="es", batch_size=1000)
preprocess_tweet("esto es #UnaGenialidad", segmenter=segmenter)
# "esto es una genialidad"

create_segmenter(lang="en") and calling a GPT-2 model directly (e.g. create_segmenter(model_name="gpt2-large")) are also implemented. Calling preprocess_tweet without a segmenter will fall back to the default camel case segmenter.
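For reference, a camel case segmenter of the kind mentioned above can be sketched in a few lines. This is only an illustration of the idea, not pysentimiento's actual implementation:

```python
import re

def camel_case_segment(hashtag: str) -> str:
    """Split a camel-cased hashtag into lowercase words (sketch only)."""
    # Match capitalized words, all-caps runs, or digit runs; the accented
    # uppercase/lowercase letters cover common Spanish characters.
    words = re.findall(
        r"[A-ZÁÉÍÓÚÑÜ]?[a-záéíóúñü]+|[A-ZÁÉÍÓÚÑÜ]+(?![a-z])|\d+", hashtag
    )
    return " ".join(w.lower() for w in words)

print(camel_case_segment("UnaGenialidad"))  # una genialidad
```

This is cheap (a single regex pass) but only works when the hashtag is actually camel-cased; an all-lowercase hashtag like #unagenialidad passes through unsplit.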

I have also modified preprocess_tweet to handle both strings and lists of strings.
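The string-or-list dispatch described above can be sketched as follows. The function name and the per-string step are placeholders, not the real preprocess_tweet internals:

```python
from typing import List, Union

def preprocess(text: Union[str, List[str]]) -> Union[str, List[str]]:
    """Accept a single string or a list of strings (sketch only)."""
    if isinstance(text, list):
        # Lists are handled element-wise, preserving order.
        return [preprocess(t) for t in text]
    # Placeholder for the real per-tweet preprocessing pipeline.
    return text.strip().lower()

print(preprocess("Hola Mundo"))         # hola mundo
print(preprocess(["Hola", " Mundo "]))  # ['hola', 'mundo']
```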

P.S.: If you are going to evaluate this segmenter on downstream tasks, make sure you also test create_segmenter(lang="en") on Spanish text. This returns a distilgpt2 model, which has achieved good results at segmenting hashtags in other languages. Model size doesn't seem to matter much (distilgpt2 will usually give similar or even better results than gpt2 or gpt2-large).

ruanchaves avatar Mar 11 '22 06:03 ruanchaves

Great work! Give me a couple of days and I'll check this

finiteautomata avatar Mar 14 '22 11:03 finiteautomata

First of all, sorry @ruanchaves for the long delay with this.

I'm adding hashformers to the poetry project and I noticed that a dependency does not support python 3.8.

Bearing in mind that some libraries have deprecated 3.7, but that Google Colab is still running that version, is there any possibility of supporting both 3.7 and 3.8?

finiteautomata avatar Apr 07 '22 19:04 finiteautomata

Hi @finiteautomata, I have just added support for Python 3.8. You should have no problem installing the latest version of hashformers in a clean environment with Python 3.8.

I have just tested installing the latest version with conda and it works fine.

conda create -n py38env python=3.8
conda activate py38env
python -m pip install -U pip wheel setuptools
python -m pip install --no-cache-dir hashformers==1.2.8

Please let me know if you run into any more bugs.

ruanchaves avatar Apr 08 '22 12:04 ruanchaves

Sorry for bringing no news about this: I've been overwhelmed by the end of my PhD.

I think I have to pause this for a month at least, but my intention is definitely to add it to the next version of the library (and hopefully to the paper too).

finiteautomata avatar May 06 '22 13:05 finiteautomata

No problem, you can take your time. The library is quite stable now, and I don't plan on making changes to the interface in the short term.

ruanchaves avatar May 06 '22 13:05 ruanchaves

Hi @ruanchaves, slowly getting back to this. I've created a very simple performance check in this notebook:

https://colab.research.google.com/drive/1u5KMJXysJOjWeTv4Pi5AMfjvoMbHcvsk#scrollTo=F4jwdltzAl7o

Analyzing a single sentence (for Sentiment Analysis) with pysentimiento takes ~75ms on CPU while splitting the hashtag alone takes about 22s – that is, analyzing a tweet with a hashtag would imply a 300x slowdown.
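As a quick sanity check on that figure (75 ms and 22 s are the numbers quoted above):

```python
# Rough slowdown estimate from the measurements in this comment.
sentiment_ms = 75          # ~75 ms per sentence on CPU
segmentation_ms = 22_000   # ~22 s to split a single hashtag
slowdown = segmentation_ms / sentiment_ms
print(round(slowdown))     # ~293, i.e. roughly a 300x slowdown
```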

While I can see the benefits of a state-of-the-art segmentation, I think that pysentimiento should be able to run decently on CPU. Having this in mind, I think it is not a good idea to add another model for hashtag segmentation.

However, I've seen that your library features some other segmentation algorithms that could be nearly as fast as the current implementation while still giving a performance boost (this one?). Do you have any example of how to use the other algorithms?

finiteautomata avatar Jul 14 '22 23:07 finiteautomata

@finiteautomata If you want something that runs fast on CPU, just use FastWordSegmenter instead of TransformerWordSegmenter.

from hashformers import FastWordSegmenter as WordSegmenter

ws = WordSegmenter(
    unigram_lang="es",
    reranker_model_name_or_path=None
)

Both classes have the same interface: hashtags can be segmented by calling ws.segment(hashtags).
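To make the shared interface concrete, here is a toy frequency-based segmenter that mimics the .segment(hashtags) call shape. The internals (dynamic programming over a made-up unigram table) are purely illustrative and are not the real FastWordSegmenter:

```python
import math
from functools import lru_cache

# Made-up unigram probabilities, for illustration only.
UNIGRAMS = {"una": 0.04, "genialidad": 0.001, "esto": 0.03, "es": 0.05}

class ToyUnigramSegmenter:
    """Toy segmenter exposing the same .segment(list) call shape
    as the hashformers segmenters (internals are a sketch)."""

    def __init__(self, unigrams):
        self.unigrams = unigrams

    def segment(self, hashtags):
        return [self._split(h.lower()) for h in hashtags]

    def _split(self, text):
        @lru_cache(maxsize=None)
        def best(s):
            # Return (log-probability, word list) of the best split of s,
            # scoring each candidate word by its unigram probability.
            if not s:
                return (0.0, [])
            options = []
            for i in range(1, len(s) + 1):
                word = s[:i]
                logp = math.log(self.unigrams.get(word, 1e-12))
                rest_logp, rest = best(s[i:])
                options.append((logp + rest_logp, [word] + rest))
            return max(options, key=lambda opt: opt[0])
        return " ".join(best(text)[1])

ws = ToyUnigramSegmenter(UNIGRAMS)
print(ws.segment(["UnaGenialidad"]))  # ['una genialidad']
```

Because both classes share this interface, swapping TransformerWordSegmenter for FastWordSegmenter (or vice versa) only changes the constructor call, not the downstream code.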

I'm going to modify this PR to use FastWordSegmenter instead of TransformerWordSegmenter.

ruanchaves avatar Jul 18 '22 19:07 ruanchaves

Great! Let me check it in these days.

I have to retrain the models to check that there is no performance loss, and add some other integration tests.

finiteautomata avatar Jul 18 '22 20:07 finiteautomata

I'm trying to run the example you put in the README, but it seems it is not splitting the hashtag:

from pysentimiento.preprocessing import preprocess_tweet
from pysentimiento.segmenter import create_segmenter

segmenter = create_segmenter(lang="es")

preprocess_tweet("esto es #UnaGenialidad", segmenter=segmenter)
>> "esto es unagenialidad"

What should I do to fix this?

finiteautomata avatar Aug 04 '22 20:08 finiteautomata