Artifacts from cleaning?

Open gabrielStanovsky opened this issue 7 years ago • 5 comments

Some sentences look very odd. For example:

Boy Scouts ' ` perversion ' files set to be released travel

@rachelvov, What's the original tweet? @kleinay, do you know the tweet id?

gabrielStanovsky avatar Aug 15 '17 10:08 gabrielStanovsky

This is the original tweet: 258960355655548930 20007 Boy Scouts' 'perversion' files set to be released #travel http://t.co/RrOpIc2Y

It seems that the cleaning process removed the #.

On Tue, Aug 15, 2017 at 1:19 PM, Gabriel Stanovsky wrote:

> Assigned #18 (https://github.com/vered1986/OKR/issues/18) to @rachelvov.

rachelvov avatar Aug 15 '17 10:08 rachelvov

@OriShapira, maybe we can remove hashtags altogether.

gabrielStanovsky avatar Aug 15 '17 10:08 gabrielStanovsky

I think the reason I removed only the "#" and not the word after it is that a hashtag sometimes appears in the middle of the tweet as a content word, like here:

#Baghdad: 25 killed as car bomb targets #Iraq army recruits

It's worth checking how often this happens.
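One rough way to check how often this happens is sketched below. The regexes and the helper names `hashtag_positions` and `count_positions` are assumptions made for illustration, not the project's actual cleaning code. A hashtag counts as "trailing" when only other hashtags, URLs, or whitespace follow it, and as "inline" otherwise.

```python
import re
from collections import Counter

# Assumed patterns: a hashtag is '#' plus word characters at the start of a
# whitespace-delimited token; a URL is anything starting with http(s)://.
HASHTAG = re.compile(r"(?<!\S)#\w+")
URL = re.compile(r"https?://\S+")

def hashtag_positions(tweet):
    """Return (hashtag, 'trailing' | 'inline') pairs for one raw tweet."""
    result = []
    for match in HASHTAG.finditer(tweet):
        # Look at what follows the hashtag, ignoring URLs and other hashtags.
        rest = URL.sub("", tweet[match.end():])
        rest = HASHTAG.sub("", rest)
        kind = "inline" if rest.strip() else "trailing"
        result.append((match.group(), kind))
    return result

def count_positions(tweets):
    """Count trailing vs. inline hashtags over a collection of raw tweets."""
    return Counter(kind for t in tweets for _, kind in hashtag_positions(t))
```

Running `count_positions` over the raw tweets (before any cleaning) would give a first estimate of how many hashtags carry content inside the sentence.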

rachelvov avatar Aug 15 '17 10:08 rachelvov

Maybe we could remove only the hashtags that appear at the end of tweets?
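A minimal sketch of that idea, assuming URLs have already been stripped by an earlier cleaning step. The function name `clean_hashtags` and both patterns are hypothetical, not the existing pre-processing code. Hashtags at the very end of the tweet are dropped entirely, while hashtags inside the sentence lose only the "#".

```python
import re

# Assumed pattern: one or more hashtags (with surrounding whitespace)
# running up to the end of the tweet.
TRAILING_HASHTAGS = re.compile(r"(?:\s*#\w+)+\s*$")
INLINE_HASHTAG = re.compile(r"#(\w+)")

def clean_hashtags(tweet):
    """Drop trailing hashtags; keep inline ones as plain words."""
    # Step 1: remove the run of hashtags that closes the tweet, if any.
    tweet = TRAILING_HASHTAGS.sub("", tweet)
    # Step 2: for the remaining hashtags, strip only the '#' character.
    return INLINE_HASHTAG.sub(r"\1", tweet)
```

With this, the #travel example would end at "released", and the #Baghdad example would keep both "Baghdad" and "Iraq" as ordinary words.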

gabrielStanovsky avatar Aug 15 '17 10:08 gabrielStanovsky

Who owns the pre-processing step (cleaning the tweets)? @OriShapira? Perhaps we should integrate the pre-processing code into the project.

kleinay avatar Oct 01 '17 08:10 kleinay