
Analysing big text reviews is too slow

Open sravan7 opened this issue 8 years ago • 3 comments

Can someone help me with this?

`tr` is the CSV data with 38932 lines.

I am trying to analyze sentiment in those reviews, but it's taking almost 38 hours on an i7 with 6 GB RAM (HP Envy).

I even tried splitting the data and threading, but that didn't work either.

```python
import re
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

c = []
print('under the training')
v = 0

for i in tr:
    temp = re.sub(r'[^\w\s.]', '', i)  # this is for cleaning unwanted chars

    # I am facing the problem here
    hol = TextBlob(temp, analyzer=NaiveBayesAnalyzer())

    t = [temp, hol.sentiment.classification]
    c.append(t)
    v = v + 1
    print(v)
```

@sloria @textblob

sravan7 · Nov 08 '17

TextBlob is training the classifier on each iteration of the loop. Try creating an instance of Blobber once (which invokes training), and then run it against each string of text in your loop.

Like this:

```python
from textblob.sentiments import NaiveBayesAnalyzer
from textblob import Blobber
import re

with open('reviews.csv', 'r') as f:
    data = f.readlines()

# The analyzer is shared across all blobs created by this Blobber,
# so the training cost is paid only once.
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

for row in data:
    text = re.sub(r'[^\w\s.]', '', row)

    blob = blobber(text)
    result = [text, blob.sentiment.classification]
    print(result)
```
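The point of Blobber is that it shares a single analyzer instance across every TextBlob it creates, so the NaiveBayesAnalyzer (which trains itself on NLTK's movie_reviews corpus the first time it is used) only incurs the training cost once, rather than once per review.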

Similar answer on Stack Overflow.

jschnurr · Nov 24 '17

Now the problem is with `cl = NaiveBayesClassifier(train)`. I tried it on 10,000 lines and it didn't work; my laptop hung. It worked smoothly for 1,000 lines. Is there any other method? @jschnurr @sloria

sravan7 · Nov 26 '17

Here are my timeit results for training sets of various sizes, using nltk.corpus.movie_reviews data and the nltk.classify.NaiveBayesClassifier.train(data) method:

- 2000 lines: 2.27 s
- 4000 lines: 3.63 s
- 10000 lines: 7.66 s
- 20000 lines: 14.59 s
- 40000 lines: 28.35 s

It seems to handle more lines gracefully. Perhaps there is a problem with your input data?
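For anyone who wants to reproduce the numbers above, here is a minimal sketch of such a benchmark. The bag-of-words feature extractor and the use of one labelled featureset per sentence are assumptions; the comment above only names the corpus and the `train()` method:

```python
import timeit

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews  # requires nltk.download('movie_reviews')

def bag_of_words(words):
    # Assumed feature extractor: simple word-presence features.
    return {w: True for w in words}

# One labelled featureset per sentence, labelled with its file's
# category (pos/neg); the corpus yields tens of thousands of sentences.
labelled = [(bag_of_words(sent), cat)
            for cat in movie_reviews.categories()
            for fid in movie_reviews.fileids(cat)
            for sent in movie_reviews.sents(fid)]

for n in (2000, 4000, 10000, 20000, 40000):
    subset = labelled[:n]
    secs = timeit.timeit(lambda: NaiveBayesClassifier.train(subset), number=1)
    print('%d: %.2f s' % (n, secs))
```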

jschnurr · Jan 20 '18