Analysing a large set of text reviews is too slow
Can someone help me with this?
`tr` is the CSV data with 38,932 lines.
I am trying to analyze the sentiment of those reviews, but it is taking almost 38 hours on an HP Envy with an i7 CPU and 6 GB of RAM.
I even tried splitting the data and threading, but that didn't work either.
```python
import re
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

c = []
print('under the training')
v = 0
for i in tr:
    temp = re.sub(r'[^\w\s.]', '', i)  # strip unwanted characters
    # I am facing the problem here
    hol = TextBlob(temp, analyzer=NaiveBayesAnalyzer())
    t = [temp, hol.sentiment.classification]
    c.append(t)
    v = v + 1
    print(v)
```
@sloria @textblob
TextBlob is training the classifier on each iteration of the loop. Try creating an instance of Blobber once (which invokes training), and then run it against each string of text in your loop.
Like this:
```python
from textblob.sentiments import NaiveBayesAnalyzer
from textblob import Blobber
import re

data = open('reviews.csv', 'r').readlines()
blobber = Blobber(analyzer=NaiveBayesAnalyzer())  # training happens once, here

for row in data:
    text = re.sub(r'[^\w\s.]', '', row)  # same cleanup as in your loop
    blob = blobber(text)
    result = [text, blob.sentiment.classification]
    print(result)
```
There is a similar answer on Stack Overflow.
Now I have a problem with `cl = NaiveBayesClassifier(train)`. I tried it with 10,000 lines and it didn't work; my laptop hung. It worked smoothly for 1,000 lines. Is there any other method? @jschnurr @sloria
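For context, this is roughly the call being described: a minimal sketch assuming `train` is a list of `(text, label)` tuples, with made-up sample rows rather than the actual CSV data.

```python
from textblob.classifiers import NaiveBayesClassifier

# Hypothetical training rows; in the real case each tuple would come
# from a line of the 38,932-row CSV plus its sentiment label.
train = [
    ("The product arrived quickly and works great", "pos"),
    ("Terrible quality, it broke after one day", "neg"),
]

cl = NaiveBayesClassifier(train)  # all of the training cost is paid in this call
print(cl.classify("I really enjoyed using it"))  # e.g. 'pos'
```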
Here are my timeit results for training sets of various sizes, using nltk.corpus.movie_reviews data and the nltk.classify.NaiveBayesClassifier.train(data) method:
| Training examples | Time (s) |
| --- | --- |
| 2000 | 2.27268099785 |
| 4000 | 3.63234615326 |
| 10000 | 7.66381907463 |
| 20000 | 14.5906889439 |
| 40000 | 28.350331068 |
Training time seems to scale gracefully with the number of lines. Perhaps there is a problem with your input data?
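If anyone wants to reproduce this kind of measurement, the sketch below shows one way to time `NaiveBayesClassifier.train` with `timeit` over the `movie_reviews` corpus. The bag-of-words feature extraction and the subset sizes are my own assumptions, not the exact setup behind the numbers above, and the corpus (2,000 review documents) must first be fetched with `nltk.download('movie_reviews')`.

```python
import timeit

from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Simple bag-of-words feature set: every token maps to True.
    return {w: True for w in words}

# One labeled feature set per review document in the corpus.
labeled = [
    (word_feats(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

# Time a single training run for a few training-set sizes.
for size in (500, 1000, 2000):
    subset = labeled[:size]
    elapsed = timeit.timeit(lambda: NaiveBayesClassifier.train(subset), number=1)
    print(size, elapsed)
```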