
Not "production" ready, plus a bug.

Open alex-pavlides-adarga opened this issue 5 years ago • 0 comments

Thanks for your code. I think it's a good start, but a lot of tidying up is needed to make it "production" ready. I can cut out a huge amount of code without affecting the example run, and I've seen TODOs that aren't clear (e.g. FIXME len(s) > 1: SUCH A SHAME!!!) as well as other leftover comments (e.g. #NOT USING THIS NOW: THIS is for IGRAPH). There is also a bug, which is what this issue is really about: the stop words are not preprocessed in the same manner as the documents, which throws a warning when running the code. I fixed it by adding a preprocessing step for the stop words before passing them to StemmedTfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Run the stop words through the same preprocessing and tokenization
# that the vectorizer applies to the documents.
tf = TfidfVectorizer()
preprocess = tf.build_preprocessor()
tokenize = tf.build_tokenizer()

preprocessed_stop_words = []
for w in stopwords:
    # A single stop word can yield zero, one, or several tokens.
    preprocessed_stop_words.extend(tokenize(preprocess(w)))

# stopwords, docs and StemmedTfidfVectorizer come from the existing code.
bow_matrix = StemmedTfidfVectorizer(stop_words=preprocessed_stop_words).fit_transform(docs)
```
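To show why the warning appears, here is a minimal standard-library sketch of the mismatch. It assumes the vectorizer keeps scikit-learn's default behaviour (lowercasing, plus the token pattern `(?u)\b\w\w+\b`) and ignores the stemming step; the helper names `normalize_stop_words` and `inconsistent_tokens` are made up for illustration and are not part of the repo or of scikit-learn.

```python
import re

# Assumption: this mirrors scikit-learn's default token_pattern and lowercasing.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def preprocess(text):
    """Lowercase the text, as the default preprocessor does."""
    return text.lower()


def tokenize(text):
    """Split into tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text)


def normalize_stop_words(stop_words):
    """Hypothetical helper: preprocess and tokenize each stop word,
    keeping first-seen order and dropping duplicates."""
    normalized = []
    seen = set()
    for word in stop_words:
        for token in tokenize(preprocess(word)):
            if token not in seen:
                seen.add(token)
                normalized.append(token)
    return normalized


def inconsistent_tokens(stop_words):
    """Hypothetical helper: tokens produced from the stop words
    that are not themselves in the stop word list."""
    known = set(stop_words)
    produced = {t for w in stop_words for t in tokenize(preprocess(w))}
    return sorted(produced - known)


raw = ["The", "don't", "a", "and"]
print(inconsistent_tokens(raw))                        # tokens the raw list would miss
print(inconsistent_tokens(normalize_stop_words(raw)))  # empty once normalized
```

With the raw list, "The" becomes "the" and "don't" becomes "don", neither of which is in the list, so those tokens would never be filtered out of the documents. After normalization the list is consistent with the tokenizer.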

As this code looks almost ready, it would be a shame not to polish it up. Also, why not include the sentence clustering part, since it's a major aspect of your paper? It would be great if the results of your paper could be reproduced with ease, or if the dataset could be swapped more easily for a different one, e.g. multi_news (which is something I'm working on).

alex-pavlides-adarga, Feb 18 '20 13:02