sumgram
Fix `InvalidParameterError` on `CountVectorizer`
This looks like a really interesting project! I was trying to play around with it, starting with the examples in the README.md,
and kept running into an `InvalidParameterError`.
The example I was trying:
import json
from sumgram.sumgram import get_top_sumgrams
doc_lst = [
    {'id': 0, 'text': 'The eye of Category 4 Hurricane Harvey is now over Aransas Bay. A station at Aransas Pass run by the Texas Coastal Observing Network recently reported a sustained wind of 102 mph with a gust to 132 mph. A station at Aransas Wildlife Refuge run by the Texas Coastal Observing Network recently reported a sustained wind of 75 mph with a gust to 99 mph. A station at Rockport reported a pressure of 945 mb on the western side of the eye.'},
    {'id': 1, 'text': 'Eye of Category 4 Hurricane Harvey is almost onshore. A station at Aransas Pass run by the Texas Coastal Observing Network recently reported a sustained wind of 102 mph with a gust to 120 mph.'},
    {'id': 2, 'text': 'Hurricane Harvey has become a Category 4 storm with maximum sustained winds of 130 mph. Sustained hurricane-force winds are spreading onto the middle Texas coast.'}
]
'''
Use 'add_stopwords' to include list of additional stopwords not included in stopwords list (https://github.com/oduwsdl/sumgram/blob/0224fc9d54034a25e296dd1c43c09c76244fc3c2/sumgram/util.py#L31)
'''
params = {
    'top_sumgram_count': 10,
    'add_stopwords': ['image'],
    'no_rank_sentences': True,
    'title': 'Top sumgrams for Hurricane Harvey text collection'
}
ngram = 2
sumgrams = get_top_sumgrams(doc_lst, ngram, params=params)
with open('sumgrams.json', 'w') as outfile:
    json.dump(sumgrams, outfile, indent=2)
I think `CountVectorizer` requires a string, a list, or `None`, and you were supplying a set. I just cast it to a list. Not sure if this is a real issue (I didn't see it in any current Issues) or something I messed up on my part, but I thought I'd submit it in case it could help.
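For reference, the cast can be sketched roughly as below. This assumes the argument in question is the vectorizer's `stop_words` (the description above does not name it), and `as_stop_word_list` is a hypothetical helper written for illustration, not a function in sumgram.

```python
def as_stop_word_list(stop_words):
    """Coerce stop words into a string, a list, or None (hypothetical helper)."""
    # None and plain strings (for example 'english') pass through unchanged,
    # as do values that are already lists.
    if stop_words is None or isinstance(stop_words, (str, list)):
        return stop_words
    # Sets, frozensets, tuples and other iterables are cast to a list;
    # sorting keeps the order deterministic from run to run.
    return sorted(stop_words)


# A set of stop words, like the one that appears to trigger the error.
stop_words = as_stop_word_list({'image', 'the', 'of'})
print(stop_words)
```

The resulting list could then be handed to the vectorizer, e.g. `CountVectorizer(stop_words=stop_words)`, if that is indeed the parameter being rejected.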