rake-nltk
Domain names treated as sentences
If the text contains a domain name like www.google.com then the parts of that name are extracted as words, e.g. the word "com".
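For reference, the wordpunct-style tokenization rake-nltk uses splits on the dots, so each label of the domain becomes its own word. A minimal standard-library sketch of that behaviour (the regex below mimics the `wordpunct_tokenize` pattern for illustration; it is not the library's actual code path):

```python
import re

def wordpunct_style_tokenize(text):
    # wordpunct_tokenize is essentially a regexp tokenizer: runs of word
    # characters, or runs of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct_style_tokenize("Search on www.google.com for details")
# The domain is split apart, so "com" shows up as a standalone word.
```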
Hi, while looking into this issue (and real-world text in general, which is often cluttered with punctuation marks), I tried various tokenizers and was most satisfied with how nltk's TweetTokenizer works. I implemented it as follows:
```python
from nltk.tokenize import TweetTokenizer, sent_tokenize

tokenizer_words = TweetTokenizer()

def _generate_phrases(self, sentences):
    phrase_list = set()
    for sentence in sentences:
        word_list = [word.lower() for word in tokenizer_words.tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list
```
Not only does this keep www.google.com intact as a single token, it also preserves meaningful marks such as #hashtag, @person, etc.
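For comparison, a small standalone check (assuming nltk is installed; TweetTokenizer is purely regex-based, so no corpus downloads are needed):

```python
from nltk.tokenize import TweetTokenizer

tokenizer_words = TweetTokenizer()
tokens = tokenizer_words.tokenize("Ask @person to search www.google.com for #rake")
# The casual-tokenizer patterns keep the domain, the @mention and the
# #hashtag each as a single token instead of splitting on punctuation.
```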
@nsehwan: I am open to any extension to the package as long as the following are met:
- It is a problem for the vast majority.
- The solution to the problem can be made generic enough.
Even though it meets the first requirement, I think we should first turn your solution into a generic one, so that it can be used by everyone, before implementing it.
Thanks @csurfer for the information; working on your suggestions.
Sorry for my absence! After trying various tokenizers, I thought it better to build a sanitizer/tokenizer based on your suggestions. And it really did turn out better that way, i.e. more general.
get_sanitized_word_list is a function that takes as input an individual sentence (as segregated by sent_tokenize) and returns a list of words similar to what wordpunct_tokenize(sentence) returned previously, but sanitized better.
```python
import string

def get_sanitized_word_list(data):
    result = []
    word = ''
    for char in data:
        if char not in string.whitespace:
            # Characters that may appear within or at the start/end of words.
            if char not in string.ascii_letters + string.digits + "'.~`^:<>/-_%&@*#$":
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word = ''.join([word, char])
        else:
            if word:
                result.append(word)
                word = ''
    if word != '':
        result.append(word)
    return result
```
It works on the general cases I have tried so far, and yes, it performs better than TweetTokenizer as well. Please let me know what you think about this.
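To make the behaviour concrete, here is a runnable usage sketch (the function body is repeated so the snippet is self-contained; `string.digits` is assumed to stand in for the digit list, covering 0 as well):

```python
import string

def get_sanitized_word_list(data):
    result = []
    word = ''
    for char in data:
        if char not in string.whitespace:
            # Characters that may appear within or at the start/end of words.
            if char not in string.ascii_letters + string.digits + "'.~`^:<>/-_%&@*#$":
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word = ''.join([word, char])
        else:
            if word:
                result.append(word)
                word = ''
    if word != '':
        result.append(word)
    return result

tokens = get_sanitized_word_list("Ask @person about www.google.com!")
# The domain and the @mention survive as single words; the trailing "!"
# is emitted as its own token because it is not in the allowed set.
```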