rake-nltk

Domain names treated as sentences

Open quantoid opened this issue 7 years ago • 4 comments

If the text contains a domain name like www.google.com, then the parts of that name are extracted as words, e.g. the word "com".
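
For context, here is an illustrative sketch (not part of the original report) of roughly how the behaviour shows up with rake-nltk's default tokenization; the exact phrases returned may differ by version:

```python
# Illustrative sketch of the reported behaviour (assumes rake_nltk and the
# required NLTK data are installed); exact output may differ by version.
from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text("See www.google.com for more details.")
print(r.get_ranked_phrases())
# The default wordpunct_tokenize splits the URL on the dots, so fragments
# such as "google" and "com" can surface as separate keywords.
```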

quantoid · Jul 09 '18 05:07

Hi, for this issue, and also for real-world language, which is often cluttered with numerous punctuation marks, I tried various tokenizers and was satisfied with the way nltk's TweetTokenizer works. I implemented it as follows:

```python
from nltk.tokenize import TweetTokenizer, sent_tokenize

tokenizer_words = TweetTokenizer()

def _generate_phrases(self, sentences):
    phrase_list = set()
    for sentence in sentences:
        word_list = [word.lower() for word in tokenizer_words.tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list
```

Not only does this extract www.google.com as is, it also preserves important marks such as #hashtag, @person, etc.
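
As a quick sanity check (a hypothetical snippet, not part of the original comment), TweetTokenizer keeps URLs, hashtags and mentions as single tokens:

```python
# Hypothetical check of TweetTokenizer behaviour; output may vary slightly
# across NLTK versions.
from nltk.tokenize import TweetTokenizer

print(TweetTokenizer().tokenize("Ask @person about www.google.com #rake"))
# ['Ask', '@person', 'about', 'www.google.com', '#rake']
```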

ghost · Aug 08 '18 01:08

@nsehwan: I am open to any extension to the package as long as the following are met:

  1. It is a problem for the vast majority.
  2. The solution to the problem can be made generic enough.

Even though it meets requirement (1), I think we should first turn your simple solution into a generic one that everyone can use before implementing it.

csurfer · Aug 09 '18 14:08

Thanks @csurfer for the information; I'm working on your suggestions.

ghost · Aug 14 '18 17:08

Sorry for my disappearance! After trying various tokenizers, I thought it better to build a sanitizer/tokenizer based on your suggestions, and it really did turn out better that way, i.e. more general.

get_sanitized_word_list is a function that takes an individual sentence (as produced by sent_tokenize) and returns a list of words, similar to what wordpunct_tokenize(sentence) returned previously, but better sanitized.

```python
import string

def get_sanitized_word_list(data):
    result = []
    word = ''

    for char in data:
        if char not in string.whitespace:
            # Characters that may appear within or at the start/end of a word.
            if char not in string.ascii_letters + "'.~`^:<>/-_%&@*#$123456789":
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word = ''.join([word, char])
        else:
            if word:
                result.append(word)
                word = ''

    if word != '':
        result.append(word)
        word = ''
    return result
```

It works on most of the general cases I have tried so far, and yes, better than TweetTokenizer as well. Please let me know what you think about this.
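
As an illustration (a hypothetical run, not from the thread), feeding a sentence with a URL and a mention into the function above should keep them intact while still splitting off other punctuation:

```python
# Hypothetical usage of the sketch above; exact output depends on the
# character whitelist in get_sanitized_word_list.
print(get_sanitized_word_list("Visit www.google.com, ask @person!"))
# ['Visit', 'www.google.com', ',', 'ask', '@person', '!']
```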

ghost · Nov 15 '18 19:11