Improve stopword removal in the similarity script

Open marco-c opened this issue 5 years ago • 6 comments

We should check what stopwords nltk is using, see if some of them are actually meaningful for us, and add new ones (e.g. "Firefox" could probably be considered as a stopword for us, since it's everywhere :P).

marco-c avatar Jul 16 '19 09:07 marco-c

@marco-c and @suhaibmujahid I'm picking this up, how do we come up with words that we think should be part of the stopwords or shouldn't?

Amotul-raheem avatar Oct 14 '22 01:10 Amotul-raheem

Feel free to research the way that you see suitable. The following are just suggestions:

  • For stopwords to remove from nltk, you could check if any of them has meaningful technical/domain use.
  • For words that we should add, you could check the most frequent and the least frequent words in our datasets. In addition, you could use reverted IDF if you want. If you think that a word is not meaningful, it could be a good candidate to be added as a stopword.

suhaibmujahid avatar Oct 14 '22 17:10 suhaibmujahid
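
A minimal sketch of the frequency-based suggestion above, using only the Python standard library. It computes a per-word IDF over a list of documents and ranks the words from most to least frequent. The function names, the tokenizer, and the smoothed formula `ln((1 + N) / (1 + df)) + 1` are assumptions for illustration; the thread does not say which formula or tooling was actually used.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def compute_idf(documents):
    """Return {term: idf} using a smoothed IDF (an assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def rank_by_idf(idf):
    """Terms sorted from lowest IDF (most widespread) to highest."""
    return sorted(idf, key=lambda term: (idf[term], term))


# Toy stand-in for the first comments of bug reports.
docs = [
    "Firefox crashes when opening a new tab",
    "Firefox layout is broken on this page",
    "Intermittent failure in Firefox test",
]
idf = compute_idf(docs)
ranked = rank_by_idf(idf)
```

Words at the start of `ranked` appear in almost every document and are candidates for new stopwords; words at the end appear in very few documents and are mostly noise such as hashes or timestamps.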

Hi @suhaibmujahid, after spending some time calculating the reverted IDF, I found some domain-specific words that can be added to the stopwords. Below are the top 200 words, which include both domain-specific and non-domain-specific words, with IDF ranging from 1.323716 (lowest) to 3.794723.

['js', 'profile', 'layout', 'don', 'try', '22', 'then', 'work', 'src', 'click', '17', 'remove', 'crash', 'enabled', 'hex', 'application', 'failed', 'base', 'ok', 'x86', 'does', 'make', 'same', '36', '2021', 'issue', 'here', '18', 'could', 'upstream', 'win64', 'finished', 'result', 'need', 'would', 'other', 'update', 'http', '13', 'messages', 'time', '20', 'found', '16', 'any', 'line', 'x64', 'thread', 'name', 'worker', 'wpt', '14', 'took', 'pull', 'warning', 'data', 'pr', 'you', 'tab', 'set', 'show', 'true', 'message', 'follow', 'do', 'linux', 'nt', 'complete', 'dom', 'more', 'builds', 'pass', 'url', 'sync', '64', 'also', '12', 'details', 'main', 'up', 'get', '15', 'number', '11', 'autoland', 'logviewer', 'run', 'using', 'window', 'closed', 'some', 'which', 'add', 'there', 'one', 'out', 'after', 'has', '20100101', 'page', 'process', 'chrome', 'fail', 'attachment', 'content', 'like', 'open', 'about', 'github', 'code', 'see', 'platform', 'so', 'into', 'all', 'use', 'was', 'tc', 'rv', 'only', 'will', 'unexpected', 'parsed', 'build', 'web', 'filed', 'new', 'ci', 'backing', 'start', 'error', 'created', 'queue', 'v1', 'artifacts', 'logs', 'live', 'services', 'runs', 'intermittent', 'public', 'an', 'full', 'windows', 'agent', 'job', 'info', 'api', 'no', 'task', 'browser', 'repo', 'log', 'treeherder', 'if', 'but', 'or', 'have', 'are', 'can', 'results', 'tests', 'user', 'actual', 'id', 'central', 'str', 'we', '10', 'as', 'test', 'gecko', 'expected', 'when', 'at', 'should', 'by', 'that', 'it', 'from', 'bug', 'be', 'not', 'with', 'reference', 'com', 'org', 'file', 'on', 'firefox', 'of', 'for', 'and', 'this', 'https', 'is', 'in', 'mozilla', 'to', 'the']

The following 200 words appear least frequently, with IDF ranging from 13.163092 (highest) to 12.469945.

['b', 'h', 't', '3', 'e', 'i', 'v', '5', 'u', 'l', 'z', 'k', 'g', 'o', 'w', 'y', '1', 'x', 'd', '0', '8', 'c', 'j', 's', 'q', '7', '9', '4', 'a', 'm', 'r', '6', 'f', 'n', 'p', '2', '1647451970689', 'disksizegb', '0y9sk1q1wfj5d06y0nea', 'disktype', 'disktypes', 'machinetype', 'networkinterfaces', 'cascadelake', 'accessconfigs', 'automaticrestart', 'onhostmaintenance', '7fd68c6f9600', 'devicemanagement', 'mincpuplatform', 'l317', 'accumulatesamples', '0df1eb2b6e985b623152060139f4ebb701dfd021', '371282010', 'tmpbym6cpgdpidlog', 'riu', '1647452067259', '1647452062779', '9268052', 'cabos', '7fd689faab00', 'palace', '9268056', 'tmpi4547b1x', 'microwaves', '1wmroxtrfgpv5vzjrbiwg4oxhdgbrihle', 'saples', '7fd689face00', 'hostsharedmemory', '27prompt', 'afteridleseconds', '9268080', '1647451990785', 'tmpvv3u0s0e', 'a6b5b3dfefb10fb23e5c9a9a1340582572b3ac6a', '390752', '0d20df2b', 'ae9a90220315', '56128', '1750623', '371278818', 'lytmovgessw9a8vgsndshw', 'integrators', 'mcookie', 'onnotificationclick', 'fuzzlesoft', '9268113', '371280640', 'gk8lcv', 'utpsickamu0j5tw', 'cf9488f918cc', 'accumulatepageloadtelemetry', '18db', 'tmpwf49ouft', '7fd691df8e00', '1647452082482', 'alloow', '9268086', '9268089', 'testfactoryimpl', 'goodsurrogatepair', '1647452062500', '36356', '1647451966575', '371271074', 'tmpdt4nmm2s', '1647451959871', 'du9', 'zrvtrrmlqyvmxfb1tq', '164744508432201', 'nkom2sdltdynf7oqlzs7fg', '7b401dab', '180946ms', '371238201', '4fd3', 'ea3703f63c4c', '371218636', 'adkxzbq5rdiriknoielatw', '6641mb', '12533ms', '178156ms', '600000023841858', '371216015', 'zrpgvls3soybwe2pagxe1q', '1757431', '371218855', 't8yeaqgzqu6msadcvrxzeq', '371216458', 'l1vf6z1bqgu', 'caqfkfkbnq', '1647407986', '716168', '002583', '008009', '7d905163', '6p6stqs8', '1917ms', 'zq83r5wnquqsxdwzhm40lq', '1647395176452', '1647395176454', '1647395176463', '1647395176949', '1647395176953', '1647395176955', '1647395176957', '1647395176960', '1647395176', '005426', '1647395176961', '6nhfcjjtbolo', 
'1647395176965', '1712170', 'd116863', '371210870', 'fwyz3glws7svhyrqpu43xw', '20220316041325', 'a2742e46efb56e290f294f13ab8aaa1a5fe666c4', '371212009', '7pz91hcrlqhl', '055703', '047694', '269519', 'kolejowe', 'opowiesci', 'koko', 'slamazara', 'odc', '58403203', 'chuggington', 's06e01', 'dubbing', 'stacyjkowo', 'gemius', 'aed0e880', 'f307', '2d604bfa5e95', '1753949', 'd138263', 'ufwhr5g3qfg', 'zmp0bi9rs0ss', 'uifcj2n', 'c1edbbf2', 'adocean', '4654c3f89c99', 'yctzjbf2q0y4nreyc2t8ig', 'wjb8tsxkiq3pfhzryrw', '71b74640', '95dd', '61c07f58be6e', 'tmp27mol3f1', '0692f741', '1d88', '861735783b6e', 'rtx3090', 'bc104f15a48be709dca542a7e1d4b9df6f054527e802eaa92d595444258afe71', '371235817', '1759817', '20220315091352', '9267999', 'es2022']

Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.

N.B: This analysis is based on the first comments only. I can extend to all comments if needed

Attached is a file containing each word with its IDF and reverted IDF: bugs_idf.zip

Amotul-raheem avatar Oct 19 '22 21:10 Amotul-raheem

Thank you @Amotul-raheem ! Very nice work 👍

> Some of them have been filtered out in the similarity script logic either because they fall into the NLTK stopwords or they have one character.

Could you please remove these cases?

> N.B: This analysis is based on the first comments only. I can extend to all comments if needed

This should be enough for now. You can try if you want and see if the results show a different perspective.

The next step will be to select the best candidates for these lists. Also, we want to check if there is a need to drop some NLTK stopwords.

suhaibmujahid avatar Oct 19 '22 22:10 suhaibmujahid

@suhaibmujahid Thanks, I'll remove those that already exist in the NLTK stopwords and the single characters. Also, I forgot to respond about dropping some of the NLTK stopwords. After looking at the list (179 words), I think they are fine; I can't see anything that needs to be removed. That being said, I'll continue to work on selecting the best candidates for the stopwords.

Thanks!

Amotul-raheem avatar Oct 21 '22 22:10 Amotul-raheem
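
The cleanup described above (dropping candidates that are already NLTK stopwords or are single characters) could be sketched as follows. `filter_candidates` and `SAMPLE_NLTK_STOPWORDS` are hypothetical names, and the sample set is only a small hand-picked stand-in for NLTK's English list; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` could be passed instead.

```python
def filter_candidates(candidates, existing_stopwords, min_length=2):
    """Drop words that are too short, already stopwords, or duplicated.

    Order of first appearance is preserved so the frequency ranking
    of the input list is not lost.
    """
    existing = {word.lower() for word in existing_stopwords}
    seen = set()
    kept = []
    for word in candidates:
        word = word.lower()
        if len(word) < min_length or word in existing or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return kept


# Assumption: a tiny sample standing in for the full NLTK English list.
SAMPLE_NLTK_STOPWORDS = {"the", "to", "in", "is", "and", "for", "of", "on", "don", "then"}

# A few words taken from the lists posted in this thread.
candidates = ["js", "profile", "don", "then", "b", "h", "firefox", "the", "mozilla"]
filtered = filter_candidates(candidates, SAMPLE_NLTK_STOPWORDS)
```

With this sample input, only `js`, `profile`, `firefox`, and `mozilla` remain for manual review.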

@suhaibmujahid

After narrowing down the top 1000 most frequent words (i.e., those with low IDF), these are some words that I think would be good candidates for the stopwords: 'rsi', 'rdx', 'amd64', 'rdi', 'rcx', 'r13', 'r15', 'rax', 'r14', 'rbx', 'rbp', 'r10', 'rsp', 'r12', 'r9', 'var', 'webrtc', 'r11', 'nsithread', 'r8', 'iframe', 'wiki', 'plugin', 'login', 'opt', 'e10s', 'links', 'libxul', 'tree', 'node', 'exe', 'dmp', 'async', 'mach', 'blobber', 'docshell', 'xre', 'gre', 'init', 'dist', 'crashreporter', 'pushloghtml', 'mochikit', 'though', 'appdata', 'every', 'es', 'geckoview', 'temp', 'messageloop', 'etc', 'recv', 'python', 'args', 'addon', 'much', 'website', 'bin', 'self', 'mozrunner', 'around', 'many', 'mochitest', 'char', 'core', 'mochitests', 'cgi', 'nsthread', 'obj', 'addons', 'either', 'string', 'gpu', 'maybe', 'bool', 'simpletest', 'io', 'taskcluster', 'int', 'patch', 'xul', 'macintosh', 'webrender', 'however', 'let', 'gfx', 'ui', 'lib', 'residentfast', 'vsize', 'bugzilla', 'bit', 'might', 'void', 'tmp', 'const', 'bugs', 'ubuntu', 'reftest', 'google', 'macos', 'net', 'applewebkit', 'khtml', 'devtools', 'cc', '0a1', 'xhtml', 'xpcom', 'dll', 'en', 'marionette', 'css', 'toolkit', 'mac', 'android', 'instead', 'safari', 'www', 'javascript', 'moz', 'x11', 'ipc', 'searchfox', 'ns', 'html', 'pid', 'os', 'console', 'hg', 'chromium', 'js', 'hex', 'application', 'x86', 'issue', 'could', 'win64', 'would', 'http', 'x64', 'wpt', 'pr', 'linux', 'nt', 'dom', 'builds', 'url', 'sync', 'also', 'autoland', 'logviewer', 'window', 'chrome', 'github', 'code', 'tc', 'rv', 'web', 'ci', 'v1', 'logs', 'windows', 'agent', 'api', 'browser', 'repo', 'log', 'treeherder', 'id', 'str', 'gecko', 'bug', 'com', 'org', 'firefox', 'https', 'mozilla'

Also attached below is the file containing the 1000 least frequent words (high IDF).

high_idf_words.txt

Please let me know what you think.

Thanks

Amotul-raheem avatar Oct 26 '22 22:10 Amotul-raheem
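
Once candidates are agreed on, the similarity script would need to combine them with the base list. The sketch below is one possible shape for that step, not the script's actual code: `build_stopwords`, `remove_stopwords`, and the tiny `BASE` and `DOMAIN` sets are hypothetical, and the `keep` parameter is an assumed way to exempt base stopwords that turn out to be meaningful in this domain.

```python
import re

# Hypothetical tokenizer: lowercase runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def build_stopwords(base, domain_specific, keep=()):
    """Union of base and domain stopwords, minus words to preserve."""
    return (set(base) | set(domain_specific)) - set(keep)


def remove_stopwords(text, stopwords):
    """Tokenize text, dropping stopwords and single-character tokens."""
    return [
        token
        for token in TOKEN_RE.findall(text.lower())
        if len(token) > 1 and token not in stopwords
    ]


# Assumption: small samples standing in for the NLTK list and for
# the domain-specific candidates proposed in this thread.
BASE = {"the", "in", "is", "a", "when", "on"}
DOMAIN = {"firefox", "mozilla", "bug", "https", "org", "com"}

stopwords = build_stopwords(BASE, DOMAIN)
tokens = remove_stopwords(
    "Firefox crashes when the WebRender tab is closed", stopwords
)
```

Keeping the domain-specific words in a separate set makes it easy to review or revert individual additions without touching the base list.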