
Odd results for ". . ."

Open nickom opened this issue 10 years ago • 4 comments

http://capitolwords.org/term/..._/

Found because it was listed as the top 5 word phrase for this date: http://capitolwords.org/date/2014/04/28/

[Screenshot, 2014-05-06 4:35:33 PM]

nickom avatar May 06 '14 20:05 nickom

Yeah, this is a known issue. We use an ngram parser similar to Google's, which treats punctuation as distinct tokens. I believe these are low-volume days that have either sequences of dots in rollcalls or similar 'table of contents' style pages. Definitely on the list.
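To illustrate how a punctuation-as-token parser can make a run of dots the top phrase, here is a minimal sketch. It is not the project's actual parser: the `tokenize` and `ngrams` helpers, the token regex, and the sample line are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters are tokens, and every
# punctuation mark is its own token (an assumption, not the real parser).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into lowercase word tokens and single-punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def ngrams(tokens, n):
    """Return all n-token phrases, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up 'table of contents' style line with a dot leader.
line = "Mr. Smith . . . . . . 12"
tokens = tokenize(line)
top_phrase, count = Counter(ngrams(tokens, 5)).most_common(1)[0]
print(repr(top_phrase), count)
```

On a low-volume day, a few lines like this could be enough for the all-dots 5-gram to outrank every real phrase.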

drinks avatar May 06 '14 20:05 drinks

Gotcha. The other thing that was so odd to me was that the highlighted examples had letters in them: [Screenshot, 2014-05-06 4:51:44 PM]

nickom avatar May 06 '14 20:05 nickom

Guessing that's a separate issue, related to "." being the regexp for 'match any character'. Code here: https://github.com/sunlightlabs/Capitol-Words/blob/2bf155cd586847ea32ed294a8a3e6997e822199e/cwod_site/cwod/views.py#L318-L332
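A minimal sketch of that effect, assuming the search term ends up in a regular expression without escaping. This is not the code from `views.py`; the `term` and `text` values are made up for illustration.

```python
import re

term = ". . ."          # the phrase being highlighted
text = "a b c . . ."    # made-up sample text

# Unescaped: each "." matches any character, so letters match too.
unescaped = re.findall(term, text)

# Escaped: re.escape makes each "." match only a literal dot.
escaped = re.findall(re.escape(term), text)

print(unescaped)  # ['a b c', '. . .']
print(escaped)    # ['. . .']
```

If this is what is happening, passing the term through `re.escape` before building the pattern would limit highlights to literal dots.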

drinks avatar May 06 '14 20:05 drinks

Also, shorter versions of the dots are the top words, and their links lead to server errors or 404s. Here are the links for the top words on that day:

Two words (not found): http://capitolwords.org/term/

Three words (server error): http://capitolwords.org/term/._/

Four words: http://capitolwords.org/term/../

Five words: http://capitolwords.org/term/..._/

nickom avatar May 06 '14 20:05 nickom