Dan Luu
We should enforce some kind of style with `ClangFormat`. @MikeHopcroft, let's discuss the particulars of what will be enforced when I'm back in town.
What's left over? I took a look at this and it wasn't obvious to me what still needs to be removed.
Sorry, I haven't touched Scala since my last commit to this project and don't remember anything helpful.
Of the `10677410` terms that appear once, `4512265` (or 42%) are n-grams. This is out of `14643204` total terms.
Note that the n-grams don't seem to be downcased.
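As a quick sanity check on the percentages above, here is a minimal sketch that recomputes them from the three quoted counts. The variable names and the `share` helper are made up for illustration; only the numbers come from the comment.

```python
# Counts quoted in the comment above.
singleton_terms = 10_677_410   # terms that appear exactly once
singleton_ngrams = 4_512_265   # singleton terms that are n-grams
total_terms = 14_643_204       # all terms


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole (hypothetical helper)."""
    return 100.0 * part / whole


# Share of singleton terms that are n-grams (the "42%" figure).
ngram_share = share(singleton_ngrams, singleton_terms)

# Share of all terms that appear only once.
singleton_share = share(singleton_terms, total_terms)

print(f"n-grams among singleton terms: {ngram_share:.1f}%")
print(f"singleton terms among all terms: {singleton_share:.1f}%")
```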
If you want something from chunked1, that has
~~~
3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__
~~~
[This document](https://en.wikipedia.org/?curid=36699652) turns out to be 0 sized, which seems a bit surprising. It has content in it, and the content has been there for years, so it's not that...
BTW, the chunk files with 0 length after filtering are: ~~~ -rw-rw-r-- 1 danluu danluu 27 Dec 7 23:37 Chunk-1361.chunk -rw-rw-r-- 1 danluu danluu 27 Dec 7 23:35...
Does that fix change [36699652](https://en.wikipedia.org/?curid=36699652)? It shouldn't be zero length anymore if the list is included, but it looks like it shouldn't have been zero length in the first place.
After the last set of fixes, the mode has changed from `5` to `24`. We no longer have any 0-length documents and the number of 1-length documents went down from...
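For reference, a rough sketch of how the mode and the count of very short documents could be computed from a list of per-document lengths. The function names and the sample lengths are invented for illustration; this is not the project's actual tooling or data.

```python
from collections import Counter


def length_mode(lengths):
    """Return the most common document length (first one seen wins ties)."""
    return Counter(lengths).most_common(1)[0][0]


def count_with_length(lengths, target):
    """Count how many documents have exactly the given length."""
    return sum(1 for n in lengths if n == target)


# Hypothetical per-document lengths, not real corpus data.
sample_lengths = [24, 5, 24, 1, 24, 5, 130]

print("mode:", length_mode(sample_lengths))
print("0-length docs:", count_with_length(sample_lengths, 0))
print("1-length docs:", count_with_length(sample_lengths, 1))
```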