Workbench
Java and Lucene based tools for BitFunnel corpus preparation
This seems like it would screw up phrase queries? An alternate fix would be to add Lucene's stopword list to our parser and submit the appropriately modified phrase query.
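The alternate fix can be sketched roughly as follows: drop stopwords from the phrase before submitting it. This is only an illustration. The class name `PhraseStopwords`, the method `stripStopwords`, and the small hard-coded word set are hypothetical stand-ins; an actual fix would presumably reuse the exact stopword set of the Lucene analyzer that built the corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PhraseStopwords {
    // Stand-in stopword set (assumption). A real fix would load the same set
    // that the Lucene analyzer used when the corpus was processed.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "of", "the", "to"));

    // Returns the phrase terms with stopwords dropped, preserving order.
    static List<String> stripStopwords(String phrase) {
        List<String> kept = new ArrayList<>();
        for (String term : phrase.trim().split("\\s+")) {
            // Compare case-insensitively; skip the empty token from blank input.
            if (!term.isEmpty()
                    && !STOPWORDS.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The remaining terms would be submitted as the modified phrase query.
        System.out.println(stripStopwords("history of the world"));
    }
}
```

Note that removing terms changes which terms are adjacent, so this only matches correctly if the indexed documents had the same stopwords removed in the same way.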
After processing Wikipedia with the fixes as of `274293f3af97c507416f6387020507ee99ca3238`, the tail of the DocFreqTable has a lot of n-grams:
~~~
724ddeaf8cb3c269,1,0,1.93455e-07,Vasilije Veljko Milovanović
e802585d5e004af1,1,0,1.93455e-07,2014 All-Arena Team
7c401744d5d61355,1,1,1.93455e-07,f.a.cortez
dafa24ba41b2a01d,1,0,1.93455e-07,Coeliades ramanatek
1a8055b58daaf330,1,0,1.93455e-07,Jeff...
~~~
If we look at the Wikipedia dump currently hosted on Azure, the modal number of postings per document is `5`, and things drop off rapidly from there:
~~~
Postings,Count
0,5...
~~~
See:
https://en.wikipedia.org/?curid=28831157
https://en.wikipedia.org/?curid=1468119
https://en.wikipedia.org/?curid=31533859

In combination with the `--lists` issue, this is resulting in empty documents instead of 1-posting documents. With the `--lists` issue fixed, these should turn into...
For example:
~~~
6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college
~~~
We also have terms with double underscores that appear to be some kind of metadata?
~~~
868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__
~~~
As well as weird...
I thought that we were doing this. If we're not and that's on purpose, that's fine, but we have (for example) `downlink`, `downlinks`, and `downlinked` in our DocumentFrequencyTable when we...
Repro:
BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b421254f82348a811d7a886683c54c5a75
StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text

Shouldn't have bigrams, shouldn't have capital letters:

Bigram where none expected (also capital letter):
72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,up─üsik─ü...
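The two expectations in this repro (no bigrams, no capital letters) could be checked mechanically with something like the sketch below. It assumes, without confirmation from the BitFunnel sources, that each table row has five comma-separated fields and that the term text is everything after the fourth comma. `DocFreqTableCheck` and its methods are hypothetical names, not part of Workbench or BitFunnel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocFreqTableCheck {
    // Assumption: the term text is the fifth field. Splitting with a limit
    // of 5 keeps any commas inside the text itself.
    static String termText(String row) {
        String[] parts = row.split(",", 5);
        return parts.length == 5 ? parts[4] : "";
    }

    // A term containing whitespace looks like an n-gram rather than a unigram.
    static boolean hasWhitespace(String text) {
        return text.codePoints().anyMatch(Character::isWhitespace);
    }

    // A term containing an uppercase letter was not lowercased.
    static boolean hasUpperCase(String text) {
        return text.codePoints().anyMatch(Character::isUpperCase);
    }

    // Returns the rows whose term text violates either expectation.
    static List<String> suspiciousRows(List<String> rows) {
        List<String> flagged = new ArrayList<>();
        for (String row : rows) {
            String text = termText(row);
            if (hasWhitespace(text) || hasUpperCase(text)) {
                flagged.add(row);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Two rows copied from the report above.
        List<String> rows = Arrays.asList(
                "72a2c4b53c781027,1,1,0.000144196,zephyrinus",
                "3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry");
        suspiciousRows(rows).forEach(System.out::println);
    }
}
```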
The corpus as processed by the current version of Workbench contains characters (mostly punctuation) that cause the BitFunnel parser to crash. This commit will cause Workbench to handle these cases...
Currently ProcessDocumentHeader() does not use the Lucene analyzer for the document title. This leads to problems with terms that contain colons. As an example, in the file AA\wiki_83, document 11327...
Near the bottom of the readme, the output section actually shows the input files. This should show the chunk files instead. Here's the output:
$ ls -l sample-output/
total 8...