Jimmy Lin
Store inverted anchor text in HBase along the lines of the original Bigtable paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf Schema: column family "a"; the qualifier is the source URL and the value is the anchor text.
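A minimal sketch of the proposed layout, written in Python with plain dictionaries standing in for HBase rows. The helper names `reverse_host` and `invert_anchors` are hypothetical, and the reversed-hostname row key is an assumption borrowed from the Bigtable paper's Webtable example, not something this issue specifies.

```python
from collections import defaultdict


def reverse_host(url):
    """Turn 'http://www.cnn.com/index.html' into 'com.cnn.www/index.html'.

    Assumption: row keys use reversed hostnames, as in the Bigtable paper.
    """
    without_scheme = url.split("://", 1)[-1]
    host, _, path = without_scheme.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")


def invert_anchors(links):
    """Group (source_url, target_url, anchor_text) triples by target page.

    Each target becomes one row; each inbound link becomes one cell in
    column family "a", with the source URL as the qualifier.
    """
    table = defaultdict(dict)
    for source, target, anchor in links:
        # A later link from the same source overwrites the earlier cell.
        table[reverse_host(target)]["a:" + source] = anchor
    return dict(table)


links = [
    ("http://cnnsi.com/", "http://www.cnn.com/", "CNN"),
    ("http://my.look.ca/", "http://www.cnn.com/", "CNN.com"),
]
rows = invert_anchors(links)
```

In this sketch all anchor text pointing at a page ends up in a single row, so one row lookup would return every inbound anchor for that page.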
Prototype metadata extraction: extract title (for HTML pages) and size, insert into new column family "m".
Use Hadoop MapReduce to generate HFiles directly for bulk loading into HBase.
PageRank-related classes still use commons-cli for parsing arguments. Refactor them to use args4j.
Documentation can be taken from the 2018w iteration of my big data course: https://lintool.github.io/bigdata-2018w/assignment7-451.html
Cloud9 has a bunch of integration tests that weren't copied over to Bespin - bring them over.
Cloud9 has a bunch of unit tests that weren't copied over to Bespin - bring them over.
Since the current RM3 implementation doesn't remove duplicates, trec_eval doesn't work on its output (the run file has to be hand-edited to delete duplicates).
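Until duplicate removal lands in the RM3 code itself, the hand-editing step could be scripted. The sketch below is a post-processing workaround, not the fix this issue asks for. It assumes the standard six-column TREC run format (`qid Q0 docid rank score tag`) and assumes that keeping the highest-scoring copy of each document is the desired policy; `dedup_run` is a hypothetical name.

```python
from collections import defaultdict


def dedup_run(lines):
    """Drop duplicate (qid, docid) pairs from TREC run lines, then re-rank.

    Assumption: the highest-scoring copy of a duplicated document is kept.
    """
    best = {}  # (qid, docid) -> (score, tag)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, docid, _, score, tag = parts
        key = (qid, docid)
        score = float(score)
        if key not in best or score > best[key][0]:
            best[key] = (score, tag)

    by_query = defaultdict(list)
    for (qid, docid), (score, tag) in best.items():
        by_query[qid].append((score, docid, tag))

    out = []
    for qid, docs in by_query.items():  # queries in first-seen order
        # Sort by descending score, breaking ties by docid.
        ranked = sorted(docs, key=lambda t: (-t[0], t[1]))
        for rank, (score, docid, tag) in enumerate(ranked, start=1):
            out.append(f"{qid} Q0 {docid} {rank} {score:.4f} {tag}")
    return out


run = [
    "1 Q0 d1 1 2.5 rm3",
    "1 Q0 d2 2 2.0 rm3",
    "1 Q0 d1 3 1.5 rm3",  # duplicate of d1 with a lower score
]
cleaned = dedup_run(run)
```

Ranks are reassigned after deduplication so that they stay contiguous within each query.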
We need a service to return time counts within a certain interval. Need to decide: 1. Actual implementation (separate service? squeeze into current service?) 2. Granularity? 3. Just unigrams? Arbitrary...
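One possible shape for the lookup, independent of where the service ends up living. This is only a sketch under several assumptions that the open questions above would need to settle: counts are per unigram, timestamps are integers at whatever granularity is chosen, and intervals are half-open `[start, end)`. The class name `TermTimeIndex` and its methods are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


class TermTimeIndex:
    """Per-term sorted timestamp lists, queried with binary search."""

    def __init__(self):
        self._times = defaultdict(list)

    def add(self, term, timestamp):
        self._times[term].append(timestamp)

    def freeze(self):
        # Sort once after loading so that count() can use binary search.
        for timestamps in self._times.values():
            timestamps.sort()

    def count(self, term, start, end):
        """Number of occurrences of term with start <= timestamp < end."""
        timestamps = self._times.get(term, [])
        return bisect_left(timestamps, end) - bisect_left(timestamps, start)


index = TermTimeIndex()
for t in (30, 10, 20):
    index.add("obama", t)
index.freeze()
```

With timestamps kept sorted, each interval query costs two binary searches per term, regardless of how wide the interval is.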
The `fastutil` dependencies in `lintools-datatypes-fastutil` are out of date. Upgrade artifacts.