Jimmy Lin issues

Results: 211 issues of Jimmy Lin

Invert anchor text and store in HBase

Store inverted anchor text in HBase along the lines of the original Bigtable paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf Schema: column family "a", qualifier is the source URL, value is the anchor text.
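A rough sketch of the inversion step, kept in memory so the layout is easy to see. The issue does not state the row key; using the target URL as the row key is an assumption here, and the class and method names (`AnchorTextInverter`, `Link`, `invert`) are hypothetical. No HBase calls are made: the outer map stands in for rows, the inner map for qualifier-to-value cells in column family "a".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names throughout; this only models the table layout in memory.
class AnchorTextInverter {
    /** One outgoing link: the page it appears on, the page it points to, and its anchor text. */
    static final class Link {
        final String sourceUrl;
        final String targetUrl;
        final String anchorText;

        Link(String sourceUrl, String targetUrl, String anchorText) {
            this.sourceUrl = sourceUrl;
            this.targetUrl = targetUrl;
            this.anchorText = anchorText;
        }
    }

    /**
     * Inverts links so that each target URL (assumed row key) maps to
     * source URL (qualifier in family "a") -> anchor text (cell value).
     * If one source links to the same target more than once, the last
     * anchor text seen overwrites the earlier one.
     */
    static Map<String, Map<String, String>> invert(List<Link> links) {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        for (Link link : links) {
            rows.computeIfAbsent(link.targetUrl, k -> new TreeMap<>())
                .put(link.sourceUrl, link.anchorText);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
            new Link("http://a.example/", "http://c.example/", "C site"),
            new Link("http://b.example/", "http://c.example/", "see C"));
        // Prints one row keyed by the target, with one "a:<source URL>" cell per source.
        invert(links).forEach((row, cells) ->
            cells.forEach((qualifier, value) ->
                System.out.println(row + "  a:" + qualifier + " = " + value)));
    }
}
```

In an actual job, each inner-map entry would presumably become one cell written to HBase; sorted maps are used here only because HBase keeps rows and qualifiers in sorted order.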

feature

Prototype metadata extraction application: HTML page title and size

Prototype metadata extraction: extract title (for HTML pages) and size, insert into new column family "m".

feature

Prototype Hadoop bulkloading into HBase

Use Hadoop MapReduce to generate HFiles directly for bulk loading into HBase.

feature

Refactor PageRank arg parsing to use args4j

PageRank-related classes still use commons-cli for parsing args. Refactor to args4j.

Write documentation for the Spark Streaming demo

Documentation can be taken from the 2018w iteration of my big data course: https://lintool.github.io/bigdata-2018w/assignment7-451.html

Adapt integration tests from Cloud9

Cloud9 has a bunch of integration tests that weren't copied over to Bespin - bring them over.

Adapt unit tests from Cloud9

Cloud9 has a bunch of unit tests that weren't copied over to Bespin - bring them over.

RM3 doesn't implement duplicate removal

Since the current RM3 implementation doesn't remove duplicates, trec_eval doesn't work on its output (the output needs to be hand-hacked to delete duplicates).
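One way the hand-hacking could be automated is a small post-processing filter over the run file. The sketch below assumes the standard six-column TREC run format (`qid Q0 docid rank score tag`) and assumes that "duplicate" means the same docid appearing more than once for the same query; the class name `RunDeduplicator` is hypothetical and this is not the RM3 code itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the existing RM3 implementation.
class RunDeduplicator {
    /**
     * Keeps the first line seen for each (qid, docid) pair and drops later repeats.
     * Assumes lines look like "qid Q0 docid rank score tag"; lines with fewer than
     * three fields are passed through unchanged. The rank column is left untouched.
     */
    static List<String> dedupe(List<String> runLines) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String line : runLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                kept.add(line);
                continue;
            }
            String key = fields[0] + "\t" + fields[2]; // qid + docid
            if (seen.add(key)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> run = Arrays.asList(
            "1 Q0 d1 1 9.0 rm3",
            "1 Q0 d2 2 8.0 rm3",
            "1 Q0 d1 3 7.0 rm3");
        // The third line repeats (qid=1, docid=d1) and is dropped.
        dedupe(run).forEach(System.out::println);
    }
}
```

Keeping the first occurrence preserves the higher-scored entry only if the input is already sorted by decreasing score within each query, which is also an assumption here.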

Implement service to return term counts

We need a service to return term counts within a certain time interval. Need to decide: 1. Actual implementation (separate service? squeeze into current service?) 2. Granularity? 3. Just unigrams? Arbitrary...

Update fastutil

The `fastutil` dependencies in `lintools-datatypes-fastutil` are out of date. Upgrade artifacts.