Jimmy Lin issues

Results: 211 issues of Jimmy Lin

Re-upgrade Guava (UKWA's WARC Hadoop indexer dependency)

I had to downgrade Guava to accommodate UKWA's WARC Hadoop indexer: https://github.com/lintool/warcbase/blob/master/pom.xml#L277 But this issue now appears to be fixed: https://github.com/ukwa/webarchive-discovery/commit/969ff3528219267ff31cfa3169f099a9df6a0567 Thanks @anjackson

cleanup

Automatic topic issues classifier

@ianmilligan1 Do you know about this? http://www.policyagendas.org/page/topic-codebook Do you think it'd be useful to build a topic classifier and integrate it in Warcbase? The classifier would take in a blob...

feature

Loading HBase config on startup

Currently, we require the user to do this:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
```

In order to get the configs loaded correctly... fix this via `conf.addResource`.

cleanup

Page load latency evaluation

Build a simple Selenium-based program to measure page-load latency on Warcbase.

feature

Add documentation about pywb-warcbase

Add documentation referencing pywb-warcbase as an alternative front-end to OpenWayback: https://github.com/ikreymer/pywb-warcbase

documentation

Wayback issue with http://appropriations.house.gov/

If you visit `http://appropriations.house.gov/`, the navbar on the left attempts to take you to the live web, not an archived capture.

bug

Build CLI interface for admin metadata table

Basic idea is to have a `warcbase.meta` table for storing collection-level metadata, e.g.,

- the Lucene FST for mapping URL id
- record of data ingestion
- ARC/WARC
- etc....

feature

Prototype webapp for browsing link structure

Jinfeng: Once links are extracted from the ARC data, they should be in the form of (source, destination) pairs, where both are ids mapped using the Lucene FST. This can...
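As a rough illustration of the (source, destination) pair format described above, the sketch below groups such pairs into an adjacency list that a browsing front-end could query by source id. The class name `LinkGraph`, the method `toAdjacency`, and the sample ids are hypothetical and are not part of Warcbase; the ids are assumed to be plain integers already mapped through the Lucene FST.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Warcbase.
class LinkGraph {
    // Groups (source, destination) id pairs into an adjacency list keyed by source id.
    static Map<Integer, List<Integer>> toAdjacency(int[][] pairs) {
        Map<Integer, List<Integer>> adjacency = new TreeMap<>();
        for (int[] pair : pairs) {
            // pair[0] is the source id, pair[1] is the destination id
            adjacency.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return adjacency;
    }

    public static void main(String[] args) {
        // Made-up ids for illustration only.
        int[][] pairs = { {1, 2}, {1, 3}, {2, 3} };
        System.out.println(toAdjacency(pairs));
    }
}
```

Keeping the map sorted by source id is only a convenience for display; a real prototype might instead read the pairs from the extraction output and store them wherever the webapp can reach them.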

feature

Ingestion bug in copyStream, wrong number of bytes expected

```
14/08/10 09:13:18 ERROR ingest.IngestFiles: Error ingesting file: /scratch0/webarchive/congress108/arc.sample/CONGRESS01-20040124072939-193.arc.gz
java.io.IOException: Read 394 but expected 439
	at org.warcbase.ingest.IngestFiles.copyStream(IngestFiles.java:63)
	at org.warcbase.ingest.IngestFiles.ingestArcFile(IngestFiles.java:102)
	at org.warcbase.ingest.IngestFiles.ingestFolder(IngestFiles.java:163)
	at org.warcbase.ingest.IngestFiles.main(IngestFiles.java:220)
```
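One possible cause of a mismatch like this, offered only as an assumption since the source of `IngestFiles.copyStream` is not shown here, is treating a single `InputStream.read` call as if it always fills the requested length. A single call may return fewer bytes, so a copy routine usually has to loop. The sketch below shows that pattern; `StreamCopy` and `copyExactly` are hypothetical names, not the Warcbase implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not the actual IngestFiles.copyStream.
class StreamCopy {
    // Copies exactly `expected` bytes from `in` to `out`, looping because one
    // read() call may legally return fewer bytes than were requested.
    static long copyExactly(InputStream in, OutputStream out, long expected) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        while (total < expected) {
            int want = (int) Math.min(buffer.length, expected - total);
            int n = in.read(buffer, 0, want);
            if (n < 0) {
                // The stream really ended early: report how far we got.
                throw new IOException("Read " + total + " but expected " + expected);
            }
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[439];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyExactly(new ByteArrayInputStream(data), out, data.length);
        System.out.println(copied);
    }
}
```

With this shape, the exception is raised only when the stream ends before the expected length, which would point to a truncated record or a wrong declared length rather than a short read.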

bug

Add option to ExtractSiteLinks to scan HBase table

Similar to issue #42 except for `ExtractSiteLinks`

feature