Jimmy Lin
I had to downgrade Guava to accommodate UKWA's WARC Hadoop indexer: https://github.com/lintool/warcbase/blob/master/pom.xml#L277 But this issue now appears to be fixed: https://github.com/ukwa/webarchive-discovery/commit/969ff3528219267ff31cfa3169f099a9df6a0567 Thanks, @anjackson!
@ianmilligan1 Do you know about this? http://www.policyagendas.org/page/topic-codebook Do you think it'd be useful to build a topic classifier and integrate it in Warcbase? The classifier would take in a blob...
Currently, we require the user to do this:

```
$ setenv CLASSPATH_PREFIX "/etc/hbase/conf/"
```

in order to get the configs loaded correctly. Fix this via `conf.addResource`.
Build a simple Selenium-based program to measure page-load latency on Warcbase.
Add documentation referencing pywb-warcbase as an alternative front-end to OpenWayback: https://github.com/ikreymer/pywb-warcbase
If you visit `http://appropriations.house.gov/`, the navbar on the left attempts to take you to the live web, not an archived capture.
Basic idea is to have a `warcbase.meta` table for storing collection-level metadata, e.g.,

- the Lucene FST for mapping URL id
- record of data ingestion
- ARC/WARC
- etc....
Jinfeng: Once links are extracted from the ARC data, they should be in the form of (source, destination) pairs, where both are ids mapped using the Lucene FST. This can...
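To make the (source, destination) id pairs concrete, here is a minimal sketch. It makes several assumptions: a sorted in-memory map stands in for the Lucene FST, ids are assigned densely in sorted URL order, and links to URLs missing from the mapping are dropped. The names `LinkIdMapper`, `buildIdMap`, and `toIdPairs` are hypothetical and not part of Warcbase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class LinkIdMapper {
    // Hypothetical stand-in for the Lucene FST: assigns each distinct URL
    // a dense integer id in sorted (lexicographic) order.
    static Map<String, Integer> buildIdMap(Iterable<String> urls) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String url : urls) {
            sorted.add(url);
        }
        Map<String, Integer> ids = new TreeMap<>();
        int next = 0;
        for (String url : sorted) {
            ids.put(url, next++);
        }
        return ids;
    }

    // Converts (source URL, destination URL) pairs into (source id, destination id)
    // pairs. Assumption: links with an unmapped endpoint are skipped.
    static List<int[]> toIdPairs(List<String[]> links, Map<String, Integer> ids) {
        List<int[]> pairs = new ArrayList<>();
        for (String[] link : links) {
            Integer src = ids.get(link[0]);
            Integer dst = ids.get(link[1]);
            if (src == null || dst == null) {
                continue;
            }
            pairs.add(new int[] {src, dst});
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = buildIdMap(Arrays.asList(
            "http://b.example/", "http://a.example/", "http://c.example/"));
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"http://b.example/", "http://a.example/"});
        links.add(new String[] {"http://a.example/", "http://c.example/"});
        for (int[] pair : toIdPairs(links, ids)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Storing integer pairs rather than URL strings keeps the link records compact, at the cost of needing the mapping to translate ids back to URLs.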
```
14/08/10 09:13:18 ERROR ingest.IngestFiles: Error ingesting file: /scratch0/webarchive/congress108/arc.sample/CONGRESS01-20040124072939-193.arc.gz
java.io.IOException: Read 394 but expected 439
	at org.warcbase.ingest.IngestFiles.copyStream(IngestFiles.java:63)
	at org.warcbase.ingest.IngestFiles.ingestArcFile(IngestFiles.java:102)
	at org.warcbase.ingest.IngestFiles.ingestFolder(IngestFiles.java:163)
	at org.warcbase.ingest.IngestFiles.main(IngestFiles.java:220)
```
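The actual cause of this mismatch has not been diagnosed here. One possibility worth ruling out is a short read: `InputStream.read(byte[], int, int)` may return fewer bytes than requested even when more data is coming, which is common with compressed streams. The sketch below assumes that cause and loops until the expected count is reached. `ReadFullyExample`, `readExactly`, and `chunked` are hypothetical names, not the code in `IngestFiles`; if the record is genuinely truncated, the loop still fails with the same kind of message.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFullyExample {
    // Reads exactly `expected` bytes, looping over short reads.
    // Throws only if the stream ends before `expected` bytes arrive.
    static byte[] readExactly(InputStream in, int expected) throws IOException {
        byte[] buf = new byte[expected];
        int total = 0;
        while (total < expected) {
            int n = in.read(buf, total, expected - total);
            if (n < 0) {
                throw new IOException("Read " + total + " but expected " + expected);
            }
            total += n;
        }
        return buf;
    }

    // Demonstration helper (hypothetical): a stream that never returns more
    // than `maxChunk` bytes per read call, to simulate short reads.
    static InputStream chunked(byte[] data, final int maxChunk) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, maxChunk));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[439];
        byte[] copy = readExactly(chunked(data, 394), data.length);
        System.out.println("Read " + copy.length + " bytes");
    }
}
```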
Similar to issue #42 except for `ExtractSiteLinks`