warcbase
Prototype webapp for browsing link structure
Jinfeng: Once links are extracted from the ARC data, they should be in the form of (source, destination) pairs, where both source and destination are ids mapped from URLs using the Lucene FST. These pairs can be stored in a table inside Derby.
(Remember, don't actually check the data into Git.)
We can embed Derby directly inside a Jetty webapp to avoid the need for a MySQL installation (the tables should be small enough to keep performance reasonable). Let's start with simple functionality like being able to browse forward and backward links, e.g., given a page (URL or id), list both incoming and outgoing edges.
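
To make this concrete, here's a minimal sketch of what the embedded-Derby approach could look like. The table and column names (links, source_id, dest_id) are placeholders, not anything in the codebase yet:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class LinkStore {
  // Embedded Derby: no separate server process, the database lives in a local directory.
  private final Connection conn;

  public LinkStore(String dbPath) throws Exception {
    conn = DriverManager.getConnection("jdbc:derby:" + dbPath + ";create=true");
    // Unguarded table creation, just for illustration.
    try (Statement s = conn.createStatement()) {
      s.execute("CREATE TABLE links (source_id INT NOT NULL, dest_id INT NOT NULL)");
      s.execute("CREATE INDEX idx_source ON links (source_id)");
      s.execute("CREATE INDEX idx_dest ON links (dest_id)");
    }
  }

  // Forward links: pages that the given page links to.
  public List<Integer> outlinks(int id) throws Exception {
    return query("SELECT dest_id FROM links WHERE source_id = ?", id);
  }

  // Backward links: pages that link to the given page.
  public List<Integer> inlinks(int id) throws Exception {
    return query("SELECT source_id FROM links WHERE dest_id = ?", id);
  }

  private List<Integer> query(String sql, int id) throws Exception {
    List<Integer> results = new ArrayList<Integer>();
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setInt(1, id);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          results.add(rs.getInt(1));
        }
      }
    }
    return results;
  }
}

With indexes on both columns, forward and backward lookups should both be cheap at the table sizes we expect.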
Here's my revised thinking on this: we should probably just store the link structure in an HBase table:
row key = reversed URL (e.g., house.gov.www)
column family = "link"
column qualifier = timestamp of capture
value = list of ids, e.g., "353,546,503" (ids of outgoing links)
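
In code, writing one row in that layout would look roughly like this (using the older HTable/Put client API; the table name "webgraph" is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteLinkRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webgraph");

    // Row key: reversed URL (matching the example above); column family "link";
    // qualifier: capture timestamp; value: comma-separated ids of outgoing links.
    Put put = new Put(Bytes.toBytes("house.gov.www"));
    put.add(Bytes.toBytes("link"), Bytes.toBytes("20131014000000"),
        Bytes.toBytes("353,546,503"));
    table.put(put);
    table.close();
  }
}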
So we should have a program along these lines:
hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
org.warcbase.analysis.graph.IngestWebgraph \
-hdfs INPUT_PATH -output OUTPUT_HBASE_TABLE \
-urlMapping sampleUrls.fst
We can probably modify ExtractLinks for this purpose, although instead of the reducer writing to HDFS we'd need it to write to an HBase table.
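
A rough sketch of what that reducer could look like, assuming the shuffle key is "reversedUrl\ttimestamp" and the values are the outgoing link ids for that capture (the actual ExtractLinks key/value types may differ):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Writes one HBase cell per (reversed URL, capture timestamp), with the
// comma-separated outgoing link ids as the value.
public class LinkTableReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String[] parts = key.toString().split("\t");
    String reversedUrl = parts[0];
    String timestamp = parts[1];

    StringBuilder ids = new StringBuilder();
    for (Text v : values) {
      if (ids.length() > 0) ids.append(",");
      ids.append(v.toString());
    }

    Put put = new Put(Bytes.toBytes(reversedUrl));
    put.add(Bytes.toBytes("link"), Bytes.toBytes(timestamp), Bytes.toBytes(ids.toString()));
    context.write(null, put);
  }
}

In the driver, org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableReducerJob(OUTPUT_HBASE_TABLE, LinkTableReducer.class, job) would replace the usual FileOutputFormat setup.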
What sort of queries will be made by the webapp when browsing this link structure table?
Here's what I'm thinking the REST API will look like, modeled after the Wayback REST schema: http://host/table/URL, where the URL would be something like http://foo.bar.com/. The output would be JSON, something like:
[
  { "capture": TIMESTAMP1, "outlinks": [253, 325, 573, ...] },
  { "capture": TIMESTAMP2, "outlinks": [253, 325, 573, ...] },
  ...
]
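
As a sketch (not a commitment to any particular framework), the endpoint could be a plain servlet running inside the same Jetty instance. Everything here (class name, per-request HTable, the URL-reversal helper) is illustrative:

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Handles GET /TABLE/URL: fetches all captures of the URL from the link table
// and emits one {capture, outlinks} object per stored timestamp.
public class LinkServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // Path is /TABLE/URL; everything after the table name is the target URL.
    String[] parts = req.getPathInfo().substring(1).split("/", 2);
    String tableName = parts[0];
    String url = parts[1];

    Configuration conf = HBaseConfiguration.create();
    // A real implementation would reuse a single connection rather than open one per request.
    HTable table = new HTable(conf, tableName);
    Get get = new Get(Bytes.toBytes(reverse(url)));
    get.addFamily(Bytes.toBytes("link"));
    Result result = table.get(get);
    table.close();

    StringBuilder json = new StringBuilder("[");
    boolean first = true;
    for (KeyValue kv : result.raw()) {
      if (!first) json.append(",");
      first = false;
      json.append("{\"capture\":\"").append(Bytes.toString(kv.getQualifier())).append("\",");
      json.append("\"outlinks\":[").append(Bytes.toString(kv.getValue())).append("]}");
    }
    json.append("]");

    resp.setContentType("application/json");
    resp.getWriter().write(json.toString());
  }

  // Placeholder: reverse the hostname so it matches whatever row-key scheme the ingest job uses.
  private String reverse(String url) {
    String host = url.replaceFirst("^https?://", "").replaceAll("/.*$", "");
    String[] parts = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      if (sb.length() > 0) sb.append(".");
      sb.append(parts[i]);
    }
    return sb.toString();
  }
}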
Will clients then have to make a new query against HBase to get the actual URLs, or will they have a bunch cached on their side?
The id -> URL mapping is encoded in a Lucene FST, which is relatively compact. The client can load this into memory and perform the mapping itself. The idea (mostly in my head, haven't converted into open issues yet) is that there would be a "metadata" table to hold collection-level metadata: the FST, number of docs, ingestion date, etc.
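
For reference, client-side lookups could look something like this, assuming an FST<Long> whose inputs are URL bytes and whose outputs are the ids (Lucene's Util.getByOutput only works when outputs increase along the sorted input keys, which is the usual way such a mapping is built):

import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.Util;

public class UrlIdLookup {
  private final FST<Long> fst;

  public UrlIdLookup(FST<Long> fst) {
    this.fst = fst;
  }

  // URL -> id: exact-match lookup of the URL's bytes in the FST.
  public Long getId(String url) throws IOException {
    return Util.get(fst, new BytesRef(url));
  }

  // id -> URL: walk the FST back from the output value to recover the input key.
  public String getUrl(long id) throws IOException {
    IntsRef key = Util.getByOutput(fst, id);
    if (key == null) return null;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < key.length; i++) {
      sb.append((char) key.ints[key.offset + i]);
    }
    return sb.toString();
  }
}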