datavis-hackathon
Crawl and prepare NSF ACADIS, NASA AMD, and NSIDC Arctic Data Explorer datasets, Part 2
Building off of https://github.com/NCEAS/open-science-codefest/issues/26, continue the data prep and crawl of AMD, ACADIS, and ADE, with the goal of preparing some of the data for visualization (GeoViz, science-focused viz, etc.).
Participants would use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/), and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting polar parameters for visualization experts to then hack on during a two-day NSF visualization hackathon in NYC in November. Be part of doing something real: contribute to Apache projects (earning merit, and potentially becoming a committer and PMC member yourself) while also contributing to NSF and NASA goals!
Angela's recent progress:
(1) [Done] Use Apache Nutch and Solr to crawl and index local data files.
(2) [Done] Index content metadata and parse metadata from Apache Nutch into Solr.
(3) [Done] Integrate the Apache OODT File Manager with Apache Solr using RADiX.
(4) [In progress] Crawl the ACADIS website using Apache Nutch and Solr.
Thanks @snowangelwmy! Please contact @pzimdars to get your ACADIS Nutch crawler deployed on AWS, OK?
Ok, when I am done, I will contact @pzimdars. Thanks.
Vineet's progress:
- Developed a GRIB parser for Tika; progress on the feature is tracked at https://issues.apache.org/jira/browse/TIKA-1423.
- An initial patch has been published at https://reviews.apache.org/r/27414/. I am working through the suggestions raised by the reviewers.
Begin by downloading Nutch. The three target data portals:
NASA AMD: http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2
NSF ACADIS: https://www.aoncadis.org/home.htm
NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/
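Nutch reads its seeds from plain-text files in a seed directory, so a minimal setup looks like the sketch below (the urls/ name is an assumption; it just has to match what gets passed to bin/crawl later):

mkdir urls
cat > urls/seed.txt <<'EOF'
http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2
https://www.aoncadis.org/home.htm
http://nsidc.org/acadis/search/
EOF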
Hi @chrismattmann, the regex-urlfilter.txt can be found here:
https://www.dropbox.com/s/hl6wlvwbr4xrv81/regex-urlfilter.txt?dl=0
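For anyone who can't reach the Dropbox link: Nutch's regex-urlfilter.txt is a list of regexes prefixed with + (accept) or - (reject), evaluated top to bottom. A hypothetical version scoped to the three portals (not the actual hackathon file) might look like:

# accept the three data portals
+^http://gcmd\.gsfc\.nasa\.gov/
+^https?://(www\.)?aoncadis\.org/
+^http://nsidc\.org/acadis/
# reject everything else
-.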
Update properties in conf/nutch-default.xml:
http.agent.name = NSF DataViz Hackathon Crawler
http.agent.email = [email protected]
http.agent.host=localhost
http.content.limit=-1
plugin.includes: delete indexer-solr from the list
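For reference, each of those settings is an XML <property> element in the Nutch conf file (overriding in conf/nutch-site.xml rather than editing nutch-default.xml is the usual convention); the first one, for example, would look like:

<property>
  <name>http.agent.name</name>
  <value>NSF DataViz Hackathon Crawler</value>
</property>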
./bin/crawl urls/ crawl http://localhost 3
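A quick gloss on those arguments (a sketch; run ./bin/crawl with no arguments to see the exact usage string for your Nutch version):

# urls/             directory containing the seed list(s)
# crawl             crawl dir/ID under which crawldb, linkdb, and segments land
# http://localhost  Solr URL used for indexing
# 3                 number of fetch rounds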
Please make sure your JAVA_HOME environment variable is set.
export JAVA_HOME=/usr
echo $JAVA_HOME
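If you are on a Mac (the /usr value above is typical of Linux), the JDK path is usually found with:

export JAVA_HOME=$(/usr/libexec/java_home)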
Download Solr:
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.2
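A sketch of grabbing and unpacking it (the archive.apache.org path is an assumption; closer.cgi will point you at a live mirror):

curl -O http://archive.apache.org/dist/lucene/solr/4.10.2/solr-4.10.2.tgz
mkdir -p $HOME/tmp
tar xzf solr-4.10.2.tgz -C $HOME/tmp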
Download the Tika app:
curl -k -O https://repository.apache.org/service/local/repo_groups/snapshots-group/content/org/apache/tika/tika-app/1.7-SNAPSHOT/tika-app-1.7-20141103.165816-465.jar
mkdir -p $HOME/tmp/tika
mv tika-app-1.7-20141103.165816-465.jar $HOME/tmp/tika
alias tika="java -jar $HOME/tmp/tika/tika-app-1.7-20141103.165816-465.jar"
tika -m ftp://sidads.colorado.edu/pub/DATASETS/AMSRE/TOOLS/land_mask/Sea_Ice_V003/amsr_gsfc_12n.hdf
Content-Length: 547605
Content-Type: application/x-hdf
HDF4_Version: 4.1.3 (NCSA HDF Version 4.1 Release 3, May 1999)
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.hdf.HDFParser
_History: Direct read of HDF4 file through CDM library
resourceName: amsr_gsfc_12n.hdf
Try:
tika --help
Try this:
tika -m ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02202_v2/north/daily/2013/seaice_conc_daily_nh_f17_20130102_v02r00.nc
Hi Prof. @chrismattmann, why do I need to install tika-app? Nutch already has the parse-tika component.
@snowangelwmy if you look at the URL @chrismattmann posted, you will see that he referenced a SNAPSHOT. This is so we can use some of the newer features of Tika. Try it out :) Also, we are hacking on Tika at this hackathon, so we are using the development versions for parsing .grb files.
Got it! I have crawled some ACADIS web pages ("numFound": 572). However, all files that have been indexed into my Solr are of type "application/xhtml+xml". I am wondering how to crawl files of other types, e.g., PDF, JPG? Thank you!
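One thing to check (an assumption based on Nutch's stock configuration, not verified against your setup): the default regex-urlfilter.txt rejects common binary suffixes before they are ever fetched, via a rule along these lines, which you can trim or comment out:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$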
For Solr, please find the Nutch schema here:
curl -O http://svn.apache.org/repos/asf/nutch/trunk/conf/schema.xml
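Then drop it into the Solr example core, replacing the stock schema (a sketch, assuming the default collection1 layout of the Solr 4.10.2 example):

cp schema.xml $HOME/tmp/solr-4.10.2/example/solr/collection1/conf/schema.xml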
Command for checking the Nutch crawl data:
./bin/nutch readseg -dump ./crawl/segments/20141103100202/ output
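readseg writes a plain-text dump into the named output directory (an assumption based on stock Nutch behavior), so you can eyeball the fetched content with:

less output/dump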
Check out this wiki page: http://wiki.apache.org/solr/SolrJetty
OK, ignore that wiki page. In your $HOME/tmp/solr-4.10.2/example directory, type java -jar start.jar
This page suggests how to fix the schema.xml issue: http://stackoverflow.com/questions/15945927/apache-nutch-and-solr-integration
Please comment it out like so in schema.xml:
<!-- <filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>-->
Also, ignore the error if it complains about an undefined field "text".
Access: http://localhost:8983/solr/
First try running ./bin/nutch solrindex with no arguments. You should get back:
Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
No IndexWriters activated - check your configuration
(The "No IndexWriters activated" message is expected if you deleted indexer-solr from plugin.includes earlier; add it back before indexing into Solr.)
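Once indexer-solr is back in plugin.includes, the real invocation follows that usage line; a sketch against the local Solr instance from above (depending on the Nutch version, the Solr URL goes either as the first argument or via -D solr.server.url=...):

./bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/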