
Crawl and prepare NSF ACADIS, NASA AMD and NSIDC Arctic Data Explorer datasets Part 2

Open chrismattmann opened this issue 10 years ago • 44 comments

Building off of https://github.com/NCEAS/open-science-codefest/issues/26, continue the data prep and crawl of AMD, ACADIS, and ADE, with the goal of preparing some of the data for visualization (GeoViz, science-focused viz, etc.).

Participants will use real-world data science tools like Tika (http://tika.apache.org/), Nutch (http://nutch.apache.org/), Solr (http://lucene.apache.org/solr/), and OODT (http://oodt.apache.org/) to crawl and prepare datasets of interesting polar parameters for visualization experts to hack on during a 2-day NSF visualization hackathon in NYC in November. Be part of doing something real, contributing to Apache projects (and earning merit, potentially becoming a committer and PMC member yourself), and also contributing to NSF and NASA goals!

chrismattmann avatar Oct 03 '14 05:10 chrismattmann

Recent progress of Angela:

  1. [Done] Use Apache Nutch and Solr to crawl and index local data files.
  2. [Done] Index content metadata and parse metadata from Apache Nutch into Solr.
  3. [Done] Integrate the Apache OODT File Manager with Apache Solr using RADiX.
  4. [Doing] Crawl the ACADIS website using Apache Nutch and Solr.
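A quick way to verify items 1 and 2 is to query the local Solr directly; this sketch assumes the stock collection1 core from the Solr example setup, so adjust the URL if yours differs:

curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=5&indent=true"

Each returned document should carry the content and parse metadata fields that Nutch pushed over.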

snowangelwmy avatar Nov 01 '14 21:11 snowangelwmy

Thanks @snowangelwmy! Please contact @pzimdars to get your ACADIS Nutch crawler deployed on AWS, OK?

chrismattmann avatar Nov 01 '14 21:11 chrismattmann

Ok, when I am done, I will contact @pzimdars. Thanks.

snowangelwmy avatar Nov 01 '14 22:11 snowangelwmy

Progress of Vineet:

  1. Developed a GRIB parser; progress on the feature is tracked at https://issues.apache.org/jira/browse/TIKA-1423
  2. An initial patch has been published at https://reviews.apache.org/r/27414/. I am working on the suggestions raised by the reviewers.

hemantku avatar Nov 02 '14 02:11 hemantku

Beginning by downloading Nutch.

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

NASA AMD: http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2

NSF ACADIS: https://www.aoncadis.org/home.htm

NSIDC Arctic Data Explorer: http://nsidc.org/acadis/search/

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

Hi @chrismattmann, the regex-urlfilter.txt can be found here:

https://www.dropbox.com/s/hl6wlvwbr4xrv81/regex-urlfilter.txt?dl=0
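For anyone who cannot grab the Dropbox file, a minimal sketch of what such a filter might contain, assuming the three portals listed above are the only crawl targets (the real file may differ):

# accept the three polar data portals, reject everything else
+^https?://([a-z0-9-]+\.)*aoncadis\.org/
+^https?://gcmd\.gsfc\.nasa\.gov/
+^https?://nsidc\.org/acadis/
-.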

lewismc avatar Nov 03 '14 16:11 lewismc

Update properties in conf/nutch-default.xml:

http.agent.name=NSF DataViz Hackathon Crawler
[email protected]
http.agent.host=localhost
http.content.limit=-1
plugin.includes: remove indexer-solr from the list
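For reference, a sketch of how those entries look in Nutch's Hadoop-style XML if you edit the config file by hand (values as above; trim the plugin.includes value so indexer-solr is no longer listed):

<property>
  <name>http.agent.name</name>
  <value>NSF DataViz Hackathon Crawler</value>
</property>
<property>
  <name>http.agent.host</name>
  <value>localhost</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>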

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

./bin/crawl urls/ crawl http://localhost 3
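Before running that, put a seed list under urls/; the seeds below are just an assumption based on the portal links above, and the trailing 3 in the command is the number of crawl rounds:

mkdir -p urls
cat > urls/seed.txt << 'EOF'
http://gcmd.gsfc.nasa.gov/KeywordSearch/Keywords.do?Portal=amd&KeywordPath=Parameters%7CCRYOSPHERE&MetadataType=0&lbnode=mdlb2
https://www.aoncadis.org/home.htm
http://nsidc.org/acadis/search/
EOF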

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

Please make sure your JAVA_HOME environment variable is set.

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

export JAVA_HOME=/usr

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

echo $JAVA_HOME

chrismattmann avatar Nov 03 '14 16:11 chrismattmann

Download Solr:

http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.2
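After downloading, unpack it under $HOME/tmp so the paths line up with the $HOME/tmp/solr-4.10.2/example directory used later (this location is just a convention for the hackathon):

mkdir -p $HOME/tmp
tar xzf solr-4.10.2.tgz -C $HOME/tmp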

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

Download the tika app from:

curl -k -O https://repository.apache.org/service/local/repo_groups/snapshots-group/content/org/apache/tika/tika-app/1.7-SNAPSHOT/tika-app-1.7-20141103.165816-465.jar

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

mkdir -p $HOME/tmp/tika
mv tika-app-1.7-20141103.165816-465.jar $HOME/tmp/tika
alias tika="java -jar $HOME/tmp/tika/tika-app-1.7-20141103.165816-465.jar"

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

tika -m ftp://sidads.colorado.edu/pub/DATASETS/AMSRE/TOOLS/land_mask/Sea_Ice_V003/amsr_gsfc_12n.hdf

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

Content-Length: 547605
Content-Type: application/x-hdf
HDF4_Version: 4.1.3 (NCSA HDF Version 4.1 Release 3, May 1999)
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.hdf.HDFParser
_History: Direct read of HDF4 file through CDM library
resourceName: amsr_gsfc_12n.hdf

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

Try:

tika --help

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

Try this:

tika -m ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G02202_v2/north/daily/2013/seaice_conc_daily_nh_f17_20130102_v02r00.nc

chrismattmann avatar Nov 03 '14 17:11 chrismattmann

Hi Prof @chrismattmann, why do I need to install the standalone tika-app? Nutch already has the parse-tika component.

snowangelwmy avatar Nov 03 '14 18:11 snowangelwmy

@snowangelwmy if you look at the URL @chrismattmann defined, you will see that he referenced a SNAPSHOT. This is so we can use some of the newer features of Tika. Try it out :) Also, we are hacking on Tika at this hackathon, so we are using the development version for parsing .grb files.
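For example, with the tika alias above set up, pointing it at any GRIB file you have locally should exercise the new parser (the path here is just a placeholder):

tika -m /path/to/some_local_file.grb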

lewismc avatar Nov 03 '14 18:11 lewismc

Got it! I have crawled some ACADIS web pages ("numFound": 572). However, all files that have been indexed into my Solr are of type "application/xhtml+xml". I am wondering how to crawl files of other types, e.g., PDF and JPG? Thank you!

snowangelwmy avatar Nov 03 '14 18:11 snowangelwmy

For Solr, please find the Nutch schema here:

curl -O http://svn.apache.org/repos/asf/nutch/trunk/conf/schema.xml
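Then drop it into the example core's conf directory and restart Solr; the collection1 path here assumes the stock Solr 4.10.2 example layout:

cp schema.xml $HOME/tmp/solr-4.10.2/example/solr/collection1/conf/schema.xml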

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

Command for checking the data Nutch crawled:

./bin/nutch readseg -dump ./crawl/segments/20141103100202/ output
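The segment name is the timestamp of your own crawl, so list the segments directory first and substitute yours; the dump ends up in the output directory you name (typically as a file called dump):

ls crawl/segments/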

liwwchina avatar Nov 03 '14 19:11 liwwchina

Check out this wiki page: http://wiki.apache.org/solr/SolrJetty

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

OK, ignore that wiki page. In your $HOME/tmp/solr-4.10.2/example directory, type java -jar start.jar

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

You will find this page that suggests how to fix the schema.xml issue: http://stackoverflow.com/questions/15945927/apache-nutch-and-solr-integration

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

Please comment out like so in schema.xml:

<!--
<filter class="solr.SnowballPorterFilterFactory"
        language="English"
        protected="protwords.txt"/>
-->

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

Also, ignore it if it complains about an undefined field "text".

Access: http://localhost:8983/solr/
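A quick command-line check that Solr is actually up (this uses the standard cores admin API, independent of the schema changes above):

curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"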

chrismattmann avatar Nov 03 '14 19:11 chrismattmann

First, try running ./bin/nutch solrindex. You should get back:

Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
No IndexWriters activated - check your configuration
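That message just means no index writer plugin is active. Once indexer-solr is back in plugin.includes, an invocation along these lines (paths assume the crawl/ directory produced by ./bin/crawl above, and the Solr URL is the usual default for this setup; adjust both as needed) should push the crawl into Solr:

./bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/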

chrismattmann avatar Nov 03 '14 19:11 chrismattmann