ewrt
extensible Web Retrieval Toolkit (eWRT)
The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API which
- offers a unified interface for retrieving social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia,
- includes various helper classes for effective caching and data management,
- provides components for low-level natural language processing, such as language detection, phonetic string similarity measures, and methods for string normalization.
Quickstart:
Adjust eWRT/src/siteconfig.py-sample to your setting and save it to ~/.eWRT/siteconfig.py (user-specific settings) and/or /etc/eWRT/siteconfig.py (system-wide settings).
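The copy step can also be scripted. The following is a minimal sketch, not part of eWRT: the helper name `install_siteconfig` is hypothetical, and it assumes the script is run from the directory that contains the eWRT checkout. It copies the sample to the user-specific location and leaves an existing configuration untouched.

```python
import shutil
from pathlib import Path


def install_siteconfig(sample: Path, target_dir: Path) -> Path:
    """Copy the sample configuration to target_dir/siteconfig.py.

    Hypothetical helper (not part of eWRT). An existing siteconfig.py
    is kept as-is so that local edits are not overwritten.
    """
    target = target_dir / "siteconfig.py"
    if not target.exists():
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(sample, target)
    return target


if __name__ == "__main__":
    # Paths taken from the Quickstart; the relative path is an assumption
    # about the current working directory.
    sample = Path("eWRT/src/siteconfig.py-sample")
    if sample.exists():
        print(install_siteconfig(sample, Path.home() / ".eWRT"))
    else:
        print(f"sample configuration not found at {sample}")
```

The copied file still has to be edited by hand to match your setting.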
Packages:
- eWRT.access - file, Web and database access
  - db - database access
  - file - file access
  - http - access web resources supporting authentication (basic, digest), compression, etc.
  - javascript - control Firefox to extract AJAX pages
- eWRT.input - input and cleanup modules
  - clean - clean and normalize text phrases
  - conv - convert doc, html and pdf files to text documents; convert XCL to rdf
  - corpus - input readers for the Reuters and BBC corpus
  - csv - read and analyze csv files
  - stock - stock quotes
- eWRT.ontology - tools for comparing, evaluating and visualizing ontologies
  - compare - compare ontology nodes, relations, and relation types
  - eval - determine the coherence of ontology nodes
  - visualize - visualize ontologies
- eWRT.stat - the eWRT statistics packages
  - coherence - compute the coherence between terms (Dice, PMI)
  - metrics - evaluation metrics (precision, recall, F1)
  - language - simple language detection
  - string - word (Levenshtein, Damerau-Levenshtein, Soundex, ...) and document (Vector Space Model) similarity metrics
- eWRT.util - utility classes for transparent caching, logging, monitoring, etc.
  - advLogging - log to SNMP handler
  - assert - assertion based counters (decorators)
  - async - asynchronous procedure calls (experimental)
  - cache - transparent memory and disk caching of function calls (decorators)
  - exception - SNMP exception handling
  - loggerProfile - simplified logging
  - module_path - compute relative paths
  - monitoring - support for Nagios NSCA services
  - pickleIterator - iterate over objects stored in pickle files
  - profile - python profiling
  - timing - time python methods (decorators)
- eWRT.visualize - eWRT visualization library
- eWRT.ws - Web service access (REST, Amazon, Flickr, Facebook, ...)
  - amazon
  - conceptnet
  - delicious
  - facebook
  - flickr
  - geonames
  - google
  - googlealerts
  - googletrends
  - linkedin
  - opencalais
  - rest - efficiently access/publish REST services
  - rss
  - technorati
  - twitter
  - wikidata
  - wikipedia
  - wordnet
  - wot
  - yahoo
  - youtube
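
To illustrate what "transparent memory caching of function calls (decorators)" in eWRT.util.cache refers to, here is a generic sketch of the technique. It does not use eWRT's actual API: the names `memory_cached` and `slow_lookup` are hypothetical, and the sketch covers only in-memory caching, not disk caching.

```python
import functools
import pickle


def memory_cached(func):
    """Cache the results of func in memory, keyed by its pickled arguments.

    Hypothetical decorator for illustration; not the eWRT implementation.
    """
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Pickling the arguments yields a hashable key even for lists or dicts.
        key = pickle.dumps((args, sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return wrapper


calls = []  # records how often the wrapped function really runs


@memory_cached
def slow_lookup(term):
    """Stand-in for an expensive call such as a Web request."""
    calls.append(term)
    return term.upper()


if __name__ == "__main__":
    slow_lookup("ontology")
    slow_lookup("ontology")  # served from the cache
    print(len(calls))
```

The caller uses `slow_lookup` like any other function; the decorator decides whether the underlying call is executed or answered from the cache.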
Requirements:
- python-libraries:
  - facebook api - http://code.google.com/p/pyfacebook/
  - google-trends api - http://github.com/suryasev/unofficial-google-trends-api/tree/master
  - oauth - http://oauth.googlecode.com/
  - simplejson - http://pypi.python.org/pypi/simplejson/
  - tango - http://tango.ryanmcgrath.org/
  - python-rdflib
  - python-nltk
  - python-feedparser (eWRT.ws.rss)
  - pywikibot (eWRT.ws.wikidata)
- text conversion (eWRT.input.conv):
  - lynx
  - pdftotext (poppler-utils)
  - antiword
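
Whether the external text conversion tools are installed can be checked before using eWRT.input.conv. The sketch below is not part of eWRT; the helper name `missing_tools` is hypothetical, and it assumes the tools are invoked under the executable names listed above and found via the PATH.

```python
import shutil

# Executable names taken from the requirements list above.
CONVERTERS = ("lynx", "pdftotext", "antiword")


def missing_tools(tools=CONVERTERS):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing text conversion tools: " + ", ".join(missing))
    else:
        print("all text conversion tools found")
```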
