spookystuff
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
The latest documentation has moved to:
http://tribbloid.github.io/spookystuff/
SpookyStuff
... is a scalable query engine for web scraping/data integration/acceptance QA. The goal is to allow the Web to be queried and ETL'ed like a relational database.
SpookyStuff is the fastest big data collection engine in history, with a speed record of querying 330,404 dynamic pages per hour on 300 cores.
Build Status
branch \ profile | scala-2.11 | scala-2.12 |
---|---|---|
master | |
SpookyStuff-UAV (alpha component)
... allows the same engine to be used to control a swarm of aerial robots for photogrammetry and data acquisition. It is still a work in progress; please refer to this proposal for an overview of its features and implementation.
Build Status
branch \ profile | scala-2.11 | scala-2.12 |
---|---|---|
master | - |
Powered by
component | projects |
---|---|
Core | Apache Spark, Apache Maven, JSoup, Apache Tika |
Web | YourKit Java Profiler, PhantomJS/GhostDriver, Selenium |
UAV | MAVLink |
License
Copyright © 2014 by Peng Cheng @tribbloid, Sandeep Singh @techaddict, Terry Lin @ithinkicancode, Long Yao @l2yao and contributors.
Published under the ASF License; see LICENSE.