elixir-web-and-data
A notepad on big data, Elixir, and HTTP
Common points for a distributed architecture
- Open source tools
- Elixir is the future of data handling and distribution on the Web
- Lambda or pseudo-lambda architecture, or some variant of it
- Close interaction between Elixir and Python tools and libraries (through a Web interface or otherwise)
- Low barriers in terms of learning curve and computation costs
- Data-scientist friendly: easy for small datasets, with complexity growing only as analytic tools (map-reduce) are needed
- High scalability provided by Erlang: from small focused datasets (data and aggregates) to cloud deployments, using microservice clusters and Kubernetes
Links
Hadoop
- HttpFS (REST API for HDFS)
- C API libhdfs could be wrapped with an Elixir NIF
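Since HttpFS speaks the WebHDFS REST protocol, any HTTP client (Elixir or Python) can reach HDFS by building plain URLs. The following is a minimal sketch in Python, assuming the default HttpFS port 14000 and the `user.name` query parameter used for simple authentication; the helper name `httpfs_url` and the host and path in the usage note are hypothetical.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style URL for an HttpFS gateway.

    Assumption: the gateway listens on the default HttpFS port (14000)
    and uses simple (pseudo) authentication via `user.name`.
    """
    query = {"op": op, "user.name": user}
    query.update(params)
    # WebHDFS paths live under the /webhdfs/v1 prefix.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"
```

For example, `httpfs_url("example.org", "/data/file.csv", "OPEN", "alice")` builds the URL that a `GET` request would use to read the file.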
DOA
Apache Spark [2] - Java or Python interface
Apache Storm [1] - JVM + Clojure DSL; uses Apache Thrift, so non-JVM interfaces are feasible
Apache Samza [2] - JVM + Clojure DSL [1]; non-JVM support is on the roadmap [1]
Apache Flink - JVM-only; example of Clojure implementation [1]
Messaging and queues
Apache Kafka is a high-throughput distributed messaging system.
Databases
- VoltDB: in-memory database, with alpha-stage drivers for Erlang
- CockroachDB: distributed SQL database built on top of a transactional and consistent key-value store
Highly Distributed File Systems for data
- HDFS (used by Hadoop)
- Hierarchical Data Format or HDF5 (used by many research facilities). It's a standard for working with data spanning many machines
- netCDF4 (a standard for exchanging array data, built on HDF5)
- DiscoFS
Querying directly on FS
map and reduce
Disco is a distributed map-reduce and big-data framework that is similar to Hadoop. It is written in Erlang, but exposes a Python interface; I have not found any Erlang API docs, but it should be feasible to create an Elixir library from it.
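To make the map and reduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Disco's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only illustrate the three steps that a framework such as Disco or Hadoop distributes across nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold the values of each key into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_count(lines):
    """Classic word count expressed as map, shuffle, and reduce."""
    pairs = map_phase(lines, lambda line: ((w.lower(), 1) for w in line.split()))
    return reduce_phase(shuffle(pairs), lambda a, b: a + b)
```

In a distributed setting the map and reduce phases run on different workers and the shuffle moves data between them; here everything stays in one process to keep the pattern visible.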
Experiences and blogposts
Performance
- Benchmark post of this from some engineers at Yahoo
- Software Engineering Daily podcast talking about the results
Possible applications
Geodata
-
Defining a microservices architecture to serve data using standard interfaces (REST)
Each microservice can handle a few variables indexed by geopoint, together with their aggregates
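The core of one such service could be sketched as follows, in Python for brevity. This is only an illustration under assumptions: the class name `GeoVariableStore`, the choice of aggregates, and the use of exact `(lat, lon)` pairs as keys are all hypothetical, and a real service would sit behind a REST interface and likely snap coordinates to a grid.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class GeoVariableStore:
    """Holds the readings of one variable, indexed by geopoint."""

    name: str
    readings: dict = field(default_factory=dict)

    def put(self, lat, lon, value):
        # Assumption: exact (lat, lon) pairs are used as keys.
        self.readings.setdefault((lat, lon), []).append(value)

    def aggregates(self, lat, lon):
        """Return simple aggregates for one geopoint, or None if it has no data."""
        values = self.readings.get((lat, lon), [])
        if not values:
            return None
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
```

A service for a variable such as temperature would call `put` on writes and serve `aggregates` on reads, keeping each service small and focused on a few variables.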
Architecture design
Example for data pipeline
- Hypothesis:
  - low-cost implementation
  - querying the FS directly
  - heavy reads
  - limited writes
  - two 'kinds' of services in parallel: live and archive (pseudo-lambda)
- Data pipeline:
  - Stack reads/writes on live data
  - Stack archives only every x seconds (a 'tick')
  - Stack runs analytics (aggregate analysis) by reading archive data from the FS with a delay of x seconds (supported by caching)
- Architecture: ... ... ...
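The data pipeline above can be sketched as a single class, written in Python for brevity. Everything here is an assumption made for illustration: the class name `PseudoLambdaStack`, JSON files as the archive format, flushing live data on each tick, and a mean as the only aggregate. The clock is passed in explicitly so the tick logic stays easy to follow.

```python
import json
from pathlib import Path


class PseudoLambdaStack:
    """Live reads/writes in memory, archive to the FS on every tick."""

    def __init__(self, archive_dir, tick_seconds):
        self.archive_dir = Path(archive_dir)
        self.tick_seconds = tick_seconds
        self.live = []        # live data, served directly
        self.last_tick = 0.0  # time of the last archive run
        self._cache = {}      # archive files already read from the FS

    def write(self, value):
        self.live.append(value)

    def read_live(self):
        return list(self.live)

    def maybe_archive(self, now):
        """Archive live data only if a full tick has elapsed."""
        if now - self.last_tick < self.tick_seconds or not self.live:
            return None
        path = self.archive_dir / f"tick-{int(now)}.json"
        path.write_text(json.dumps(self.live))
        # Assumption: archived data leaves the live set.
        self.live = []
        self.last_tick = now
        return path

    def archive_mean(self):
        """Analytics over archived data only, so results lag by up to one tick."""
        values = []
        for path in sorted(self.archive_dir.glob("tick-*.json")):
            if path not in self._cache:
                # Each archive file is read from the FS once, then cached.
                self._cache[path] = json.loads(path.read_text())
            values.extend(self._cache[path])
        return sum(values) / len(values) if values else None
```

Because analytics never touch the live list, heavy reads hit the cached archive files while the limited writes stay cheap, which matches the hypotheses listed above.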