elixir-web-and-data

A notepad on Big-data Elixir and HTTP

Common points for a distributed architecture

  • Open source tools
  • Elixir is the future of data handling and distribution on the Web
  • Lambda or pseudo-lambda architecture, or whatever fits
  • A close interaction between Elixir and Python tools and libraries (via a Web interface or not)
  • Low barriers in terms of learning curve and computation costs
  • Data-scientist friendly: easy for small datasets, with complexity growing as analytic tools (map-reduce) come in
  • High scalability guaranteed by Erlang: from small focused datasets (data and aggregates) to cloud deployments, using microservices clusters and Kubernetes

Links

Hadoop

DOA

Apache Spark - Java or Python interface
Apache Storm - JVM + Clojure DSL; uses Apache Thrift, so non-JVM interfaces are feasible
Apache Samza - JVM + Clojure DSL; non-JVM support is on the roadmap
Apache Flink - JVM-only; example of Clojure implementation

Messaging and queues

Apache Kafka is a high-throughput distributed messaging system.

  • KafkaEx: an Elixir client library for Kafka that talks to it over a binary interface
  • RethinkDB

Databases

  • VoltDB: in-memory; alpha-stage drivers for Erlang
  • CockroachDB: distributed SQL database built on top of a transactional and consistent key-value store

Highly Distributed File Systems for data

Querying directly on FS

map and reduce

Disco is a distributed map-reduce and big-data framework that is similar to Hadoop. It is written in Erlang, but exposes a Python interface; I have not found any Erlang API docs, but it should be feasible to create an Elixir library from it.
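To make the map-reduce idea concrete, here is a minimal single-process sketch in Python. It is not Disco's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only illustrate the three steps that a framework like Disco or Hadoop distributes across nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: fold the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. In a distributed setting the map and reduce steps would run on different workers, with the shuffle moving data between them.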

Experiences and blogposts

Performance

Possible applications

Geodata

Architecture design

Example for data pipeline

  • Hypothesis:
      • low-cost implementation
      • querying the FS directly
      • heavy reads
      • limited writes
      • two 'kinds' of services in parallel: live and archive (pseudo-lambda)
  • Data pipeline:
      • Stack reads/writes on live data
      • Stack archives only every x seconds (a 'tick')
      • Stack runs analytics (aggregate analysis) by reading archive data from the FS with a delay of x seconds (supported by caching)
  • Architecture: ... ... ...
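The data pipeline described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: the class name `PseudoLambdaStore`, the JSON-lines archive format, the `tick-NNNNNN.jsonl` file naming and the `demo` helper are all hypothetical, and the tick is triggered by hand rather than by a timer.

```python
import json
import tempfile
from pathlib import Path


class PseudoLambdaStore:
    """Live records in memory; archived records as JSON-lines files on the FS."""

    def __init__(self, archive_dir):
        self.archive_dir = Path(archive_dir)
        self.archive_dir.mkdir(parents=True, exist_ok=True)
        self.live = []        # records written since the last tick
        self.tick_count = 0   # number of archive files written so far
        self._cache = {}      # aggregate cache, keyed by (field, tick_count)

    def write(self, record):
        """Live path: limited writes go to memory only."""
        self.live.append(record)

    def read_live(self):
        """Live path: reads see everything since the last tick."""
        return list(self.live)

    def tick(self):
        """Archive path: flush live records to a new file (every x seconds)."""
        if not self.live:
            return None
        self.tick_count += 1
        path = self.archive_dir / f"tick-{self.tick_count:06d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in self.live:
                fh.write(json.dumps(record) + "\n")
        self.live = []
        return path

    def archive_total(self, field):
        """Analytics: sum a numeric field by reading archive files directly.

        The result is cached until the next tick writes a new file.
        """
        key = (field, self.tick_count)
        if key not in self._cache:
            total = 0
            for path in sorted(self.archive_dir.glob("tick-*.jsonl")):
                with path.open(encoding="utf-8") as fh:
                    for line in fh:
                        total += json.loads(line)[field]
            self._cache[key] = total
        return self._cache[key]


def demo():
    """Write two records, archive them, then write one more live record."""
    with tempfile.TemporaryDirectory() as tmp:
        store = PseudoLambdaStore(tmp)
        store.write({"sensor": "a", "value": 3})
        store.write({"sensor": "b", "value": 4})
        store.tick()
        store.write({"sensor": "a", "value": 10})
        return store.read_live(), store.archive_total("value")
```

In `demo`, the analytics result covers only the two archived records, while the third record is visible only on the live side until the next tick. That lag is the x-second delay mentioned in the pipeline.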