A notepad on big data, Elixir, and HTTP

Common points for a distributed architecture

Open source tools
Elixir is the future of data handling and distribution on the Web
Lambda or pseudo-lambda architecture, or a similar variant
Close interaction between Elixir and Python tools and libraries (via a Web interface or otherwise)
Low barriers in terms of learning curve and computation costs
Data-scientist friendly: easy for smaller datasets, with complexity growing only for analytic tools (map-reduce)
High scalability provided by the Erlang VM: from small focused datasets (data and aggregates) to cloud deployments, using microservice clusters and Kubernetes

Links

Hadoop

HttpFS (REST API for HDFS)
C API libhdfs could be wrapped with an Elixir NIF
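Because HttpFS speaks plain HTTP, any language with an HTTP client can reach HDFS without a native binding. The sketch below only builds the request URLs and sends nothing. It assumes the WebHDFS-style path prefix `/webhdfs/v1`, the `op` and `user.name` query parameters, and 14000 as the HttpFS port; the host, file paths, and user are hypothetical.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style REST URL as served by HttpFS.

    Assumptions: port 14000 and the /webhdfs/v1 prefix; adjust to the
    actual deployment.
    """
    query = urlencode({"op": op, "user.name": user, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host, paths, and user, for illustration only.
print(httpfs_url("httpfs.example.org", "/data/2016/points.csv", "OPEN", "alice"))
print(httpfs_url("httpfs.example.org", "/data", "LISTSTATUS", "alice"))
```

The same URLs could be requested from Elixir with any HTTP client, which is the appeal of this route compared with wrapping libhdfs in a NIF.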

DOA

Apache Spark - Java or Python interface
Apache Storm - JVM + Clojure DSL; uses Apache Thrift, so non-JVM interfaces are feasible
Apache Samza - JVM + Clojure DSL; non-JVM support is on the roadmap
Apache Flink - JVM-only; example of Clojure implementation

Messaging and queues

Apache Kafka is a high-throughput distributed messaging system.

KafkaEx: an Elixir client library for Kafka that talks to the brokers over Kafka's binary protocol
RethinkDB

Databases

VoltDB: in-memory database; alpha-stage drivers for Erlang
CockroachDB: distributed SQL database built on top of a transactional and consistent key-value store

Highly Distributed File Systems for data

HDFS (used by Hadoop)
Hierarchical Data Format or HDF5 (used by many research facilities). It is a standard for working with data spanning many machines
netCDF-4 (Network Common Data Form; its version 4 format is built on HDF5 and is used to exchange such data)
DiscoFS

Querying directly on FS

map and reduce

Disco

Disco is a distributed map-reduce and big-data framework similar to Hadoop. It is written in Erlang but exposes a Python interface. I have not found any Erlang API docs, but it should be feasible to build an Elixir library on top of it.
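To make the map-reduce model concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to a word count. It is a generic illustration, not Disco's API: every function name below is hypothetical, and a real framework would spread each phase across worker nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record; the mapper yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer, initial):
    """Fold each key's list of values into a single result."""
    return {key: reduce(reducer, values, initial) for key, values in groups.items()}


def map_reduce(records, mapper, reducer, initial):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer, initial)


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


lines = ["Elixir and Erlang", "erlang and Python"]
counts = map_reduce(lines, word_mapper, lambda a, b: a + b, 0)
print(counts)
```

Only the mapper and reducer are job-specific; the grouping step in between is what a distributed framework takes over.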

Experiences and blogposts

Erlang Big-data track

Performance

Possible applications

Geodata

Defining a microservices architecture to serve data using standard interfaces (REST)

Each microservice handles a few variables indexed by geopoint, together with their aggregates
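As a rough illustration of what one such service could hold in memory, the sketch below keeps a single variable indexed by (lat, lon) and computes simple aggregates per point. The class, its method names, and the sample coordinates and readings are all hypothetical; a REST layer would sit in front of it.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class GeoVariable:
    """One variable (for example temperature) indexed by (lat, lon) geopoints."""
    name: str
    values: dict = field(default_factory=dict)  # (lat, lon) -> list of readings

    def add(self, lat, lon, value):
        self.values.setdefault((lat, lon), []).append(value)

    def aggregate(self, lat, lon):
        """Return count/min/max/mean for one geopoint, or None if unknown."""
        readings = self.values.get((lat, lon))
        if not readings:
            return None
        return {
            "count": len(readings),
            "min": min(readings),
            "max": max(readings),
            "mean": mean(readings),
        }

    def within(self, south, west, north, east):
        """List the geopoints that fall inside a bounding box."""
        return sorted(
            p for p in self.values
            if south <= p[0] <= north and west <= p[1] <= east
        )


# Hypothetical sample data, for illustration only.
temperature = GeoVariable("temperature")
temperature.add(45.07, 7.69, 21.0)
temperature.add(45.07, 7.69, 23.0)
temperature.add(41.90, 12.50, 27.5)
print(temperature.aggregate(45.07, 7.69))
print(temperature.within(44.0, 7.0, 46.0, 8.0))
```

Keeping each service to a few variables keeps the index small enough to serve reads from memory, which fits the heavy-read profile assumed in these notes.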

Architecture design

Example for data pipeline

Hypothesis:
** low-cost implementation
** querying the FS directly
** heavy reads
** limited writes
** two 'kinds' of services in parallel: live and archive (pseudo-lambda)
Data pipeline:
** Stack reads/writes on live data
** Stack archives only every x seconds (a 'tick')
** Stack runs analytics (aggregate analysis) by reading archive data from the FS with a delay of x seconds (supported by caching)
Architecture: ... ... ...
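The live/archive split with a periodic tick can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the intended architecture: the class name, the one-JSON-file-per-tick layout, and the explicit `now` argument (used instead of a real clock so the behaviour is reproducible) are all hypothetical, and caching is left out.

```python
import json
import tempfile
from pathlib import Path


class PseudoLambdaStore:
    """Live buffer in memory, archive on the file system, flushed on a tick."""

    def __init__(self, archive_dir, tick_seconds):
        self.archive_dir = Path(archive_dir)
        self.tick_seconds = tick_seconds
        self.live = []        # records written since the last tick
        self.last_tick = 0.0

    def write(self, record, now):
        """Writes always land in the live buffer; a tick flushes it."""
        self.live.append(record)
        if now - self.last_tick >= self.tick_seconds:
            self.archive(now)

    def archive(self, now):
        """Flush live records to one file per tick and reset the buffer."""
        path = self.archive_dir / f"tick-{int(now)}.json"
        path.write_text(json.dumps(self.live))
        self.live = []
        self.last_tick = now

    def read_live(self):
        return list(self.live)

    def analytics(self):
        """Aggregate archived files only, so results lag by up to one tick."""
        values = []
        for path in sorted(self.archive_dir.glob("tick-*.json")):
            values.extend(r["value"] for r in json.loads(path.read_text()))
        return {"count": len(values), "sum": sum(values)}


with tempfile.TemporaryDirectory() as tmp:
    store = PseudoLambdaStore(tmp, tick_seconds=10)
    for now, value in [(1, 1), (5, 2), (12, 3), (15, 4)]:
        store.write({"value": value}, now=now)
    print(store.read_live())   # records still waiting for the next tick
    print(store.analytics())   # aggregates over archived records only
```

Analytics never touch the live buffer, which is what keeps the write path cheap; the price is that aggregates are stale by up to one tick.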
