distributed-dataset
                                
A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
- An example: /examples/gh/Main.hs
- API documentation: https://utdemir.github.io/distributed-dataset/
- Introduction blogpost: https://utdemir.com/posts/ann-distributed-dataset.html
Packages
distributed-dataset
This package provides a Dataset type which lets you express and execute
transformations on a distributed multiset. Its API is highly inspired
by Apache Spark.
It uses pluggable Backends for spawning executors and ShuffleStores
for exchanging information. See 'distributed-dataset-aws' for an
implementation using AWS Lambda and S3.
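To make the idea concrete, here is a minimal local sketch of a partitioned multiset with a per-partition map and a keyed count. It does not use the distributed-dataset API: `Partitioned`, `partition`, `mapP` and `countByKey` are hypothetical names, and everything runs in a single process. The merge of partial results stands in for the step where a ShuffleStore would carry data between executors.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical local model: a dataset is a list of partitions,
-- each partition is a list of rows. Row order carries no meaning.
type Partitioned a = [[a]]

-- Split a list into n round-robin partitions (n must be >= 1).
partition :: Int -> [a] -> Partitioned a
partition n xs =
  [ [x | (i, x) <- zip [0 ..] xs, i `mod` n == p] | p <- [0 .. n - 1] ]

-- A map touches each partition independently; no data is exchanged.
mapP :: (a -> b) -> Partitioned a -> Partitioned b
mapP f = map (map f)

-- Count rows per key: build a partial count inside each partition,
-- then merge the partials. In a distributed setting the merge is
-- where data would cross executor boundaries.
countByKey :: Ord k => Partitioned k -> Map.Map k Int
countByKey parts = Map.unionsWith (+) (map partial parts)
  where
    partial = foldl' (\m k -> Map.insertWith (+) k 1 m) Map.empty

main :: IO ()
main = do
  let events =
        [ ("push", "alice"), ("fork", "bob"), ("push", "carol")
        , ("watch", "dave"), ("push", "erin"), ("fork", "frank") ]
      counts = countByKey (mapP fst (partition 3 events))
  print (Map.toList counts)
```

The result does not depend on how many partitions are used, which is the property that lets the real framework distribute the work.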
It also exposes a more primitive Control.Distributed.Fork
module which lets you run IO actions remotely. It
is especially useful when your task is embarrassingly
parallel.
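As a rough local analogy for the embarrassingly parallel case, the sketch below runs a list of independent IO actions concurrently and collects their results in order. It uses threads from `base` rather than remote executors, and `runAll` is a hypothetical helper, not a function from Control.Distributed.Fork.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Run every action in its own thread and return the results in input
-- order. Local stand-in only: threads here play the role that remote
-- executors would play. An action that throws leaves its MVar empty,
-- so this sketch assumes the actions do not fail.
runAll :: [IO a] -> IO [a]
runAll actions = do
  vars <- mapM spawn actions
  mapM takeMVar vars
  where
    spawn act = do
      var <- newEmptyMVar
      _ <- forkIO (act >>= putMVar var)
      pure var

main :: IO ()
main = do
  results <- runAll [pure (sum [1 .. n]) | n <- [10, 100, 1000 :: Int]]
  print results
```

The tasks share no state and need no communication with each other, which is what makes this shape of workload a good fit for running each action on a separate machine.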
distributed-dataset-aws
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
distributed-dataset-opendatasets
Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
Running the example
- Clone the repository:
    $ git clone https://github.com/utdemir/distributed-dataset
    $ cd distributed-dataset
- Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:
    $ aws configure
- Create an S3 bucket to hold the deployment artifact. You can use the console or the CLI:
    $ aws s3api create-bucket --bucket my-s3-bucket
- Build and run the example:
  - If you use Nix on Linux:
    - (Recommended) Use my binary cache on Cachix to reduce compilation times:
        $ nix-env -i cachix # or your preferred installation method
        $ cachix use utdemir
    - Then:
        $ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
  - If you use stack (requires Docker, works on Linux and macOS):
      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
Stability
Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.
Contributing
I am open to contributions; any issue, PR or opinion is more than welcome.
- In order to develop distributed-dataset, you can use:
  - On Linux: Nix, cabal-install or stack.
  - On macOS: stack with docker.
- Use ormolu to format source code.
Nix
- You can use my binary cache on Cachix so that you don't recompile half of Hackage.
- nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
- The easiest way to get a development environment is to run sos at the top level directory inside of a nix-shell.
Stack
- Make sure that you have Docker installed.
- Use stack as usual; it will automatically use a Docker image.
- Run ./make.sh stack-build before you send a PR to test different resolvers.
Related Work
Papers
- Towards Haskell in the Cloud by Jeff Epstein, Andrew P. Black, Simon L. Peyton Jones
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.
Projects
- Apache Spark.
- Sparkle: Run Haskell on top of Apache Spark.
- HSpark: Another attempt at porting Apache Spark to Haskell.