distributed-dataset
A distributed data processing framework in Haskell.
This is an umbrella issue to gather useful datasets that can be used freely. Things to consider:

* We should be able to access them freely & quickly. It helps...
k8s seems to be everywhere, and many major cloud providers support it. We can simply package the binary as a Docker container (with the added advantage that it doesn't require...
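As a rough sketch of the packaging step (the base image and binary name here are assumptions for illustration, not what the project actually uses):

```dockerfile
# Hypothetical sketch: ship a statically linked executor binary in a
# minimal image, so a Kubernetes pod can run it with no extra setup.
FROM alpine:3.19
# "distributed-dataset-executor" is an invented binary name.
COPY distributed-dataset-executor /usr/local/bin/executor
ENTRYPOINT ["/usr/local/bin/executor"]
```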
* Run the read and write ends of the conduits concurrently.
* When reading a `Partition` created by `` operations, consume the smaller partitions in parallel.
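The second idea can be sketched as follows. This is only an illustration: the `PartitionReader` type and `readPartitionsConcurrently` function are invented stand-ins, not the library's real API, and a real implementation would stream rather than collect whole partitions into lists.

```haskell
-- Hypothetical sketch: read several small partitions concurrently
-- instead of one after another, using the async package.
import Control.Concurrent.Async (mapConcurrently)

-- Assumed stand-in for an action that reads one partition's rows.
type PartitionReader a = IO [a]

-- Consume every partition in parallel; mapConcurrently preserves the
-- order of results, so the concatenation is deterministic.
readPartitionsConcurrently :: [PartitionReader a] -> IO [a]
readPartitionsConcurrently readers = concat <$> mapConcurrently id readers

main :: IO ()
main = do
  rows <- readPartitionsConcurrently
            [pure [1, 2 :: Int], pure [3], pure [4, 5]]
  print rows
```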
Parquet is a commonly used data format, but sadly the Haskell ecosystem lacks a mature library for it. If we implement a library to encode/decode Parquet files, we can both use it...
This is a WIP PR. Currently, https://github.com/yigitozkavci/parquet-hs needs to be pulled locally, since parquet-hs hasn't been uploaded to Hackage yet. Running the example using Nix: ``` # While inside the distributed-dataset directory...
Would it be easy to build a Google Cloud Functions backend? (I haven't looked at how you've done it for AWS Lambda, though.)
YARN is the most common way to schedule Spark & Hadoop jobs on a cluster. Supporting it as an executor would let us run side-by-side with existing data processing pipelines.
Currently, we expect users to write a `Conduit` to read data from external sources. This is quite easy; however, it would be even better to provide some combinators to use...
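To illustrate the kind of combinator the issue has in mind, here is a sketch. The name `dLines` is invented, and the input is an in-memory list of chunks to keep the example self-contained; a real combinator would wrap an actual external source and handle chunk boundaries properly.

```haskell
-- Hypothetical sketch of a source combinator that hides the raw
-- Conduit plumbing: split an upstream stream of text chunks into lines.
import Conduit

-- Invented combinator name; simplification: each chunk is split on its
-- own, without rejoining lines broken across chunk boundaries.
dLines :: Monad m => [String] -> ConduitT () String m ()
dLines chunks = yieldMany chunks .| concatMapC lines

main :: IO ()
main = do
  ls <- runConduit (dLines ["a\nb", "c"] .| sinkList)
  print ls
```

With a small family of such combinators (files, S3 objects, HTTP responses), users could compose sources without writing the `Conduit` themselves.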
Currently the CI only tests Nix builds. We should also test stack and cabal builds. We can migrate away from Travis while doing this.
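One possible shape for this, assuming a move to GitHub Actions (the workflow below is a sketch; job names, action versions, and build commands are assumptions, not an agreed-on setup):

```yaml
# Hypothetical CI sketch: build with stack and cabal alongside the
# existing Nix build.
name: ci
on: [push, pull_request]
jobs:
  stack:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: haskell-actions/setup@v2
        with:
          enable-stack: true
      - run: stack build --test
  cabal:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: haskell-actions/setup@v2
      - run: cabal update && cabal build all && cabal test all
```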