distributed-dataset

A distributed data processing framework in Haskell.

Results 19 distributed-dataset issues

This is an umbrella issue to gather useful datasets that can be used freely. Things to consider:

* We should be able to access them freely & quickly. It helps...

opendatasets

k8s seems to be everywhere, and many major cloud providers support it. We can simply package the binary as a Docker container (with an added advantage that it doesn't require...

new feature

* Run read and write ends of the conduits concurrently.
* When reading a `Partition` created by `` operations, consume the smaller partitions in parallel.

performance
good first issue
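
To illustrate the first item of the performance issue above, here is a minimal sketch of running a read end and a write end concurrently. It uses only `base` and is not the distributed-dataset implementation: `producer`, `consumer` and `pipeTwoEnds` are hypothetical names, and a one-slot `MVar` stands in for whatever buffer the real conduits would share.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Read end (hypothetical): push every item into the shared slot,
-- then push Nothing to signal end-of-stream.
producer :: MVar (Maybe a) -> [a] -> IO ()
producer slot xs = do
  mapM_ (putMVar slot . Just) xs
  putMVar slot Nothing

-- Write end (hypothetical): drain the slot until end-of-stream,
-- applying the write action to each item and collecting the results.
consumer :: MVar (Maybe a) -> (a -> IO b) -> IO [b]
consumer slot write = go []
  where
    go acc = do
      next <- takeMVar slot
      case next of
        Nothing -> pure (reverse acc)
        Just x -> do
          y <- write x
          go (y : acc)

-- Run the read end on its own thread while the write end runs on the
-- calling thread. The one-slot MVar lets the producer prepare the next
-- item while the consumer is still handling the current one.
pipeTwoEnds :: [a] -> (a -> IO b) -> IO [b]
pipeTwoEnds xs write = do
  slot <- newEmptyMVar
  _ <- forkIO (producer slot xs)
  consumer slot write

main :: IO ()
main = do
  out <- pipeTwoEnds [1 .. 5 :: Int] (pure . (* 2))
  print out
```

This sketch ignores error handling: if the producer thread throws, the consumer is left waiting on the slot. A real implementation would likely want a larger bounded buffer and exception propagation between the two threads.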

Parquet is a commonly used data format, but sadly the Haskell ecosystem lacks a mature library for it. If we implement a library to encode/decode Parquet files, we can both use it...

new feature

This is a WIP PR. Currently https://github.com/yigitozkavci/parquet-hs should be pulled locally because parquet-hs hasn't been uploaded to Hackage yet. Running the example using Nix: ``` # While inside distributed-dataset directory...

Would it be easy to build a Google Cloud Functions backend? (I haven't looked at the code to see how you did it for AWS Lambda, though.)

new feature

YARN is the most common way to schedule Spark & Hadoop on a cluster. Supporting it as an executor will enable us to run side-by-side with existing data processing pipelines.

new feature

Currently, we expect users to write a `Conduit` to read data from external sources. This is quite easy; however, it would be even better to provide some combinators to use...

usability
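
For context on the usability issue above, this is roughly what a hand-written source looks like with the `conduit` package alone. It is a sketch under assumptions, not part of the distributed-dataset API: `fileLines` is a hypothetical helper and `example-input.txt` is a scratch file created only for the demonstration.

```haskell
import Conduit
import qualified Data.ByteString.Char8 as BS8

-- Hypothetical helper: stream a local file line by line.
-- sourceFile yields raw chunks; linesUnboundedAsciiC regroups them
-- into one ByteString per line.
fileLines :: MonadResource m => FilePath -> ConduitT () BS8.ByteString m ()
fileLines path = sourceFile path .| linesUnboundedAsciiC

main :: IO ()
main = do
  -- Scratch input so the example is self-contained.
  writeFile "example-input.txt" "alpha\nbeta\ngamma\n"
  ls <- runConduitRes (fileLines "example-input.txt" .| sinkList)
  mapM_ BS8.putStrLn ls
```

A ready-made combinator of this shape is presumably the kind of convenience the issue asks for, so that users do not have to assemble the pipeline themselves for common sources.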

Currently the CI only tests Nix builds. We should also test stack and cabal. We can migrate away from Travis while doing this.

code-quality