distributed-dataset
A distributed data processing framework in Haskell.
This is an umbrella issue to gather useful datasets that can be used freely. Things to consider:

* We should be able to access them freely & quickly. It helps...
k8s seems to be everywhere, and many major cloud providers support it. We can simply package the binary as a Docker container (with the added advantage that it doesn't require...
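As a rough sketch of the packaging step (the base image and binary name here are assumptions for illustration, not what the project actually uses):

```dockerfile
# Hypothetical sketch: ship a statically linked executor binary in a
# minimal image, so a Kubernetes pod can run it with no extra setup.
FROM alpine:3.19
# "distributed-dataset-executor" is an invented binary name.
COPY distributed-dataset-executor /usr/local/bin/executor
ENTRYPOINT ["/usr/local/bin/executor"]
```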
* Run the read and write ends of the conduits concurrently.
* When reading a `Partition` created by `` operations, consume the smaller partitions in parallel.
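The second idea can be sketched as follows. This is only an illustration: the `PartitionReader` type and `readPartitionsConcurrently` function are invented stand-ins, not the library's real API, and a real implementation would stream rather than collect whole partitions into lists.

```haskell
-- Hypothetical sketch: read several small partitions concurrently
-- instead of one after another, using the async package.
import Control.Concurrent.Async (mapConcurrently)

-- Assumed stand-in for an action that reads one partition's rows.
type PartitionReader a = IO [a]

-- Consume every partition in parallel; mapConcurrently preserves the
-- order of results, so the concatenation is deterministic.
readPartitionsConcurrently :: [PartitionReader a] -> IO [a]
readPartitionsConcurrently readers = concat <$> mapConcurrently id readers

main :: IO ()
main = do
  rows <- readPartitionsConcurrently
            [pure [1, 2 :: Int], pure [3], pure [4, 5]]
  print rows
```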
Parquet is a commonly used data format, but sadly the Haskell ecosystem lacks a mature library for it. If we implement a library to encode/decode Parquet files, we can both use it...
This is a WIP PR. Currently, https://github.com/yigitozkavci/parquet-hs needs to be pulled locally, since parquet-hs hasn't been uploaded to Hackage yet. Running the example using Nix: ``` # While inside the distributed-dataset directory...
Would it be easy to build a Google Cloud Functions backend? (I haven't looked at how you've done it for AWS Lambda, though.)
YARN is the most common way to schedule Spark & Hadoop jobs on a cluster. Supporting it as an executor would let us run side-by-side with existing data processing pipelines.
Currently, we expect users to write a `Conduit` to read data from external sources. This is quite easy; however, it would be even better to provide some combinators to use...
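To illustrate the kind of combinator the issue has in mind, here is a sketch. The name `dLines` is invented, and the input is an in-memory list of chunks to keep the example self-contained; a real combinator would wrap an actual external source and handle chunk boundaries properly.

```haskell
-- Hypothetical sketch of a source combinator that hides the raw
-- Conduit plumbing: split an upstream stream of text chunks into lines.
import Conduit

-- Invented combinator name; simplification: each chunk is split on its
-- own, without rejoining lines broken across chunk boundaries.
dLines :: Monad m => [String] -> ConduitT () String m ()
dLines chunks = yieldMany chunks .| concatMapC lines

main :: IO ()
main = do
  ls <- runConduit (dLines ["a\nb", "c"] .| sinkList)
  print ls
```

With a small family of such combinators (files, S3 objects, HTTP responses), users could compose sources without writing the `Conduit` themselves.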
Currently the CI only tests Nix builds. We should also test stack and cabal builds. We can migrate away from Travis while doing this.
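One possible shape for this, assuming a move to GitHub Actions (the workflow below is a sketch; job names, action versions, and build commands are assumptions, not an agreed-on setup):

```yaml
# Hypothetical CI sketch: build with stack and cabal alongside the
# existing Nix build.
name: ci
on: [push, pull_request]
jobs:
  stack:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: haskell-actions/setup@v2
        with:
          enable-stack: true
      - run: stack build --test
  cabal:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: haskell-actions/setup@v2
      - run: cabal update && cabal build all && cabal test all
```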