incubator-wayang

Feature/spark dataframes

Open novatechflow opened this issue 1 month ago • 0 comments

Summary

  • Add Spark Dataset/DataFrame plumbing: Parquet source/sink flag, channel conversions, optimizer cost hints.
  • Document how to build dataset-backed pipelines (README.md, guides/spark-datasets.md).

Next steps / follow-ups

  • ML4All pipelines still emit and consume raw double[]/Double RDDs. They should be extended to use DatasetChannels once schema handling is in place.
  • Text/Object sources currently produce RDD channels only. A Record-backed variant (or a conversion helper) would allow Dataset output without extra user code.
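
To make the "conversion helper" follow-up more concrete, here is a minimal sketch in plain Java of what turning text lines into record-shaped rows could look like. Everything in it is an assumption for illustration: `SimpleRecord` and `TextToRecords` are hypothetical names, not Wayang's `Record` class or any existing Wayang or Spark API, and the delimiter-based splitting is only one possible way to derive fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a record type: an ordered list of field values.
// This is NOT Wayang's Record class; it only mirrors the general idea.
final class SimpleRecord {
    private final Object[] values;

    SimpleRecord(Object... values) {
        // Defensive copy so later changes to the caller's array do not leak in.
        this.values = values.clone();
    }

    Object getField(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }

    @Override
    public String toString() {
        return "SimpleRecord" + Arrays.toString(values);
    }
}

// Hypothetical conversion helper: splits each text line into fields and
// wraps them in a SimpleRecord, so downstream code sees row-shaped data.
final class TextToRecords {
    private TextToRecords() {}

    static List<SimpleRecord> fromLines(List<String> lines, String delimiter) {
        // Quote the delimiter so characters like "|" or "." are taken literally.
        Pattern splitter = Pattern.compile(Pattern.quote(delimiter));
        List<SimpleRecord> records = new ArrayList<>(lines.size());
        for (String line : lines) {
            // Limit -1 keeps trailing empty fields, so rows keep a stable width.
            Object[] fields = splitter.split(line, -1);
            records.add(new SimpleRecord(fields));
        }
        return records;
    }
}
```

A caller would invoke `TextToRecords.fromLines(lines, ",")` on the output of a text source. In an actual implementation the resulting rows would still need a schema before they could back a Dataset channel, which is the schema-handling dependency noted above.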

novatechflow • Dec 15 '25 15:12