Add batch source for reading in file types (Parquet, JSON, CSV)

Open boetro opened this issue 2 years ago • 0 comments

I think this in particular is just a batch source, because you can't really stream in a file. We do have a similar streaming source for GCS files, but that is more of a GCS bucket watcher, whereas this is just: read a file and do something with the contents.

A good example is probably BigQuery, although that's a little more complicated: https://github.com/launchflow/buildflow/blob/main/buildflow/runtime/ray_io/bigquery_io.py#L30

The basics of a source are:

A dataclass that defines the inputs of the source. Extends from this class: https://github.com/launchflow/buildflow/blob/main/buildflow/api/io.py#L16

It has a few methods to override:

actor -> this must be implemented and returns the source actor that does the Ray parallelization (discussed below)
setup -> this does any setup work needed to read in the source (for BigQuery this sets up clients and whatnot)
preprocess -> this does any preprocessing before sending to the user's process method (I'm guessing you won't need this one)
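
As a rough sketch of what the dataclass side could look like for a file source: `SourceBase` below is a hypothetical stand-in for the class in `buildflow/api/io.py` (the real base class and its method signatures may differ), and `FileSource`, `FileSourceActorPlaceholder`, `file_path`, `file_format`, and `SUPPORTED_FORMATS` are names made up for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Assumption: the three formats named in the issue title.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


class SourceBase:
    """Hypothetical stand-in for the base class in buildflow/api/io.py."""

    def actor(self) -> Any:
        # Must be overridden: returns the source actor.
        raise NotImplementedError("actor() must be implemented")

    def setup(self) -> None:
        # Optional: any setup work needed before reading.
        pass

    def preprocess(self, element: Any) -> Any:
        # Optional: by default elements pass through unchanged.
        return element


class FileSourceActorPlaceholder:
    """Placeholder for the real source actor (hypothetical)."""

    def __init__(self, file_path: str, file_format: str) -> None:
        self.file_path = file_path
        self.file_format = file_format


@dataclass
class FileSource(SourceBase):
    """Inputs of the file source (illustrative field names)."""

    file_path: str
    file_format: str = "csv"

    def actor(self) -> FileSourceActorPlaceholder:
        return FileSourceActorPlaceholder(self.file_path, self.file_format)

    def setup(self) -> None:
        # A file source has no clients to create, so just validate the inputs.
        if self.file_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported file format: {self.file_format!r}")
```

Here `preprocess` is left at the pass-through default, matching the guess above that a file source won't need it.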

A source actor that does all the Ray parallelization. Extends from this class: https://github.com/launchflow/buildflow/blob/main/buildflow/runtime/ray_io/base.py#L88 (basically you just need to implement the run method)
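
A minimal sketch of the `run` side, under some loud assumptions: this is a plain Python class rather than a real Ray actor, it does not extend the actual base class in `buildflow/runtime/ray_io/base.py`, and `FileSourceActor`, `process_fn`, and `_read_rows` are hypothetical names. CSV and JSON use only the standard library; the Parquet branch assumes `pyarrow` is installed and is imported lazily. The JSON branch assumes the file holds either one object or a top-level array of objects.

```python
import csv
import json
from typing import Any, Callable, Dict, Iterator, List


class FileSourceActor:
    """Plain-Python stand-in for a source actor (not the real buildflow class)."""

    def __init__(
        self,
        file_path: str,
        file_format: str,
        process_fn: Callable[[Dict[str, Any]], Any],
    ) -> None:
        self.file_path = file_path
        self.file_format = file_format
        # Stands in for the user's process method.
        self.process_fn = process_fn

    def _read_rows(self) -> Iterator[Dict[str, Any]]:
        """Yield one dict per record in the file."""
        if self.file_format == "csv":
            # The header row supplies the keys; all values are strings.
            with open(self.file_path, newline="") as f:
                yield from csv.DictReader(f)
        elif self.file_format == "json":
            with open(self.file_path) as f:
                data = json.load(f)
            # Accept a top-level array or a single object.
            yield from (data if isinstance(data, list) else [data])
        elif self.file_format == "parquet":
            # Assumption: pyarrow is available; imported only when needed.
            import pyarrow.parquet as pq

            yield from pq.read_table(self.file_path).to_pylist()
        else:
            raise ValueError(f"unsupported file format: {self.file_format!r}")

    def run(self) -> List[Any]:
        """Read the whole file once and apply process_fn to every record."""
        return [self.process_fn(row) for row in self._read_rows()]
```

Since this is a batch source, `run` reads the file once and returns, rather than looping forever the way a streaming source would. How the real actor hands rows to the user's process method, and how the work is split across Ray workers, is not shown here.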

May 10 '23 19:05 boetro