Add batch source for reading in file types (Parquet, JSON, CSV)
I think this in particular is just a batch source because you can't really stream a file. We do have a similar streaming source for GCS files, but that is more of a GCS bucket watcher, whereas this is: read a file and do something with its contents.
A good example is probably BigQuery, although that's a little more complicated: https://github.com/launchflow/buildflow/blob/main/buildflow/runtime/ray_io/bigquery_io.py#L30
The basics of a source are:
A dataclass that defines the inputs of the source. Extends from this class: https://github.com/launchflow/buildflow/blob/main/buildflow/api/io.py#L16
It has a couple of methods (to override):
- actor -> this must be implemented and returns the source actor that does the ray parallelization (discussed below)
- setup -> this does any setup work needed to read in the source (for BigQuery this sets up clients and what not)
- preprocess -> this does any preprocessing before sending to the user's process method (I'm guessing you won't need this one)
A source actor that does all the ray parallelization. Extends from this class: https://github.com/launchflow/buildflow/blob/main/buildflow/runtime/ray_io/base.py#L88 (basically just need to implement the run method)