
Supporting Apache Beam

Open sayakpaul opened this issue 1 year ago • 0 comments

Support for running the core row-level components with Apache Beam could be extremely beneficial because:

  • Apache Beam is widely used and has a vibrant community.
  • Users can choose any compatible runner (such as Cloud Dataflow) to run their Beam pipelines, which helps with scalability.
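
To make the idea concrete, here is a minimal sketch of how a row-level step could be wrapped in a Beam pipeline. It is only an illustration: `keep_document`, its `min_words` threshold, and the `{"text": ...}` document shape are hypothetical stand-ins and not datatrove's actual API. The only Beam calls used are `beam.Pipeline`, `beam.Create`, `beam.Filter`, `beam.Map`, and `PipelineOptions`.

```python
from typing import Iterable, List, Optional


def keep_document(doc: dict, min_words: int = 5) -> bool:
    """Hypothetical row-level filter: keep docs with at least `min_words` words."""
    return len(doc.get("text", "").split()) >= min_words


def run(docs: Iterable[dict], pipeline_args: Optional[List[str]] = None) -> None:
    # Beam is imported lazily so the row-level function above stays usable
    # (and testable) without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner is chosen through the arguments, e.g. ["--runner=DirectRunner"]
    # for local runs. Other runners need their own additional options.
    options = PipelineOptions(pipeline_args or [])
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "CreateDocs" >> beam.Create(list(docs))
            | "FilterShortDocs" >> beam.Filter(keep_document)
            | "PrintKept" >> beam.Map(print)
        )
```

Calling `run([{"text": "a b c d e f"}, {"text": "short"}], ["--runner=DirectRunner"])` would execute the pipeline locally. The row-level logic stays a plain Python function, so the same function could be reused unchanged when the pipeline is submitted to a different runner.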

Some relevant examples of end-to-end Beam pipelines for NLP and beyond:

  • https://www.carted.com/blog/improving-dataflow-pipelines-for-text-data-processing/
  • https://cloud.google.com/dataflow/docs/tutorials/streaming-llm (uses transformers)
  • https://cloud.google.com/dataflow/docs/notebooks/run_inference_huggingface

sayakpaul, Jan 22 '24 01:01