datatrove
Supporting Apache Beam
Support for running the core row-level components with Apache Beam could be extremely beneficial because:
- Apache Beam is widely used and has a vibrant community.
- Users can choose any compatible runner (such as Cloud Dataflow) to run their Beam pipelines, which helps with scalability.
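To make the idea concrete, here is a rough sketch of how a row-level component might be plugged into a Beam pipeline. Everything in it is an assumption for illustration: `keep_document`, `run_with_beam`, the `min_words` parameter, and the dict-shaped documents are hypothetical and are not datatrove's API. Only `beam.Pipeline`, `beam.Create`, `beam.Filter`, and `beam.Map` are real Beam calls.

```python
def keep_document(doc, min_words=3):
    """Hypothetical row-level filter: keep docs with at least `min_words` words.

    It only looks at a single document, so it can be applied per element
    by any runner.
    """
    return len(doc["text"].split()) >= min_words


def run_with_beam(docs, min_words=3):
    """Apply the row-level filter inside a Beam pipeline (local runner by default)."""
    # Imported lazily so the filter itself stays usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateDocs" >> beam.Create(docs)
            # Extra keyword arguments to beam.Filter are forwarded to the callable.
            | "FilterShortDocs" >> beam.Filter(keep_document, min_words=min_words)
            | "PrintKept" >> beam.Map(print)
        )


if __name__ == "__main__":
    sample_docs = [
        {"text": "hello"},
        {"text": "a slightly longer document"},
    ]
    run_with_beam(sample_docs)
```

Because the filter is a plain per-document function, the same pipeline could in principle be submitted to another runner such as Cloud Dataflow by passing the appropriate pipeline options, without touching the filter logic.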
Some relevant examples of end-to-end Beam pipelines for NLP and beyond:
- https://www.carted.com/blog/improving-dataflow-pipelines-for-text-data-processing/
- https://cloud.google.com/dataflow/docs/tutorials/streaming-llm (uses transformers)
- https://cloud.google.com/dataflow/docs/notebooks/run_inference_huggingface