Processing text data at scale with Apache Beam and Cloud Dataflow
Presents an optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow). This repository accompanies our blog post: Improving Dataflow Pipelines for Text Data Processing.
We assume you already have a billing-enabled Google Cloud Platform (GCP) project if you want to run the pipeline on Cloud Dataflow.
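For orientation, here is a minimal sketch of what such a pipeline can look like. This is not the repository's actual `main.py`: the model choice (a TensorFlow Hub Universal Sentence Encoder), batch sizes, and transform names are illustrative assumptions.

```python
# Illustrative sketch only -- not the repository's main.py.
# Reads sentences, batches them, and embeds each batch with a model
# loaded once per worker in DoFn.setup().
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class EmbedSentences(beam.DoFn):
    def setup(self):
        # Load the model once per worker rather than once per element.
        import tensorflow_hub as hub

        self._model = hub.load(
            "https://tfhub.dev/google/universal-sentence-encoder/4"
        )

    def process(self, batch):
        # `batch` is a list of sentences; embedding them together
        # amortizes the per-call overhead of the model.
        embeddings = self._model(batch).numpy()
        for sentence, vector in zip(batch, embeddings):
            yield sentence, vector.tolist()


def run(input_path, runner="DirectRunner"):
    options = PipelineOptions(runner=runner)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read sentences" >> beam.io.ReadFromText(input_path)
            | "Batch" >> beam.BatchElements(min_batch_size=8, max_batch_size=64)
            | "Embed" >> beam.ParDo(EmbedSentences())
        )
```

Loading the model in `setup()` and batching elements before inference are the kinds of optimizations the accompanying blog post discusses.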
Running the code locally
To run the code locally, first install the dependencies: `pip install -r requirements.txt`. If you cannot create a Google Cloud Storage (GCS) Bucket, then download the data from here. We just need the `train_data.txt` file for our purpose. Also, note that without a GCS Bucket, one cannot run the pipeline on Cloud Dataflow, which is the main objective of this repository.

After downloading the dataset, update the paths and command-line arguments in `main.py` that use GCS. Then execute `python main.py -r DirectRunner`.
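The flag names below mirror the two commands in this README (`-r`/`--runner`, `--project`, `--gcs-bucket`); how `main.py` actually parses them may differ, so treat this as a hypothetical sketch of the CLI wiring.

```python
# Hypothetical CLI wiring; the repository's actual main.py may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-r", "--runner", default="DirectRunner",
    help="Beam runner: DirectRunner (local) or DataflowRunner.",
)
parser.add_argument("--project", help="GCP project ID (Dataflow only).")
parser.add_argument("--gcs-bucket", help="GCS Bucket name (Dataflow only).")
args = parser.parse_args()

# With DirectRunner, the pipeline executes on the local machine, so the
# GCS-specific arguments can point at local paths instead.
```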
Running the code on Cloud Dataflow
- Create a GCS Bucket and note its name.
- Create a folder called `data` inside the Bucket.
- Copy the `train_data.txt` file over to the `data` folder: `gsutil cp train_data.txt gs://<BUCKET-NAME>/data`.
- Run the following from the terminal (a sketch of the pipeline options this configures follows this list):

```shell
python main.py \
    --project <GCP-Project> \
    --gcs-bucket <BUCKET-NAME> \
    --runner DataflowRunner
```
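For reference, here is roughly how those flags might translate into Dataflow pipeline options inside `main.py`. The region and the `tmp`/`staging` folder names are assumptions, not taken from the repository.

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

# Assumed wiring: substitute your own <GCP-Project> and <BUCKET-NAME>.
options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "<GCP-Project>"
gcp_options.region = "us-central1"  # assumed region; pick your own
gcp_options.temp_location = "gs://<BUCKET-NAME>/tmp"
gcp_options.staging_location = "gs://<BUCKET-NAME>/staging"
```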
For more details please refer to our blog post: Improving Dataflow Pipelines for Text Data Processing.