Processing text data at scale with Apache Beam and Cloud Dataflow
Presents an optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow). This repository accompanies our blog post: Improving Dataflow Pipelines for Text Data Processing.
We assume you already have a billing-enabled Google Cloud Platform (GCP) project if you want to run the pipeline on Cloud Dataflow.
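For orientation, here is a minimal sketch of what such a pipeline can look like. This is not the repository's actual `main.py`: the model choice (a TensorFlow Hub Universal Sentence Encoder), batch sizes, and transform names are illustrative assumptions.

```python
# Illustrative sketch only -- not the repository's main.py.
# Reads sentences, batches them, and embeds each batch with a model
# loaded once per worker in DoFn.setup().
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class EmbedSentences(beam.DoFn):
    def setup(self):
        # Load the model once per worker rather than once per element.
        import tensorflow_hub as hub

        self._model = hub.load(
            "https://tfhub.dev/google/universal-sentence-encoder/4"
        )

    def process(self, batch):
        # `batch` is a list of sentences; embedding them together
        # amortizes the per-call overhead of the model.
        embeddings = self._model(batch).numpy()
        for sentence, vector in zip(batch, embeddings):
            yield sentence, vector.tolist()


def run(input_path, runner="DirectRunner"):
    options = PipelineOptions(runner=runner)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read sentences" >> beam.io.ReadFromText(input_path)
            | "Batch" >> beam.BatchElements(min_batch_size=8, max_batch_size=64)
            | "Embed" >> beam.ParDo(EmbedSentences())
        )
```

Loading the model in `setup()` and batching elements before inference are the kinds of optimizations the accompanying blog post discusses.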
Running the code locally
To run the code locally, first install the dependencies: `pip install -r requirements.txt`. If you cannot create a Google Cloud Storage (GCS) Bucket, then download the data from here. We just need the `train_data.txt` file for our purpose. Also, note that without a GCS Bucket, one cannot run the pipeline on Cloud Dataflow, which is the main objective of this repository.

After downloading the dataset, update the paths and command-line arguments in `main.py` that use GCS. Then execute `python main.py -r DirectRunner`.
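The flag names below mirror the two commands in this README (`-r`/`--runner`, `--project`, `--gcs-bucket`); how `main.py` actually parses them may differ, so treat this as a hypothetical sketch of the CLI wiring.

```python
# Hypothetical CLI wiring; the repository's actual main.py may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-r", "--runner", default="DirectRunner",
    help="Beam runner: DirectRunner (local) or DataflowRunner.",
)
parser.add_argument("--project", help="GCP project ID (Dataflow only).")
parser.add_argument("--gcs-bucket", help="GCS Bucket name (Dataflow only).")
args = parser.parse_args()

# With DirectRunner, the pipeline executes on the local machine, so the
# GCS-specific arguments can point at local paths instead.
```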
Running the code on Cloud Dataflow
- Create a GCS Bucket and note its name.
- Create a folder called `data` inside the Bucket.
- Copy the `train_data.txt` file over to the `data` folder: `gsutil cp train_data.txt gs://<BUCKET-NAME>/data`.
- Run the following from the terminal (a sketch of the pipeline options this configures follows this list):

```shell
python main.py \
    --project <GCP-Project> \
    --gcs-bucket <BUCKET-NAME> \
    --runner DataflowRunner
```
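For reference, here is roughly how those flags might translate into Dataflow pipeline options inside `main.py`. The region and the `tmp`/`staging` folder names are assumptions, not taken from the repository.

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

# Assumed wiring: substitute your own <GCP-Project> and <BUCKET-NAME>.
options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "<GCP-Project>"
gcp_options.region = "us-central1"  # assumed region; pick your own
gcp_options.temp_location = "gs://<BUCKET-NAME>/tmp"
gcp_options.staging_location = "gs://<BUCKET-NAME>/staging"
```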
For more details please refer to our blog post: Improving Dataflow Pipelines for Text Data Processing.