
Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud

Open Arith2 opened this issue 1 year ago • 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [x] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

2. Describe the bug

  1. The Apache Beam pipeline does not scale up the number of workers to increase preprocessing parallelism on Google Cloud.
  2. The storage bucket and the Compute Engine instance are in the same region.
  3. I used "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test Google Cloud Storage throughput: 876 Mbit/s for writing, 1.56 Gbit/s for reading.
  4. When I generate the vocabulary with "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}", it is very slow: it takes 30 minutes on an 11 GB input dataset.
  5. In htop I see three processes for this Python command; utilization on all cores is nearly 0% and only one thread is actively running.
  6. I also used shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards, with no improvement.
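A back-of-the-envelope check using the throughput numbers measured above (values taken from this report, treating GB and Gbit as decimal units) suggests raw GCS bandwidth is not the bottleneck:

```python
# Quick sanity check: reading the 11 GB dataset once at the measured
# 1.56 Gbit/s should take about a minute, so storage bandwidth alone
# does not explain the 30-minute runtime.
dataset_gb = 11.0        # input size from the report, decimal gigabytes
read_gbit_per_s = 1.56   # gsutil perfdiag read throughput from the report
read_seconds = dataset_gb * 8 / read_gbit_per_s
print(f"ideal sequential read time: {read_seconds:.0f} s")  # roughly 56 s
```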

3. Steps to reproduce

  1. Input dataset: the Criteo Kaggle training text, about 11 GB. I uploaded it to Google Cloud Storage in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
  2. Compute Engine c2d-highcpu-32 in europe-west1-b
  3. Specify STORAGE_BUCKET, PROJECT, REGION
  4. Run the python command above.
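The environment the steps above assume can be sketched as follows; the bucket, project, and region values here are placeholders, not the reporter's actual values:

```shell
# Placeholder values standing in for the reporter's real project settings.
export PROJECT="my-project"
export REGION="europe-west1"
export STORAGE_BUCKET="gs://my_storage"
# The input glob expanded by criteo_preprocess.py:
echo "${STORAGE_BUCKET}/criteo_sharded/training/*"
```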

4. Expected behavior

  • The Apache Beam pipeline scales up the number of running workers to parallelize preprocessing.
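One thing worth checking (an assumption on my part, not something verified against this report): the Dataflow runner documents worker-scaling options such as `--num_workers`, `--max_num_workers`, and `--autoscaling_algorithm`, and Apache Beam forwards unrecognized command-line flags to the runner, so they can be appended to the same `criteo_preprocess.py` invocation. A sketch with illustrative placeholder values:

```python
# Sketch: standard Dataflow worker-scaling flags appended to the original
# command line. Values are placeholders, not tested recommendations.
scaling_flags = [
    "--num_workers=8",                           # workers to start with
    "--max_num_workers=32",                      # autoscaling upper bound
    "--autoscaling_algorithm=THROUGHPUT_BASED",  # Dataflow's default autoscaler
    "--worker_machine_type=n2-standard-8",       # per-worker machine shape
]
cmd = ["python", "criteo_preprocess.py", "--runner", "DataflowRunner"] + scaling_flags
print(" ".join(cmd))
```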

6. System information

  • OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
  • TensorFlow installed from (source or binary): setup.py
  • TensorFlow version: 2.15.0
  • Python version: 3.9.2

Arith2 · Feb 21 '24 16:02