
Apache Beam Pipeline cannot maximize the number of workers for criteo_preprocess.py in Google Cloud

Open Arith2 opened this issue 1 year ago • 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [x] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/recommendation/ranking/preprocessing/criteo_preprocess.py

2. Describe the bug

  1. The Apache Beam pipeline does not scale up the number of workers to increase preprocessing parallelism on Google Cloud.
  2. The storage bucket and the Compute Engine instance are in the same region.
  3. I used "gsutil perfdiag -n 10 -s 100M -c 1 gs://my_storage" to test Google Cloud Storage throughput: 876 Mbit/s for writing, 1.56 Gbit/s for reading.
  4. When I generate the vocabulary with "python criteo_preprocess.py --input_path "${STORAGE_BUCKET}/criteo_sharded/training/*" --output_path "${STORAGE_BUCKET}/criteo_out/" --temp_dir "${STORAGE_BUCKET}/criteo_vocab/" --vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000 --project ${PROJECT} --region ${REGION}", it is very slow: it takes 30 minutes on an 11 GB input dataset.
  5. In htop I see three processes for this Python command; utilization on all cores is nearly 0% and only one thread is actively running.
  6. I also used shard_rebalancer.py to repartition the input dataset into 64 or 1024 shards, with no improvement.
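A back-of-the-envelope check using the throughput numbers measured above (values taken from this report, treating GB and Gbit as decimal units) suggests raw GCS bandwidth is not the bottleneck:

```python
# Quick sanity check: reading the 11 GB dataset once at the measured
# 1.56 Gbit/s should take about a minute, so storage bandwidth alone
# does not explain the 30-minute runtime.
dataset_gb = 11.0        # input size from the report, decimal gigabytes
read_gbit_per_s = 1.56   # gsutil perfdiag read throughput from the report
read_seconds = dataset_gb * 8 / read_gbit_per_s
print(f"ideal sequential read time: {read_seconds:.0f} s")  # roughly 56 s
```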

3. Steps to reproduce

  1. Input dataset: the Criteo Kaggle training text, about 11 GB. I uploaded it to Google Cloud Storage in europe-west1. https://www.kaggle.com/datasets/mrkmakr/criteo-dataset?resource=download
  2. Compute Engine c2d-highcpu-32 in europe-west1-b
  3. Specify STORAGE_BUCKET, PROJECT, REGION
  4. Run the python command above.
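The environment the steps above assume can be sketched as follows; the bucket, project, and region values here are placeholders, not the reporter's actual values:

```shell
# Placeholder values standing in for the reporter's real project settings.
export PROJECT="my-project"
export REGION="europe-west1"
export STORAGE_BUCKET="gs://my_storage"
# The input glob expanded by criteo_preprocess.py:
echo "${STORAGE_BUCKET}/criteo_sharded/training/*"
```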

4. Expected behavior

  • The Apache Beam pipeline scales up the number of running workers to parallelize preprocessing.
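One thing worth checking (an assumption on my part, not something verified against this report): the Dataflow runner documents worker-scaling options such as `--num_workers`, `--max_num_workers`, and `--autoscaling_algorithm`, and Apache Beam forwards unrecognized command-line flags to the runner, so they can be appended to the same `criteo_preprocess.py` invocation. A sketch with illustrative placeholder values:

```python
# Sketch: standard Dataflow worker-scaling flags appended to the original
# command line. Values are placeholders, not tested recommendations.
scaling_flags = [
    "--num_workers=8",                           # workers to start with
    "--max_num_workers=32",                      # autoscaling upper bound
    "--autoscaling_algorithm=THROUGHPUT_BASED",  # Dataflow's default autoscaler
    "--worker_machine_type=n2-standard-8",       # per-worker machine shape
]
cmd = ["python", "criteo_preprocess.py", "--runner", "DataflowRunner"] + scaling_flags
print(" ".join(cmd))
```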

6. System information

  • OS Platform and Distribution : Linux 6.1.0-18-cloud-amd64 x86_64
  • TensorFlow installed from (source or binary): setup.py
  • TensorFlow version: 2.15.0
  • Python version: 3.9.2

Arith2 · Feb 21 '24 16:02