[AIR] Improve tensor movement from host RAM to GPU
Description
In our recent profiling of a production recsys model using TF with the following data pipeline:
import tensorflow as tf
import ray
from ray.train.tensorflow import prepare_dataset_shard

def to_tf_dataset(dataset: ray.data.Dataset, batch_size):
    def to_tensor_iterator():
        # Stream batches of tf.Tensors out of the Ray Dataset shard.
        for batch in dataset.iter_tf_batches(
            batch_size=batch_size,
            prefetch_blocks=prefetch_blocks,
        ):
            # Split each batch into (features, label) for Keras.
            yield {k: v for k, v in batch.items() if k in feature_columns}, batch["label"]

    output_signature = (
        output_signatures,
        tf.TensorSpec(shape=(None, C.twotower_config.N_MAX_ORDER_SIZE), dtype=tf.int64, name='label'),
    )
    tf_dataset = tf.data.Dataset.from_generator(
        to_tensor_iterator, output_signature=output_signature
    )
    return prepare_dataset_shard(tf_dataset)
tf.data.Dataset.from_generator is responsible for moving data from host memory to the GPU. With a dummy for-loop consumer we can see that our batch delay time doubled between iter_batches (NumPy) and iter_tf_batches (tf.Tensor):
P50/P95/Max batch delay (s):
  iter_tf_batches (tf.Tensor): 3.8185287429996606 / 11.332880855199615 / 15.841198067999358
  iter_batches (NumPy):        1.9998468630001298 / 6.8852869076003085 / 10.068932280000809
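For context, a minimal sketch of the kind of dummy for-loop consumer used for this comparison; the dataset handle, the batch size, and the measure_batch_delay helper are assumptions for illustration, not the exact production benchmark:

import time
import numpy as np

def measure_batch_delay(batch_iterable):
    # Dummy consumer: pull batches in a plain for loop and record how long
    # each successive batch takes to arrive (the batch delay).
    delays = []
    start = time.perf_counter()
    for _ in batch_iterable:
        delays.append(time.perf_counter() - start)
        start = time.perf_counter()
    return delays

# `dataset` is assumed to be the ray.data.Dataset shard from the snippet above;
# batch_size=4096 is an arbitrary value for illustration.
for name, delays in [
    ("iter_batches", measure_batch_delay(dataset.iter_batches(batch_size=4096))),
    ("iter_tf_batches", measure_batch_delay(dataset.iter_tf_batches(batch_size=4096))),
]:
    p50, p95 = np.percentile(delays, [50, 95])
    print(f"{name} P50/P95/Max batch delay (s): {p50} {p95} {max(delays)}")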
This should be applicable to both TF and PyTorch, for both training and inference.
Relevant resources:
https://github.com/tensorflow/tensorflow/issues/43905
https://github.com/tensorflow/tensorflow/issues/44836
https://www.tensorflow.org/guide/data_performance
Use case
No response
TensorFlow prefetching may be sufficient.
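If prefetching does turn out to be enough, a minimal sketch of how the generator-based dataset built above could enable it; the placement of these calls is an assumption, not a verified fix for the delays measured above:

import tensorflow as tf

# Assumption: tf_dataset is the dataset returned by
# tf.data.Dataset.from_generator in to_tf_dataset above.
# Overlap host-side batch production with downstream consumption.
tf_dataset = tf_dataset.prefetch(tf.data.AUTOTUNE)

# Optionally stage batches onto the GPU ahead of time so the copy from
# host RAM overlaps with the training step (see the tf.data performance guide).
tf_dataset = tf_dataset.apply(
    tf.data.experimental.prefetch_to_device("/gpu:0")
)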