[AIR] Improve tensor movement from host RAM to GPU
Description
In our recent profiling of a production recsys model using TF with the following data pipeline:
import tensorflow as tf
import ray
from ray.train.tensorflow import prepare_dataset_shard

def to_tf_dataset(dataset: ray.data.Dataset, batch_size):
    def to_tensor_iterator():
        # Stream batches of tf.Tensors out of the Ray Dataset shard.
        for batch in dataset.iter_tf_batches(
            batch_size=batch_size,
            prefetch_blocks=prefetch_blocks,
        ):
            # Split each batch into (features, label) for Keras.
            yield {k: v for k, v in batch.items() if k in feature_columns}, batch["label"]

    output_signature = (
        output_signatures,
        tf.TensorSpec(shape=(None, C.twotower_config.N_MAX_ORDER_SIZE), dtype=tf.int64, name='label'),
    )
    tf_dataset = tf.data.Dataset.from_generator(
        to_tensor_iterator, output_signature=output_signature
    )
    return prepare_dataset_shard(tf_dataset)
tf.data.Dataset.from_generator is responsible for moving data from host memory to the GPU. With a dummy for-loop consumer we can see that our batch delay time doubled between iter_batches (NumPy) and iter_tf_batches (tf.Tensor):
P50/P95/Max batch delay (s):
  iter_tf_batches (tf.Tensor): 3.8185287429996606 / 11.332880855199615 / 15.841198067999358
  iter_batches (NumPy):        1.9998468630001298 / 6.8852869076003085 / 10.068932280000809
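For context, a minimal sketch of the kind of dummy for-loop consumer used for this comparison; the dataset handle, the batch size, and the measure_batch_delay helper are assumptions for illustration, not the exact production benchmark:

import time
import numpy as np

def measure_batch_delay(batch_iterable):
    # Dummy consumer: pull batches in a plain for loop and record how long
    # each successive batch takes to arrive (the batch delay).
    delays = []
    start = time.perf_counter()
    for _ in batch_iterable:
        delays.append(time.perf_counter() - start)
        start = time.perf_counter()
    return delays

# `dataset` is assumed to be the ray.data.Dataset shard from the snippet above;
# batch_size=4096 is an arbitrary value for illustration.
for name, delays in [
    ("iter_batches", measure_batch_delay(dataset.iter_batches(batch_size=4096))),
    ("iter_tf_batches", measure_batch_delay(dataset.iter_tf_batches(batch_size=4096))),
]:
    p50, p95 = np.percentile(delays, [50, 95])
    print(f"{name} P50/P95/Max batch delay (s): {p50} {p95} {max(delays)}")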
This should be applicable to both TF and PyTorch, for both training and inference.
Relevant resources:
https://github.com/tensorflow/tensorflow/issues/43905
https://github.com/tensorflow/tensorflow/issues/44836
https://www.tensorflow.org/guide/data_performance
Use case
No response
TensorFlow prefetching may be sufficient.
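If prefetching does turn out to be enough, a minimal sketch of how the generator-based dataset built above could enable it; the placement of these calls is an assumption, not a verified fix for the delays measured above:

import tensorflow as tf

# Assumption: tf_dataset is the dataset returned by
# tf.data.Dataset.from_generator in to_tf_dataset above.
# Overlap host-side batch production with downstream consumption.
tf_dataset = tf_dataset.prefetch(tf.data.AUTOTUNE)

# Optionally stage batches onto the GPU ahead of time so the copy from
# host RAM overlaps with the training step (see the tf.data performance guide).
tf_dataset = tf_dataset.apply(
    tf.data.experimental.prefetch_to_device("/gpu:0")
)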