[Question] expected throughput of cloud tpu on embedding lookup?
Hi,
I recently read this blog post https://cloud.google.com/blog/topics/developers-practitioners/building-large-scale-recommenders-using-cloud-tpus, am very interested in it, and am wondering about the raw performance of TPUEmbedding lookups. (We can quite easily get perf data for tf.nn.(safe_)embedding_lookup(_sparse) etc., but it is much harder to get TPUEmbedding lookup perf data.)
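For example, a rough baseline sketch for the non-TPU path (my own illustration, not from the blog; the table and query sizes mirror the script below, and numbers will vary by host):
import time
import numpy as np
import tensorflow as tf

params = tf.random.normal([1000000, 128])  # 1M x 128 float32 table
ids = tf.constant(np.random.randint(0, 1000000, size=65536 * 8))

@tf.function
def lookup():
  return tf.nn.embedding_lookup(params, ids)

_ = lookup()  # warmup: trace and compile once
t0 = time.time()
out = lookup().numpy()  # .numpy() forces execution to finish
dt = time.time() - t0
print('{:.3f} GB/s'.format(out.size * 4 / 1e9 / dt))  # 4 bytes per float32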
Based on the test script included in this repo, I wrote a piece of benchmarking code to test it:
# Copyright 2022 The TensorFlow Recommenders Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/layers/embedding/tpu_embedding_layer_test.py
# Trying to test TPU embedding lookup throughput; I could not find a
# lower-level API for doing such a test.
import time

import numpy as np
import tensorflow as tf

from tensorflow_recommenders.layers.embedding import tpu_embedding_layer

TABLE_SIZE = 1000000
EMB_DIM = 128
QUERY_KEY_NUM = 65536 * 8  # * 64 killed: "Allocation of xxx exceeds 10% of free system memory."


class TPUEmbeddingLayerTest():

  def __init__(self):
    self.embedding_values = np.arange(TABLE_SIZE * EMB_DIM, dtype=np.float32)
    self.initializer = tf.constant_initializer(self.embedding_values)
    self.table_config = tf.tpu.experimental.embedding.TableConfig(
        vocabulary_size=TABLE_SIZE,
        dim=EMB_DIM,
        initializer=self.initializer,
        combiner='sum',
        name='embedding_table')
    self.feature_config = {
        'indices2embeddings': tf.tpu.experimental.embedding.FeatureConfig(
            table=self.table_config, name='indices2embeddings'),
    }
    self.batch_size = QUERY_KEY_NUM
    self.sample_size = 1
    # TODO(pehuang): draw samples randomly from a given distribution.
    self.data_point_indices = np.zeros((self.batch_size, 2), dtype=np.int32)
    self.data_point_indices[:, 0] = np.arange(self.batch_size, dtype=np.int32)
    self.data_points = np.random.choice(TABLE_SIZE, QUERY_KEY_NUM)
    self.embedding_lookup_input_data = tf.SparseTensor(
        indices=self.data_point_indices,
        # fp64 embeddings and int32 keys by default?
        values=tf.convert_to_tensor(self.data_points, dtype=tf.int32),
        dense_shape=[self.batch_size, self.sample_size])
    self.dataset = tf.data.Dataset.from_tensors(
        {'indices2embeddings': self.embedding_lookup_input_data})

  def embedding_lookup_throughput_test(self, optimizer_name='sgd', training=False):
    # resolver = tf.distribute.cluster_resolver.TPUClusterResolver('').connect('')
    # strategy = tf.distribute.TPUStrategy(resolver)
    strategy = tf.distribute.get_strategy()  # Use the default strategy.
    with strategy.scope():
      embedding_layer = tpu_embedding_layer.TPUEmbedding(
          feature_config=self.feature_config, optimizer=None)
      input_args = {
          'batch_size': self.batch_size,
          'shape': (),
          'sparse': True,
          'dtype': tf.int32,
      }
      inputs = {'indices2embeddings': tf.keras.Input(**input_args, name='indices2embeddings')}
      embeddings = embedding_layer(inputs)
      self.model = tf.keras.Model(inputs=inputs, outputs=embeddings)

    dist = strategy.experimental_distribute_dataset(
        self.dataset,
        options=tf.distribute.InputOptions(experimental_fetch_to_device=False))
    dist_iter = iter(dist)

    def lookup(features):
      return self.model(features)

    # Warming up for 10 rounds failed with a StopIteration exception once the
    # single batch of data was used up:
    # for _ in range(10):
    #   result = strategy.run(lookup, args=(next(dist_iter),))
    t_start = time.time()
    result = strategy.run(lookup, args=(next(dist_iter),))
    t_end = time.time()
    # import pdb; pdb.set_trace()  # stop here to check embeddings
    # NOTE: this counts float32 elements; multiply by 4 bytes/element for true GB/s.
    print('embedding throughput: {}GB/s'.format(
        (QUERY_KEY_NUM * EMB_DIM) / 1e9 / (t_end - t_start)))
    # Observed: 0.04273784880155698GB/s; must be cold-start, with D2H and H2D
    # etc. included, but I cannot warm up or increase the table size
    # temporarily on the Cloud Shell-provided TPU.


if __name__ == '__main__':
  test = TPUEmbeddingLayerTest()
  test.embedding_lookup_throughput_test()
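One fix I considered for the warmup problem (a sketch of my own; it assumes repeating the same batch is acceptable for a benchmark) is to repeat the dataset so the iterator is not exhausted after one step:
# Hypothetical tweak to __init__ above: .repeat() yields the batch endlessly,
# so a warmup loop can call next(dist_iter) more than once.
self.dataset = tf.data.Dataset.from_tensors(
    {'indices2embeddings': self.embedding_lookup_input_data}).repeat()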
I used Cloud Shell to get access to a TPU as introduced in https://github.com/tensorflow/tpu, but the reported throughput is well below expectation (obviously due to various issues in my benchmark script). Could anyone help correct the script so I can get benchmark results the right way? Or just provide some reference data, so I know what the expected throughput should be?
THX!
What are you hoping to measure with this? I am not sure doing a microbenchmark like this is going to give you very valuable data. You could obtain a more meaningful metric by using both versions of embeddings with a real model and observing the training performance in both cases.
Bear in mind that TPUEmbedding lookups will normally be pipelined with TensorCore execution, via the pipeline_execution_with_tensor_core argument.
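For example (a sketch; I'm assuming here that the TFRS layer forwards this flag, so check the signature in your installed version):
# Sketch: overlap embedding lookups with TensorCore execution.
embedding_layer = tpu_embedding_layer.TPUEmbedding(
    feature_config=self.feature_config,
    # Hypothetical optimizer choice, just for illustration.
    optimizer=tf.tpu.experimental.embedding.SGD(learning_rate=0.05),
    pipeline_execution_with_tensor_core=True)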
Hi Maciej,
Thanks for your suggestion to use the pipeline_execution_with_tensor_core argument. I'm doing this microbenchmarking out of personal interest, to estimate the performance gain of using TPUs for model training/inference. I started the benchmarking work with the embedding lookup part, since embedding lookup efficiency may dominate the overall throughput. I'll try connecting the TPUEmbedding lookup to a real model for testing later, as suggested.
Looks like you are using the default strategy, tf.distribute.get_strategy(), which places your model (including embeddings) on the CPU, not the TPU.
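A minimal sketch of what placing it on the TPU would look like (assuming a Cloud TPU reachable from your VM):
# Connect to the TPU worker and use TPUStrategy instead of the default strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)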
In terms of expected throughput, we see somewhere around 1-5 million examples per second for DLRM model training on the Criteo 1TB dataset on TPU v3-8 and TPU v4-8 (TPU v4 being ~2x faster than v3).
This assumes the input pipeline is optimized and steps_per_execution >> 1 is used.
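For example (a toy sketch just to show the flag; the model, data, and values here are hypothetical and workload-dependent):
import tensorflow as tf

# steps_per_execution runs many training steps per tf.function call,
# amortizing host-device launch overhead.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)
xs = tf.random.normal([1024, 8])
ys = tf.random.normal([1024, 1])
model.fit(xs, ys, batch_size=32, epochs=1)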