RFC: Sparse Domain Isolation for Supporting Large-Scale Sparse Weights Training
Sparse Domain Isolation for supporting large-scale Recommender Systems.
Status | Draft
---|---
Author(s) | Haidong Rong ([email protected]), Yafei Zhang ([email protected]), Jiandong Wang ([email protected]), Chuan Cheng ([email protected])
Reviewer(s) | Alexandre Passos ([email protected]), Bairen Yi ([email protected])
Sponsor | Yuefeng Zhou ([email protected]), Zhenyu Tan ([email protected])
Updated | 2020-09-16
@yuefengz @byronyi
Hi,
This is the RFC for Sparse Domain Isolation for supporting large-scale recommender systems.
It's still a draft; we will update it with the latest content as soon as possible and improve it on this basis. To move things forward quickly, I submitted it here first, but the owners are everyone who participated in the past discussions, and we will complete the list later.
@byronyi If we are going to contribute to Addons first, do we need an RFC here?
Since this RFC targets SIG Addons, adding SIG Addons leads @facaiy @seanpmorgan and TF sponsor @karmel as reviewers.
@byronyi If we are going to contribute to Addons first, do we need an RFC here?
I guess the design was originally targeted to TF core.
As @alextp said, if part of it still requires changes to TF core, then we still need a (probably smaller) RFC here.
It requires changes to core that we should discuss now. From my point of view, the most important thing TF core can offer here is letting experimentation and development on this type of problem (for which there is very high demand, at least in industry) happen without needing to involve TF core.
Separately, I think the design of the actual components here has many interesting parts, and something fairly close to what is proposed should eventually be in core, but right now it is more important to make core properly extensible than to debate the details of this component.
That's a very interesting proposal.
From a high level view (and I'm probably wrong) it looks like it proposes a new type of variable and a new type of optimizer which can update that variable. Given that this is the case I think we can implement this in addons or some other SIG package as long as there are APIs in core TF to ensure that this variable can declare itself checkpointable, be tracked by something like tf.Module / keras.Model (so you can do model.trainable_sparse_variables), and maybe be automatically watched via the gradient tape.
Can you expand the document to clarify the details of these changes to existing parts of TF as opposed to most of the content which is on the new types?
Thanks!
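(Editor's note: to make the kind of core hooks discussed above concrete, here is a minimal sketch that uses only pieces that already exist in TF — tf.Module as a Trackable container, lookup.MutableHashTable, and GradientTape.watch. The class and method names are hypothetical illustrations, not the RFC's API.)

```python
import tensorflow as tf

class SparseWeights(tf.Module):
    """Hypothetical hash-table-backed weights with a dense, tape-watchable view."""

    def __init__(self, dim, name=None):
        super().__init__(name=name)
        self.dim = dim
        # tf.Module is Trackable, so the table is saved/restored by tf.train.Checkpoint.
        self.table = tf.lookup.experimental.MutableHashTable(
            key_dtype=tf.int64, value_dtype=tf.float32,
            default_value=tf.zeros([dim], tf.float32))

    def gather(self, ids, tape=None):
        # Dense per-batch view of the table; watching it lets the tape produce
        # gradients w.r.t. the gathered rows (the table lookup op itself has no gradient).
        values = self.table.lookup(ids)
        if tape is not None:
            tape.watch(values)
        return values

    def scatter_sub(self, ids, delta):
        # Write updated rows back into the table; this is what a sparse-aware
        # optimizer would call after computing its update.
        self.table.insert(ids, self.table.lookup(ids) - delta)


weights = SparseWeights(dim=4)
ids = tf.constant([1, 7, 42], dtype=tf.int64)
weights.table.insert(ids, tf.random.normal([3, 4]))      # seed some rows

with tf.GradientTape() as tape:
    emb = weights.gather(ids, tape=tape)
    loss = tf.reduce_sum(tf.square(emb))
grad = tape.gradient(loss, emb)              # gradient of the gathered [3, 4] slice
weights.scatter_sub(ids, 0.1 * grad)         # manual SGD-style write-back
ckpt = tf.train.Checkpoint(weights=weights)  # the table participates in checkpointing
```

Gradients never flow into the table itself here; they are taken with respect to the watched dense slice and written back explicitly, which is roughly the gap the optimizer-related parts of the RFC aim to close.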
Thank you. In fact, my initial idea was to encapsulate some kind of ResourceVariable-backed hashtable, since, as we know, TF is not good at training anything that is not a tf.Variable. I reuse lookup.MutableHashTable because I would rather not write a new hash library in TF; in particular, the lookup.* classes already support checkpointing and can be deployed on tf.distribute.Server. Here is a comparison against v1.15.2 that shows the range of core affected by the RFC: https://github.com/tensorflow/tensorflow/compare/v1.15.2...rhdong:rfc?expand=1
The main changes:
- supporting a random initializer on lookup.MutableHashTable.Find (a sketch follows after this comment)
- adaptation of four stateful optimizers (Adagrad, Adam, FTRL, Momentum), which may be cancelled in the new scheme
Thanks!
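(Editor's note: as a reading aid for the first item above, here is a hedged, Python-level sketch of the "random initializer on Find" idea: look up a batch of ids, detect which ids are still missing, and lazily insert randomly initialized rows for them. The RFC's actual change puts this behaviour inside lookup.MutableHashTable.Find; the NaN sentinel below is only an artifact of sketching it outside the kernel.)

```python
import tensorflow as tf

dim = 8
table = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64, value_dtype=tf.float32,
    default_value=tf.fill([dim], float("nan")))   # NaN marks "not yet inserted"

def find_with_random_init(ids, stddev=0.1):
    """Lookup that lazily inserts randomly initialized rows for unseen ids."""
    values = table.lookup(ids)
    missing_ids = tf.boolean_mask(ids, tf.math.is_nan(values[:, 0]))
    new_rows = tf.random.normal([tf.size(missing_ids), dim], stddev=stddev)
    table.insert(missing_ids, new_rows)
    return table.lookup(ids)

emb = find_with_random_init(tf.constant([3, 5, 3], dtype=tf.int64))   # shape [3, 8]
```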
The change to the existing SparseApply* kernels which removes Ref(T) from the signature is backwards incompatible and can't be done.
Adding new kernels for the hash apply is fine, though.
I do wonder if we need the Optimizer method _apply_dense_hash or whether we can use a separate optimizer-like class which knows about the hash application. This has the advantage that it naturally covers the use cases where people want different optimizers for the really sparse embedding layers (which I think is relatively common).
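(Editor's note: to make the "separate optimizer-like class" alternative concrete, here is a sketch of a thin wrapper — a hypothetical class, not a TF API — that hands dense variables to any ordinary optimizer and applies a plain write-back for hash-table-backed embedding rows.)

```python
import tensorflow as tf

class HashAwareOptimizer:
    """Illustrative wrapper: routes dense gradients to an inner optimizer and
    applies a simple SGD write-back for hash-table-backed sparse rows."""

    def __init__(self, dense_optimizer, sparse_lr=0.1):
        self.dense_opt = dense_optimizer
        self.sparse_lr = sparse_lr

    def apply_gradients(self, grads_and_vars, sparse_updates=()):
        # grads_and_vars: the usual (gradient, tf.Variable) pairs.
        # sparse_updates: (gradient, table, ids) triples for hash-backed weights,
        # where `gradient` is w.r.t. the rows previously looked up for `ids`.
        self.dense_opt.apply_gradients(grads_and_vars)
        for grad, table, ids in sparse_updates:
            rows = table.lookup(ids)
            table.insert(ids, rows - self.sparse_lr * grad)  # plain SGD on those rows
```

A real implementation would also keep per-key optimizer slots (e.g. Adagrad accumulators) in additional tables, which is what the "stateful optimizers adaptation" item above refers to.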
Yes, you're right, this is only a temporary version; I have changed the name to _apply_dense_unstateful, since XX_hash was a bad name. As for a separate optimizer class, I'm not sure which option would be better. I prefer to use the same optimizer to provide a consistent experience for algorithm engineers, because a deep-learning RecSys model may contain dense weights and sparse weights at the same time.
I think TensorFlow can provide a way to extend optimizers so that you can extend existing optimizers to handle your sparse weights.
+1 to Yuefeng's suggestion.
Can this proposal be enhanced with a section discussing such extension?
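(Editor's note: one possible shape for such an extension, sketched with hypothetical hooks — `is_sparse_wrapper`, `var.ids`, and `var.table` stand in for whatever the RFC's TrainableWrapper ends up exposing — is to subclass an existing Keras optimizer and push the updated rows back into the hash table after the normal dense update.)

```python
import tensorflow as tf

class SparseFriendlyAdagrad(tf.keras.optimizers.Adagrad):
    """Sketch of 'extend an existing optimizer' for hash-backed sparse weights."""

    def apply_gradients(self, grads_and_vars, name=None, **kwargs):
        grads_and_vars = list(grads_and_vars)
        result = super().apply_gradients(grads_and_vars, name=name, **kwargs)
        for _, var in grads_and_vars:
            if getattr(var, "is_sparse_wrapper", False):       # hypothetical marker
                var.table.insert(var.ids, var.read_value())    # write updated rows back
        return result
```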
I think TensorFlow can provide a way to extend optimizers so that you can extend existing optimizers to handle your sparse weights.
cc @omalleyt12, who proposed the new customizable optimizer in #234. Would you mind shedding some light on this?
@yuefengz @byronyi @alextp @smilingday @facaiy @seanpmorgan @omalleyt12 Hi all, I just committed an important update for the optimizer-reusing scheme based on ResourceVariable and came up with a detailed API design. I will provide a runnable demo on docker.io as soon as possible. Thank you.
I think this version of the scheme is simple and natural enough for core.
Thanks for pinging me, @yuefengz @smilingday. The proposal is very interesting. I'm wondering if we can introduce a new kind of Variable class and reuse all existing optimizers (in tf-core or tf-addons).
I'm afraid the proposal goes beyond the scope of tf-addons, so I suggest putting it in a separate repo first. @seanpmorgan Sean, what do you think?
Sean discussed this at the SIG Addons meetings and replied in separate email threads that tf-addons might not be a good fit. We are still exploring the right place for these contributions.
I will provide the source code with unit test cases soon.
Is this RFC related to the recently proposed paper "DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications" by Google? https://arxiv.org/pdf/2004.08366.pdf
No, this is a different scheme, proposed in the earlier paper "Distributed Equivalent Substitution Training for Large-Scale Recommender Systems" (accepted at SIGIR 2020).
@yuefengz @tanzhenyu @byronyi @alextp Hi, I just updated this RFC. The update contains some key features, including a scheme that is compatible with all tf.initializers without hacking too much on MutableHashTableOfTensors::Find. We have also provided our patch to core: https://github.com/tensorflow/tensorflow/pull/41371. Please help us improve it, thank you!
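(Editor's note: the "compatible with all tf.initializers" point can be pictured as a small generalization of the earlier lookup sketch: instead of hard-coding tf.random.normal, an arbitrary tf.keras initializer produces the rows for ids that are not yet in the table. Eager-mode sketch only; the linked PR implements the real mechanism.)

```python
import tensorflow as tf

dim = 8
table = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64, value_dtype=tf.float32,
    default_value=tf.fill([dim], float("nan")))   # NaN marks "not yet inserted"

def find_with_initializer(ids, initializer=tf.keras.initializers.GlorotUniform()):
    """Rows for unseen ids come from an arbitrary tf.keras initializer."""
    values = table.lookup(ids)
    missing_ids = tf.boolean_mask(ids, tf.math.is_nan(values[:, 0]))
    num_new = int(tf.size(missing_ids))   # static count, so fan-based initializers also work
    if num_new:
        table.insert(missing_ids, initializer([num_new, dim], dtype=tf.float32))
    return table.lookup(ids)

emb = find_with_initializer(tf.constant([3, 5, 11], dtype=tf.int64))   # shape [3, 8]
```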
Is it compatible with TensorFlow Serving? @rhdong
Yes
Hi @rhdong, I fixed some bugs (the shape of TrainableWrapper) and built TF 2.4.0 based on your code. It seems the dynamic_embedding is not updated during training.
Code as follows:
```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Lambda
from tensorflow import dynamic_embedding as de
import numpy as np

idx = np.random.randint(0, 10, 100)
label = np.array([1.0 if a % 2 == 0 else 0.0 for a in idx], dtype=np.float32)

class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.w = de.get_variable(name="dynamic_embeddings", dim=8, initializer=np.random.random(8))
        self.d0 = Lambda(lambda x: de.embedding_lookup(params=self.w, ids=x, name="wide-sparse-weights"))
        self.d1 = Dense(10, activation='relu')
        self.d2 = Dense(1, activation='sigmoid')
        self.x0 = None

    def call(self, x):
        self.x0 = self.d0(x)
        x1 = self.d1(self.x0)
        return self.d2(x1)

model = MyModel()
loss_func = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adagrad(learning_rate=.5)
train_loss = tf.keras.metrics.Mean(name='train_loss')

def train_step(x, label, print_loss=False):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = loss_func(logits, label)
    trainable_weights = model.trainable_variables
    # trainable_weights.append(model.x0)
    grads = tape.gradient(loss, trainable_weights)
    optimizer.apply_gradients(zip(grads, trainable_weights))
    if print_loss:
        print("loss:{}".format(train_loss(loss).numpy()))

def emb_sum():
    a = de.embedding_lookup(params=model.w, ids=np.array([2, 3]), name="wide-sparse-weights")
    return a.numpy().sum()

def kernel_sum():
    return model.d1.kernel.numpy().sum()

print("emb sum:{}".format(emb_sum()))

for i in range(20):
    train_step(idx.reshape(100, 1), label.reshape(100, 1))
print("emb sum:{}".format(emb_sum()))
print("kernel sum:{}".format(kernel_sum()))

# train more
for i in range(10):
    train_step(idx.reshape(100, 1), label.reshape(100, 1), print_loss=True)
print("emb sum:{}".format(emb_sum()))
print("kernel sum:{}".format(kernel_sum()))

# print trainable weights
print([v.name for v in model.trainable_weights])
```
console:
emb sum:**-0.031497083604335785**
emb sum:**-0.031497083604335785**
kernel sum:0.6821714043617249
loss:7.522636
loss:7.52227
loss:7.5219383
loss:7.521633
loss:7.521351
loss:7.521089
loss:7.520846
loss:7.5206184
loss:7.5204053
loss:7.5202055
emb sum:**-0.031497083604335785**
kernel sum:0.6808109283447266
['my_model/dense/kernel:0', 'my_model/dense/bias:0', 'my_model/dense_1/kernel:0',
'my_model/dense_1/bias:0', 'my_model/lambda/TrainableWrapper:0']
@shenbaise Thank you for the feedback, I will check and fix it as soon as possible.
Hi @shenbaise, the reason is that the commit is not compatible with Keras, especially optimizer v2. I need two days to fix it and add the UT cases; please wait a moment. Thank you!
Hi @shenbaise, I fixed the issue and the commit is here.
FYI, @kttian wrote a prototype for a differentiable hash map, roughly the equivalent of TensorList, as part of her internship project. Here's a colab that demonstrates direct gradient updates: https://colab.sandbox.google.com/drive/1hyFmriuq4Bz61_rxg2bfdE_jXHVfX8Rr?usp=sharing#scrollTo=8HDUUBEFAesC
There may be an opportunity to join efforts on a core implementation.
@alextp @saxenasaurabh @dynamicwebpaige
This is a good job, but I think it is difficult to make the hash map trainable.
It already is trainable (at least in the sense of trainable that I believe you're referring to).
@yuefengz Is this still in draft mode? What are the plans with this RFC?