Wissam Antoun

Results: 23 comments of Wissam Antoun

Do you mean something similar to BERT's one-hot embeddings? https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/on_device_embedding.py#L79
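Roughly, that layer's one-hot path does something like the sketch below (a minimal reconstruction, not the exact upstream code; `one_hot_embedding_lookup` is just a name I'm using here):

```python
import tensorflow as tf

def one_hot_embedding_lookup(embedding_table, ids):
    # Instead of tf.gather, encode the ids as one-hot vectors and matmul with
    # the embedding table; this tends to lower better on TPUs.
    flat_ids = tf.reshape(ids, [-1])                                      # [B*S]
    one_hot = tf.one_hot(flat_ids, depth=tf.shape(embedding_table)[0],
                         dtype=embedding_table.dtype)                     # [B*S, V]
    flat_embeds = tf.matmul(one_hot, embedding_table)                     # [B*S, D]
    out_shape = tf.concat([tf.shape(ids), [tf.shape(embedding_table)[1]]], axis=0)
    return tf.reshape(flat_embeds, out_shape)                             # [B, S, D]
```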

I tried this, although I'm not sure if it's the best implementation:

```python
def take_along_axis(x, indices):
    one_hot_indices = tf.one_hot(indices, depth=x.shape[-1], dtype=x.dtype)  # [B, S, P, D] => [B, 128, 128, 512]
    ...
```
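For reference, a complete sketch of this one-hot approach (a reconstruction, not the exact code cut off above; the `matmul`/`expand_dims`/`squeeze` tail and the example shapes are assumptions):

```python
import tensorflow as tf

def take_along_axis(x, indices):
    # One-hot reformulation of np.take_along_axis over the last axis of x.
    # x:       [B, S, D]  float tensor, e.g. [B, 128, 512]
    # indices: [B, S, P]  int tensor with values in [0, D), e.g. [B, 128, 128]
    one_hot_indices = tf.one_hot(indices, depth=x.shape[-1], dtype=x.dtype)  # [B, S, P, D] => [B, 128, 128, 512]
    # Batched mat-vec product: [B, S, P, D] @ [B, S, D, 1] -> [B, S, P, 1],
    # then drop the trailing axis.
    gathered = tf.matmul(one_hot_indices, tf.expand_dims(x, axis=-1))  # [B, S, P, 1]
    return tf.squeeze(gathered, axis=-1)                               # [B, S, P]
```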

@sanchit-gandhi is there a better implementation than mine, without `expand_dims` or `squeeze`, since these are unfavorable operations on TPUs?

Hey @sanchit-gandhi, I have already tried the experimental numpy function, with no improvement at all compared to `gather` with `batch_dims=2`. I also tried going up to a sequence length of...
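For context, the `gather` baseline being compared against is just the batched gather below (a minimal sketch; `take_along_axis_gather` is a name I'm using here, and the shapes are carried over from the snippet above):

```python
import tensorflow as tf

def take_along_axis_gather(x, indices):
    # Baseline np.take_along_axis port: gather along the last axis of x,
    # treating the first two dimensions as batch dimensions.
    # x: [B, S, D], indices: [B, S, P] with values in [0, D) -> [B, S, P]
    return tf.gather(x, indices, batch_dims=2)
```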

> ```python
> def take_along_axis(x, indices):
>     one_hot_indices = tf.one_hot(indices, depth=x.shape[-1], dtype=x.dtype)  # [B, S, P, D] => [B, 128, 128, 512]
>     # [B, S, P, D]...
> ```

@gante I tested the `tf.einsum` implementation. It gave me the same performance as the `one_hot` trick, about ~120 sequences/second. I tried it with different batch sizes, but still...
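The `tf.einsum` variant I tested was along these lines (a sketch, assuming the same shapes as above; it drops the `expand_dims`/`squeeze` pair in favor of a single contraction):

```python
import tensorflow as tf

def take_along_axis_einsum(x, indices):
    # Same one-hot encoding of the indices, but the gather is expressed as a
    # single einsum contraction over the last axis instead of matmul + squeeze.
    # x: [B, S, D], indices: [B, S, P] -> [B, S, P]
    one_hot_indices = tf.one_hot(indices, depth=x.shape[-1], dtype=x.dtype)  # [B, S, P, D]
    return tf.einsum("bspd,bsd->bsp", one_hot_indices, x)
```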

Yeah, this is a weird and unexpected bug. Do you know someone from Google's XLA or TPU team we could get in contact with? And thanks a lot for the...

I think the best fix would be to do a single step at first when `first_batch` is true. I'll write a pull request ASAP with my suggested fix.

I was trying to pretrain DeBERTaV2 with the RTD objective (but without Gradient-Disentangled Embedding Sharing). I noticed that it runs way slower than ELECTRA (which is BERT-based). I tried...

@stefan-it I got all the pretraining code working for DeBERTaV3 except the Gradient-Disentangled Embedding Sharing part, which I guess is one of the main contributions of DeBERTaV3. Do you have...