HeKa

Results: 24 issues by HeKa

I don't know what happened; are the calculation precision and parameter precision not set correctly? DeepSpeed or Megatron can easily achieve 55% MFU on the same machine. Here is my bash...
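
For reference, MFU can be sanity-checked from throughput alone. A minimal sketch using the standard ~6 × params FLOPs-per-token estimate for training and the A100 BF16 peak of 312 TFLOP/s; the throughput and GPU count below are purely illustrative, not taken from the issue:

```python
# Rough MFU estimate for Llama2-13B training on 8x A100 (BF16 peak ~312 TFLOP/s).
params = 13e9                   # model parameters
tokens_per_sec = 3200.0         # hypothetical measured training throughput
num_gpus = 8
peak_flops = 312e12             # per-GPU BF16 peak

model_flops = 6 * params * tokens_per_sec      # ~6*N FLOPs per trained token
mfu = model_flops / (num_gpus * peak_flops)
print(f"MFU ~= {mfu:.1%}")                     # ~10% with these illustrative numbers
```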

For example, on a single A100 machine, Llama2 13B training with TP2 + DP4 + ZeRO-1 is faster than with FSDP.

Only FP16/BF16 are supported in the FusedAttention class.

I am a developer of TensorFlow [recommenders-addons](https://github.com/tensorflow/recommenders-addons), and I now need to develop an all-to-all embedding layer for multi-GPU distributed training of recommendation models. The old TensorFlow distributed strategy clearly...

type:support

1. Like `tf.raw_ops.Bucketize`: input=[123.5, 0.7, 10.3, 100.6, 11.7], boundaries=[1, 11, 111], output=[3, 0, 1, 2, 2]. 2. Like `tf.dynamic_partition` (also `torch.ops.fbgemm.block_bucketize_sparse_features` in torch): input=[123.5, 0.7, 10.3, 100.6, 11.7], partition=[1, 0, 2, 0, 1], output=[0.7, 100.6] and [123.5, 11.7] and... (see the sketch after this entry)

question
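
A minimal TensorFlow sketch of the two behaviors described in the entry above, reproducing its example inputs and outputs (the `num_partitions=3` value is an assumption for the truncated third partition):

```python
import tensorflow as tf

values = tf.constant([123.5, 0.7, 10.3, 100.6, 11.7])

# 1. Bucketize each value against the boundaries [1, 11, 111].
bucket_ids = tf.raw_ops.Bucketize(input=values, boundaries=[1.0, 11.0, 111.0])
# bucket_ids -> [3, 0, 1, 2, 2]

# 2. Scatter values into partitions given an explicit partition id per element.
parts = tf.dynamic_partition(values, tf.constant([1, 0, 2, 0, 1]), num_partitions=3)
# parts[0] -> [0.7, 100.6], parts[1] -> [123.5, 11.7], parts[2] -> [10.3]
```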

I need to build a model like this: there is a very large, distributed, dynamic-shape embedding, which can be seen as a hash table. In every DP rank, when...

question
needs info

> jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: No registered implementation for custom call to te_scaled_upper_triang_masked_softmax_forward for platform CUDA

```python
from transformer_engine.jax.flax.transformer import DotProductAttention, MultiHeadAttention, TransformerLayer
```

When I use the TE flax layers, all of...

bug
jax

Not GPipe: run the pipeline so that forward and backward passes overlap.

feature request

Matrix format: https://jax.readthedocs.io/en/latest/_autosummary/jax.experimental.sparse.BCSR.html#jax.experimental.sparse.BCSR
Some features in DeepSpeed: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp
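
For context, a minimal sketch of building the BCSR sparse format linked above in JAX; the matrix and its values are purely illustrative:

```python
import jax.numpy as jnp
from jax.experimental import sparse

# A mostly-zero dense matrix converted to the blocked-CSR (BCSR) format.
dense = jnp.zeros((4, 4)).at[0, 1].set(2.0).at[3, 2].set(5.0)
mat = sparse.BCSR.fromdense(dense)
print(mat.data, mat.indices, mat.indptr)  # block data, column indices, row pointers
```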

If I understand correctly, the autoregressive model has a loss, and the multi-task dense layers that follow the autoregressive model have a weighted loss. How should they be combined? And in a ranking model, how...
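
A minimal sketch of the most common way to combine the two objectives, a fixed weighted sum; all names and numbers here are hypothetical, not taken from the issue:

```python
import tensorflow as tf

# Hypothetical per-batch losses.
lm_loss = tf.constant(2.31)                  # autoregressive (next-token) loss
task_losses = tf.constant([0.45, 0.80])      # losses from the multi-task dense heads
task_weights = tf.constant([1.0, 0.5])       # hand-tuned per-task weights

# Combine: total = LM loss + weighted sum of the task losses.
total_loss = lm_loss + tf.reduce_sum(task_weights * task_losses)
```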