HeKa

Results: 24 issues by HeKa

I don't know what happened; are the calculation precision and parameter precision not set correctly? DeepSpeed or Megatron can easily achieve 55% MFU on the same machine. Here is my bash...
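
For reference, MFU can be sanity-checked from throughput alone. A minimal sketch using the standard ~6 × params FLOPs-per-token estimate for training and the A100 BF16 peak of 312 TFLOP/s; the throughput and GPU count below are purely illustrative, not taken from the issue:

```python
# Rough MFU estimate for Llama2-13B training on 8x A100 (BF16 peak ~312 TFLOP/s).
params = 13e9                   # model parameters
tokens_per_sec = 3200.0         # hypothetical measured training throughput
num_gpus = 8
peak_flops = 312e12             # per-GPU BF16 peak

model_flops = 6 * params * tokens_per_sec      # ~6*N FLOPs per trained token
mfu = model_flops / (num_gpus * peak_flops)
print(f"MFU ~= {mfu:.1%}")                     # ~10% with these illustrative numbers
```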

For example, on a single A100 machine, Llama2 13B training with TP2 + DP4 + ZeRO-1 is faster than with FSDP.

Only FP16/BF16 are supported in the FusedAttention class.

I am a developer of TensorFlow [recommenders-addons](https://github.com/tensorflow/recommenders-addons), and I now need to develop an all-to-all embedding layer for multi-GPU distributed training of recommendation models. The old TensorFlow distributed strategy clearly...

type:support

1. Like `tf.raw_ops.Bucketize`: input=[123.5, 0.7, 10.3, 100.6, 11.7], boundaries=[1, 11, 111], output=[3, 0, 1, 2, 2]. 2. Like `tf.dynamic_partition` (also `torch.ops.fbgemm.block_bucketize_sparse_features` in torch): input=[123.5, 0.7, 10.3, 100.6, 11.7], partition=[1, 0, 2, 0, 1], output=[0.7, 100.6] and [123.5, 11.7] and... (see the sketch after this entry)

question
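
A minimal TensorFlow sketch of the two behaviors described in the entry above, reproducing its example inputs and outputs (the `num_partitions=3` value is an assumption for the truncated third partition):

```python
import tensorflow as tf

values = tf.constant([123.5, 0.7, 10.3, 100.6, 11.7])

# 1. Bucketize each value against the boundaries [1, 11, 111].
bucket_ids = tf.raw_ops.Bucketize(input=values, boundaries=[1.0, 11.0, 111.0])
# bucket_ids -> [3, 0, 1, 2, 2]

# 2. Scatter values into partitions given an explicit partition id per element.
parts = tf.dynamic_partition(values, tf.constant([1, 0, 2, 0, 1]), num_partitions=3)
# parts[0] -> [0.7, 100.6], parts[1] -> [123.5, 11.7], parts[2] -> [10.3]
```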

I need to build a model like this: there is a very large, distributed, dynamic-shape embedding, which can be seen as a hash table. In every DP rank, when...

question
needs info

> jaxlib.xla_extension.XlaRuntimeError: UNIMPLEMENTED: No registered implementation for custom call to te_scaled_upper_triang_masked_softmax_forward for platform CUDA

```python
from transformer_engine.jax.flax.transformer import DotProductAttention, MultiHeadAttention, TransformerLayer
```

When I use the TE flax layers, all of...

bug
jax

Not GPipe: run the pipeline so that forward and backward passes overlap.

feature request

Matrix format: https://jax.readthedocs.io/en/latest/_autosummary/jax.experimental.sparse.BCSR.html#jax.experimental.sparse.BCSR
Some features in DeepSpeed: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp
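
For context, a minimal sketch of building the BCSR sparse format linked above in JAX; the matrix and its values are purely illustrative:

```python
import jax.numpy as jnp
from jax.experimental import sparse

# A mostly-zero dense matrix converted to the blocked-CSR (BCSR) format.
dense = jnp.zeros((4, 4)).at[0, 1].set(2.0).at[3, 2].set(5.0)
mat = sparse.BCSR.fromdense(dense)
print(mat.data, mat.indices, mat.indptr)  # block data, column indices, row pointers
```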

If I understand correctly, the autoregressive model has a loss, and the multi-task dense layers that follow the autoregressive model have a weighted loss. How should they be combined? And in a ranking model, how...
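
A minimal sketch of the most common way to combine the two objectives, a fixed weighted sum; all names and numbers here are hypothetical, not taken from the issue:

```python
import tensorflow as tf

# Hypothetical per-batch losses.
lm_loss = tf.constant(2.31)                  # autoregressive (next-token) loss
task_losses = tf.constant([0.45, 0.80])      # losses from the multi-task dense heads
task_weights = tf.constant([1.0, 0.5])       # hand-tuned per-task weights

# Combine: total = LM loss + weighted sum of the task losses.
total_loss = lm_loss + tf.reduce_sum(task_weights * task_losses)
```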