Yonghao Zhuang
I'm writing an assembler to generate relocatable object files, and I'm using the dump function of rv8 to debug (by the way, its output format is really nice). The file is little-endian. But I've...
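As a minimal sketch of the little-endian point (not the assembler in question), the snippet below encodes a single RISC-V I-type instruction and packs it into the 4-byte little-endian order an object-file section would store; the instruction choice and helper name are illustrative assumptions.

```python
import struct

# Minimal sketch: encode a RISC-V `addi rd, rs1, imm` (I-type) instruction
# and emit it as 4 little-endian bytes, i.e. the byte order the object file
# stores and a dump tool reads back.
def encode_addi(rd: int, rs1: int, imm: int) -> bytes:
    opcode = 0x13                      # OP-IMM
    funct3 = 0x0                       # ADDI
    word = ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode
    return struct.pack("<I", word)     # "<" forces little-endian byte order

# addi a0, a0, 1  ->  bytes 13 05 15 00
print(encode_addi(rd=10, rs1=10, imm=1).hex())
```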
In `jax.remat`, constant values and random numbers are generated in the forward pass and stored until the backward pass. An example is [this](https://gist.github.com/ZYHowell/96e31b8e43ec37a9ddfaac4aa1a559aa). To reduce memory consumption, we remat this...
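A minimal sketch of the behavior being discussed (not the linked gist): a layer whose forward pass draws a random mask. Without remat the mask and other intermediates are kept alive until the backward pass; wrapping the layer in `jax.remat` recomputes them during the backward pass instead. The layer shape and dropout rate are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def layer(w, x, key):
    # Random value produced in the forward pass; with remat it is regenerated
    # (from the same key) during the backward pass rather than stored.
    mask = jax.random.bernoulli(key, 0.9, x.shape)
    return jnp.sum(jnp.tanh(x @ w) * mask)

remat_layer = jax.remat(layer)   # alias of jax.checkpoint

w = jnp.ones((128, 128))
x = jnp.ones((32, 128))
key = jax.random.PRNGKey(0)
grads = jax.grad(remat_layer)(w, x, key)
print(grads.shape)
```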
Each microbatch runs the traced Jaxpr once. However, some parts of the Jaxpr are not related to the microbatch. This results in redundant computation and incorrect behavior. For example: ```python def...
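The example code in the snippet above is truncated, so here is a hypothetical illustration of the kind of loss function meant: the L2 penalty depends only on the parameters, not on the microbatch, yet replaying the whole Jaxpr per microbatch recomputes it (and effectively accumulates it) once per microbatch.

```python
import jax.numpy as jnp

def loss_fn(params, microbatch):
    pred = microbatch @ params
    data_loss = jnp.mean(pred ** 2)        # depends on the microbatch
    l2 = 1e-4 * jnp.sum(params ** 2)       # does NOT depend on the microbatch
    return data_loss + l2

params = jnp.ones((4,))
batch = jnp.ones((2, 4))
print(loss_fn(params, batch))
```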
This can be a starting point to learn `runtime_emitter` and `cross_mesh_resharding`. Background --- In Pipeshard Parallel, when a tensor needs to be received from a mesh, we always choose...
Background --- Alpa initializes a collective group for each cross-mesh communication pair. The call stack to initialize a collective group is: [`create_collective_group`](https://github.com/alpa-projects/alpa/blob/54c585c0e897914d7078d6f0243d12a19d1733f4/alpa/collective/collective.py#L169) or [`init_collective_group`](https://github.com/alpa-projects/alpa/blob/54c585c0e897914d7078d6f0243d12a19d1733f4/alpa/collective/collective.py#L138) in `collective.py`, which calls [`create_collective_group`](https://github.com/alpa-projects/alpa/blob/54c585c0e897914d7078d6f0243d12a19d1733f4/alpa/collective/collective.py#L72) of the `GroupManager` class...
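A simplified, illustrative sketch of the call-stack pattern described above (not Alpa's actual implementation; all names and fields here are assumptions): module-level helpers delegate to a singleton `GroupManager`, which creates and caches one collective group per name, i.e. per cross-mesh communication pair.

```python
class GroupManager:
    """Creates and caches collective groups, keyed by group name."""

    def __init__(self):
        self._groups = {}   # group_name -> group object

    def create_collective_group(self, world_size, rank, group_name):
        if group_name in self._groups:
            return self._groups[group_name]
        # Placeholder for the real group setup (e.g. NCCL communicator init).
        group = {"world_size": world_size, "rank": rank, "name": group_name}
        self._groups[group_name] = group
        return group

_group_mgr = GroupManager()

def init_collective_group(world_size, rank, group_name="default"):
    # Module-level entry point that delegates to the singleton manager.
    return _group_mgr.create_collective_group(world_size, rank, group_name)
```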
This can be a starting point to learn `runtime_emitter`. Background --- In Pipeshard Parallel, the final compilation step is to interpret the solution into [a configuration](https://github.com/alpa-projects/alpa/blob/fcd560d58e680b6d3c5098504242b49f527549ee/alpa/pipeline_parallel/runtime_emitter.py#L228-L255) containing all information about...
I'm running the Megatron-LM [BERT example](https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_bert.py) with Wikipedia data, and observed a loss divergence between TE v1.1 and v1.2. I then debugged by fixing the Megatron-LM/torch version and binary searched...
## Motivation When serving an extremely large model (e.g. Llama 400B), the number of GPUs might exceed the number of KV heads. This leads to replication of the KV cache, which is troublesome...
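A back-of-the-envelope sketch of the replication problem; the head count and tensor-parallel degree below are illustrative assumptions, not properties of any particular model.

```python
# With plain tensor parallelism, each KV head must be present on every GPU
# of the group that shares it, so once tp_degree > num_kv_heads the KV cache
# gets replicated.
num_kv_heads = 8     # e.g. a GQA model with 8 KV heads (assumption)
tp_degree = 16       # GPUs used to serve the model (assumption)

replication = max(1, tp_degree // num_kv_heads)
print(f"each KV head's cache is replicated on {replication} GPUs")  # -> 2
```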