hxdtest
```
(RayWorkerVllm pid=7009) warnings.warn("Initializing zero-element tensors is a no-op")
INFO 01-03 15:52:10 llm_engine.py:223] # GPU blocks: 86934, # CPU blocks: 8192
(RayWorkerVllm pid=7009) INFO 01-03 15:52:13 model_runner.py:394] Capturing the model...
```
OLMo doesn't work with Python 3.8. This is because `functools.cache` is not available before Python 3.9, and subscripting `MutableMapping` from `collections.abc` likewise doesn't work on Python 3.8, as described...
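Both incompatibilities are easy to reproduce in isolation. A minimal sketch of 3.8-compatible fallbacks, assuming these are the only two 3.9-only features involved:

```python
import functools
import sys

# functools.cache was added in Python 3.9; on 3.8 the equivalent is
# lru_cache with an unbounded size.
if sys.version_info >= (3, 9):
    cache = functools.cache
else:
    cache = functools.lru_cache(maxsize=None)

# Subscripting collections.abc classes (e.g. MutableMapping[str, int]) is
# also a 3.9+ feature; on 3.8 the typing aliases must be used instead.
if sys.version_info >= (3, 9):
    from collections.abc import MutableMapping
else:
    from typing import MutableMapping

@cache
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```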
In the script `scripts/run_with_environment.sh`, `FS_LOCAL_RANK` is set to `RANK`:
```
export RANK=$SLURM_PROCID
export FS_LOCAL_RANK=$SLURM_PROCID
```
If the job is not launched by `scripts/run_with_environment.sh` and all ranks share the same filesystem, every local...
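When that script is skipped, one workaround is to derive `FS_LOCAL_RANK` from the global rank yourself. A minimal sketch, assuming `RANK` is already set by the launcher and every rank sees the same shared filesystem:

```python
import os

# On a shared filesystem the filesystem-local rank can simply mirror the
# global rank; only set it if the launcher did not already provide one.
rank = int(os.environ["RANK"])
os.environ.setdefault("FS_LOCAL_RANK", str(rank))
```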
I use `python -m torch.distributed.run xxx` to launch the training processes. If `reduce_global_loss` is `True`, only `rank0` reduces the global loss and the other ranks don't. The metrics logged to the console...
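For reference, this asymmetry is how `torch.distributed.reduce` behaves: every rank participates in the collective, but only `dst` ends up holding the result. A minimal sketch with a hypothetical `maybe_reduce_loss` helper, not the project's actual code:

```python
import torch
import torch.distributed as dist

def maybe_reduce_loss(loss: torch.Tensor, reduce_global_loss: bool) -> torch.Tensor:
    # dist.reduce is collective: all ranks must call it, but only dst=0
    # receives the reduced value; the other ranks keep their local loss,
    # which is then what appears in their console metrics.
    if reduce_global_loss:
        dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
        if dist.get_rank() == 0:
            loss = loss / dist.get_world_size()
    return loss
```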
File "opensora/train/train_videogpt.py", line 55, in train(args, vqvae_args, training_args) File "opensora/train/train_videogpt.py", line 39, in train config = VQVAEConfiguration(**asdict(vqvae_args)) File "/hetero_infer/hanxudong.hxd/Open-Sora-Plan/opensora/models/ae/videobase/vqvae/modeling_vqvae.py", line 718, in __init__ self.config = config File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754,...
Please provide definitions and suggested usage for `NodeGroupResource`, `lunch_nodes`, and `removed_nodes`.
@LouisCastricato Yes, the idea is to give a fully reproducible script for anyone to train it themselves and validate it. _Originally posted by @reshinthadithyan in https://github.com/CarperAI/trlx/issues/81#issuecomment-1304880268_ I tried to...
In Megatron, I found the following check tying `tp_comm_overlap` to `sequence_parallel`:
```
if args.tp_comm_overlap:
    assert args.sequence_parallel == True, 'Tensor parallel communication/GEMM overlap can happen only when sequence parallelism is enabled'
```
...
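Since the quoted assertion is the whole constraint, a small self-contained sketch can show which flag combinations pass it (using a hypothetical `Namespace`, not Megatron's real argument parser):

```python
from argparse import Namespace

def check_tp_comm_overlap(args: Namespace) -> None:
    # Mirrors the Megatron guard quoted above: enabling tp_comm_overlap
    # without sequence_parallel trips the assertion.
    if args.tp_comm_overlap:
        assert args.sequence_parallel, (
            'Tensor parallel communication/GEMM overlap can happen only '
            'when sequence parallelism is enabled'
        )

check_tp_comm_overlap(Namespace(tp_comm_overlap=True, sequence_parallel=True))    # ok
check_tp_comm_overlap(Namespace(tp_comm_overlap=False, sequence_parallel=False))  # ok
# Namespace(tp_comm_overlap=True, sequence_parallel=False) would raise AssertionError
```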