hxdtest
```
(RayWorkerVllm pid=7009) warnings.warn("Initializing zero-element tensors is a no-op")
INFO 01-03 15:52:10 llm_engine.py:223] # GPU blocks: 86934, # CPU blocks: 8192
(RayWorkerVllm pid=7009) INFO 01-03 15:52:13 model_runner.py:394] Capturing the model...
```
OLMo doesn't work with Python 3.8. This is because `functools.cache` is not available before Python 3.9, and subscripting `MutableMapping` from `collections.abc` likewise doesn't work on Python 3.8, as described...
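Both incompatibilities are easy to reproduce in isolation. A minimal sketch of 3.8-compatible fallbacks, assuming these are the only two 3.9-only features involved:

```python
import functools
import sys

# functools.cache was added in Python 3.9; on 3.8 the equivalent is
# lru_cache with an unbounded size.
if sys.version_info >= (3, 9):
    cache = functools.cache
else:
    cache = functools.lru_cache(maxsize=None)

# Subscripting collections.abc classes (e.g. MutableMapping[str, int]) is
# also a 3.9+ feature; on 3.8 the typing aliases must be used instead.
if sys.version_info >= (3, 9):
    from collections.abc import MutableMapping
else:
    from typing import MutableMapping

@cache
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```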
In the script `scripts/run_with_environment.sh`, `FS_LOCAL_RANK` is set to `RANK`:
```
export RANK=$SLURM_PROCID
export FS_LOCAL_RANK=$SLURM_PROCID
```
If the job is not launched by `scripts/run_with_environment.sh` and all ranks share the same filesystem, every local...
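When that script is skipped, one workaround is to derive `FS_LOCAL_RANK` from the global rank yourself. A minimal sketch, assuming `RANK` is already set by the launcher and every rank sees the same shared filesystem:

```python
import os

# On a shared filesystem the filesystem-local rank can simply mirror the
# global rank; only set it if the launcher did not already provide one.
rank = int(os.environ["RANK"])
os.environ.setdefault("FS_LOCAL_RANK", str(rank))
```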
I use `python -m torch.distributed.run xxx` to launch the training processes. If `reduce_global_loss` is `True`, only `rank0` reduces the global loss and the other ranks don't. The metrics logged to the console...
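For reference, this asymmetry is how `torch.distributed.reduce` behaves: every rank participates in the collective, but only `dst` ends up holding the result. A minimal sketch with a hypothetical `maybe_reduce_loss` helper, not the project's actual code:

```python
import torch
import torch.distributed as dist

def maybe_reduce_loss(loss: torch.Tensor, reduce_global_loss: bool) -> torch.Tensor:
    # dist.reduce is collective: all ranks must call it, but only dst=0
    # receives the reduced value; the other ranks keep their local loss,
    # which is then what appears in their console metrics.
    if reduce_global_loss:
        dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
        if dist.get_rank() == 0:
            loss = loss / dist.get_world_size()
    return loss
```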
File "opensora/train/train_videogpt.py", line 55, in train(args, vqvae_args, training_args) File "opensora/train/train_videogpt.py", line 39, in train config = VQVAEConfiguration(**asdict(vqvae_args)) File "/hetero_infer/hanxudong.hxd/Open-Sora-Plan/opensora/models/ae/videobase/vqvae/modeling_vqvae.py", line 718, in __init__ self.config = config File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754,...
Please provide definitions and suggested usage for `NodeGroupResource`, `lunch_nodes`, and `removed_nodes`.
@LouisCastricato Yes, the idea is to give a fully reproducible script for anyone to train it themselves and validate it. _Originally posted by @reshinthadithyan in https://github.com/CarperAI/trlx/issues/81#issuecomment-1304880268_ I tried to...
In Megatron, I found the following check tying `tp_comm_overlap` to `sequence_parallel`:
```
if args.tp_comm_overlap:
    assert args.sequence_parallel == True, 'Tensor parallel communication/GEMM overlap can happen only when sequence parallelism is enabled'
```
...
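Since the quoted assertion is the whole constraint, a small self-contained sketch can show which flag combinations pass it (using a hypothetical `Namespace`, not Megatron's real argument parser):

```python
from argparse import Namespace

def check_tp_comm_overlap(args: Namespace) -> None:
    # Mirrors the Megatron guard quoted above: enabling tp_comm_overlap
    # without sequence_parallel trips the assertion.
    if args.tp_comm_overlap:
        assert args.sequence_parallel, (
            'Tensor parallel communication/GEMM overlap can happen only '
            'when sequence parallelism is enabled'
        )

check_tp_comm_overlap(Namespace(tp_comm_overlap=True, sequence_parallel=True))    # ok
check_tp_comm_overlap(Namespace(tp_comm_overlap=False, sequence_parallel=False))  # ok
# Namespace(tp_comm_overlap=True, sequence_parallel=False) would raise AssertionError
```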