_githubsgi
Sure, I can do a PR.
As I mentioned above, this is for non-torchrun launchers such as mpirun/mpiexec. Torchrun is very specific to torch; HPC environments run many other applications which cannot use torchrun. HPC...
@tianyu-l and @fegin, thanks for your replies. Let me try to address the questions one by one below. 1. **Setting the variable in run_train.sh**: It does not work because...
@TJ-Solergibert, it has the same issue as point 1 above, if I understand the script correctly.
@TJ-Solergibert, thanks for pointing out the following line for the SLURM launcher: `srun $SRUN_ARGS bash -c "RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID $CMD" 2>&1 | tee -a $LOG_PATH`. The above also would...
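For launchers that do not go through torchrun, a small shim can translate the scheduler-provided variables into the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` names that torch.distributed reads at init time. A minimal sketch, assuming the SLURM variable names shown above (the helper name `map_slurm_env` is hypothetical, not an existing API):

```python
import os

# Hypothetical mapping: SLURM's per-process variables onto the env
# vars that torch.distributed expects to find at init time.
_SLURM_TO_TORCH = {
    "SLURM_PROCID": "RANK",         # global rank of this process
    "SLURM_LOCALID": "LOCAL_RANK",  # rank of this process on its node
    "SLURM_NTASKS": "WORLD_SIZE",   # total number of processes
}

def map_slurm_env(env=os.environ):
    """Copy SLURM launcher variables into torch-style names, if present."""
    for slurm_key, torch_key in _SLURM_TO_TORCH.items():
        if slurm_key in env and torch_key not in env:
            env[torch_key] = env[slurm_key]
    return env

# Example with a fake environment dict (no real SLURM allocation needed):
fake = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}
map_slurm_env(fake)
print(fake["RANK"], fake["LOCAL_RANK"], fake["WORLD_SIZE"])  # 3 1 8
```

This is essentially what the `bash -c "RANK=\$SLURM_PROCID ..."` wrapper does inline, done once in Python instead of per launch command.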
@fegin, I have already addressed the complexity of that approach above. In fact, the approach suggested by @TJ-Solergibert is simpler and more maintainable. The two environment variables I...
The toml file follows.

```toml
[job]
dump_folder = "./outputs_llama4_17bx16e"
description = "Llama 4 Scout 17Bx16E training"

[profiling]
enable_profiling = false
save_traces_folder = "profile_trace"
profile_freq = 100

[metrics]
log_freq = 10
...
```
@tianyu-l, please see the answers below. What model configs are you using? - The toml file is above, if that is what you are asking. Otherwise, the source is...
Looks like checkpoint recompute has issue/s - [this looks funny](https://github.com/pytorch/pytorch/blob/7f28c03fac11dc3cf37da36def7e0857c331843d/torch/utils/checkpoint.py#L1125).

```python
try:
    with _recomputation_hook(
        weakref.ref(frame), gid
    ), torch.autograd.enable_grad():
        frame.recompute_fn(*args)
except _StopRecomputationError:
    pass
frame.is_recomputed[gid] = True
frame.check_recomputed_tensors_match(gid)
```

One interesting...
Interesting pointer. Both the original and the recompute above sum to 1024, but differ in two split locations (99 vs 98, and 105 vs 106). What does the...
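To illustrate why matching totals are not enough: the recompute consistency check compares per-tensor metadata, so two split layouts that both sum to 1024 still mismatch if any individual chunk size differs. A minimal sketch in plain Python (the function is illustrative, not the actual `check_recomputed_tensors_match` internals):

```python
def check_splits_match(original, recomputed):
    """Mimic a per-chunk metadata check: totals may agree while chunks differ."""
    assert sum(original) == sum(recomputed), "totals differ"
    # Compare chunk-by-chunk, like comparing recorded vs recomputed shapes.
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(original, recomputed))
        if a != b
    ]

# Both layouts sum to 1024, but two chunk sizes differ (99 vs 98, 105 vs 106),
# which is exactly the kind of mismatch a recompute consistency check flags.
original   = [99, 105] + [205] * 4
recomputed = [98, 106] + [205] * 4
print(sum(original), sum(recomputed))            # 1024 1024
print(check_splits_match(original, recomputed))  # [(0, 99, 98), (1, 105, 106)]
```

If the split sizes come from a non-deterministic computation inside the checkpointed region, the forward and the recompute can legitimately disagree like this even though both partitions are "valid".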