_githubsgi
Sure, I can do a PR.
As I mentioned above, this is for non-torchrun launchers such as mpirun/mpiexec. Torchrun is very specific to torch; HPC environments run many other applications which cannot use torchrun. HPC...
@tianyu-l and @fegin, thanks for your replies. Let me try to address the questions one by one below. 1. **Setting the variable in run_train.sh**: It does not work because...
@TJ-Solergibert, it has the same issue as point 1 above, if I understand the script correctly.
@TJ-Solergibert, thanks for pointing out the following line for the SLURM launcher: `srun $SRUN_ARGS bash -c "RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID $CMD" 2>&1 | tee -a $LOG_PATH`. The above also would...
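For launchers that do not go through torchrun, a small shim can translate the scheduler-provided variables into the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` names that torch.distributed reads at init time. A minimal sketch, assuming the SLURM variable names shown above (the helper name `map_slurm_env` is hypothetical, not an existing API):

```python
import os

# Hypothetical mapping: SLURM's per-process variables onto the env
# vars that torch.distributed expects to find at init time.
_SLURM_TO_TORCH = {
    "SLURM_PROCID": "RANK",         # global rank of this process
    "SLURM_LOCALID": "LOCAL_RANK",  # rank of this process on its node
    "SLURM_NTASKS": "WORLD_SIZE",   # total number of processes
}

def map_slurm_env(env=os.environ):
    """Copy SLURM launcher variables into torch-style names, if present."""
    for slurm_key, torch_key in _SLURM_TO_TORCH.items():
        if slurm_key in env and torch_key not in env:
            env[torch_key] = env[slurm_key]
    return env

# Example with a fake environment dict (no real SLURM allocation needed):
fake = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}
map_slurm_env(fake)
print(fake["RANK"], fake["LOCAL_RANK"], fake["WORLD_SIZE"])  # 3 1 8
```

This is essentially what the `bash -c "RANK=\$SLURM_PROCID ..."` wrapper does inline, done once in Python instead of per launch command.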
@fegin, I have already addressed the complexity of that approach above. In fact, the approach suggested by @TJ-Solergibert is simpler and more maintainable. The two environment variables I...
The toml file follows.

```toml
[job]
dump_folder = "./outputs_llama4_17bx16e"
description = "Llama 4 Scout 17Bx16E training"

[profiling]
enable_profiling = false
save_traces_folder = "profile_trace"
profile_freq = 100

[metrics]
log_freq = 10
...
```
@tianyu-l, please see the answers below. What model configs are you using? - The toml file is above, if that is what you are asking. Otherwise, the source is...
Looks like checkpoint recompute has issue/s - [this looks funny](https://github.com/pytorch/pytorch/blob/7f28c03fac11dc3cf37da36def7e0857c331843d/torch/utils/checkpoint.py#L1125).

```python
try:
    with _recomputation_hook(
        weakref.ref(frame), gid
    ), torch.autograd.enable_grad():
        frame.recompute_fn(*args)
except _StopRecomputationError:
    pass
frame.is_recomputed[gid] = True
frame.check_recomputed_tensors_match(gid)
```

One interesting...
Interesting pointer. Both the original and the recompute above sum to 1024, but differ in two split locations (99 vs 98, and 105 vs 106). What does the...
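To illustrate why matching totals are not enough: the recompute consistency check compares per-tensor metadata, so two split layouts that both sum to 1024 still mismatch if any individual chunk size differs. A minimal sketch in plain Python (the function is illustrative, not the actual `check_recomputed_tensors_match` internals):

```python
def check_splits_match(original, recomputed):
    """Mimic a per-chunk metadata check: totals may agree while chunks differ."""
    assert sum(original) == sum(recomputed), "totals differ"
    # Compare chunk-by-chunk, like comparing recorded vs recomputed shapes.
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(original, recomputed))
        if a != b
    ]

# Both layouts sum to 1024, but two chunk sizes differ (99 vs 98, 105 vs 106),
# which is exactly the kind of mismatch a recompute consistency check flags.
original   = [99, 105] + [205] * 4
recomputed = [98, 106] + [205] * 4
print(sum(original), sum(recomputed))            # 1024 1024
print(check_splits_match(original, recomputed))  # [(0, 99, 98), (1, 105, 106)]
```

If the split sizes come from a non-deterministic computation inside the checkpointed region, the forward and the recompute can legitimately disagree like this even though both partitions are "valid".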