Stas Bekman

664 comments by Stas Bekman

Great idea! I did:
```
#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"
```
and got:
```
$ scontrol show -d job...
```

I realized I got my diagnostics wrong. `scontrol show -d job` shows the `sbatch`/`salloc` settings; it doesn't know anything about `srun`. Using `len(os.sched_getaffinity(0))` should give us the correct diagnostics, as...
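For illustration, a minimal sketch (not from the original comment) of that diagnostic: it reports the CPU cores the current process is actually allowed to run on, which reflects the constraints applied by `srun`, rather than what `scontrol show -d job` reports for the `sbatch`/`salloc` allocation.
```
# Print the CPUs actually usable by this process (run it inside the srun task).
import os

cpus = os.sched_getaffinity(0)          # set of CPU ids this process may run on
print(f"visible CPU cores: {len(cpus)}")
print(f"core ids: {sorted(cpus)}")
```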

You're correct, @srmsoumya - this was a copy-n-paste from the `torchrun` setup, where it's always one task. Fixed [here](https://github.com/stas00/ml-engineering/blob/master/orchestration/slurm/launchers/lightning-launcher.slurm). Thank you very much for the heads up.

I see that it affected a few tests, which now fail since the traceback printout changed. I trust you will know whether the tests need to be adjusted or perhaps...

Yes, someone else has just mentioned to me that `%debug` is affected, since it relies on `locals()` being set. So more work needs to be done. I don't...

> Thank you so much for investigating and fixing this one! Are you able to update the tests too, to make them less strict about the output? :)

First, help...

Found a solution for making the `%pdb on` magic work correctly, by simply checking the flag and not stripping the frames in that case:
```
--- a/IPython/core/interactiveshell.py
+++ b/IPython/core/interactiveshell.py
@@ -1950,7 +1950,7 @@ def _get_exc_info(self,...
```
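The diff above is truncated; as a rough, standalone sketch of the idea (the function and flag names here are illustrative, not IPython's actual internals), the pattern is to strip the traceback's frame locals only when the debugger won't need them afterwards:
```
# Hypothetical sketch of the guard pattern, not the actual IPython patch.
import traceback

def handle_exception(exc, want_pdb=False):
    tb = exc.__traceback__
    if not want_pdb:
        # Releasing frame locals avoids keeping large objects alive,
        # but would break post-mortem debugging if done unconditionally.
        traceback.clear_frames(tb)
    return exc.__class__, exc, tb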

> > Thank you so much for investigating and fixing this one! Are you able to update the tests too, to make them less strict about the output?

@rgbkrk, once...

Thank you for your follow-up, @fperez, for sharing the history of `%debug`, and for suggesting a way to move forward.

> The question that remains is then what to do...

BTW, through trial and error I learned that the leak had to do with saving the exception object, and that is how I stumbled upon `traceback.clear_frames(tb)`. I saw it quite...
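For illustration, a small self-contained sketch (assumptions mine, not code from the discussion) of why saving the exception object leaks: the saved exception keeps its traceback, the traceback keeps the frames, and the frames keep their locals alive; `traceback.clear_frames()` drops those locals so the referenced objects can be garbage collected while the exception object itself is retained.
```
import traceback

def make_big_locals():
    big = bytearray(100 * 1024 * 1024)  # held alive by this frame's locals
    raise ValueError("boom")

saved = None
try:
    make_big_locals()
except ValueError as e:
    saved = e  # keeping `e` keeps the 100MB bytearray reachable via its traceback

# Release the frame locals while still keeping the exception object around.
traceback.clear_frames(saved.__traceback__)
```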