Stas Bekman comments

Results: 664 comments of Stas Bekman

convert markdown to pdf

Thank you for validating that it is a good read, @pengzhangzhi

#185 Update Linux install script to refer to the correct relative paths

It probably depends on what's the most recommended way of running things. And it's in flux most of the time as ML evolves. Normally, torchrun (and previously torch.launch) was the...

Save and resume the state of a DataLoader

Amazing! Thank you, @lhoestq. Does it work with non-iterable datasets as well? The docs only mention iterable datasets.

Save and resume the state of a DataLoader

Thank you very much for clarifying that, Andrew.

Why does running Lightning on SLURM with python perform worse than with srun?

@Muennighoff, what's inside `train.py`? The way you're launching it in many frameworks leads to DP and not DDP (but perhaps not in PTL). So you might not be comparing apples to...

Why does running Lightning on SLURM with python perform worse than with srun?

Thank you for sharing your insights, Konstantin. In general, the problem is that https://github.com/nviDIA/nemo, which is what we use, abstracts all PTL bits away, giving the user an API that...

Why does running Lightning on SLURM with python perform worse than with srun?

That's an interesting discovery, @schlabrendorff - though I'm not sure this is always the case. At least it doesn't seem to impact my setup. I checked: I'm using SLURM 22.05.09...

Why does running Lightning on SLURM with python perform worse than with srun?

Ah, that's a possibility since perhaps they have planned to switch in 22.05 but didn't do it until 23.x. When I get a chance I can try the reverse -...

Why does running Lightning on SLURM with python perform worse than with srun?

OK, I tested that w/ or w/o `-exclusive` I get all the available cores in the setup [above](https://github.com/Lightning-AI/pytorch-lightning/issues/18650#issuecomment-1872637017): ``` $ scontrol show -d job 3182_1 | grep CPUs/Task NumNodes=1 NumCPUs=48...

Why does running Lightning on SLURM with python perform worse than with srun?

How do I get `Request Defaults`?