Jérémy Rapin

19 comments by Jérémy Rapin

The way I see it, we can easily get the start time and the preemption times from the logs. Submission time is harder: either we append it to the log...

I would be careful to avoid extra calls to the cluster, unless everything goes through the watcher.

This has been here for quite some time and seems useful, so merging it ;)

Hi! Did it work? I expect it should. We're using this wckey internally to be able to check which jobs are launched through submitit. Sorry it makes a mess :s...

Hi @JJGO! > Only the task with global_rank 0 gets the checkpointing signal (i.e. the call to .checkpoint) and has the opportunity to return a DelayedSubmission object. That's absolutely...

> From my understanding, this change enables someone not using submitit to still be able to retrieve those environment variables that are normally set by `torchrun`. Can torchrun be used...
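For readers unfamiliar with the variables in question: `torchrun` exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` to each worker, while Slurm exposes comparable values as `SLURM_PROCID`, `SLURM_LOCALID` and `SLURM_NTASKS`. The sketch below only illustrates reading them with a fallback. The `distributed_env` helper is hypothetical, it is not part of submitit or of the change discussed here, and the defaults of 0 and 1 for a single-process run are an assumption.

```python
import os


def distributed_env(environ=None):
    """Hypothetical helper: read rank information from torchrun-style
    variables, falling back to Slurm ones, then to single-process defaults."""
    env = os.environ if environ is None else environ

    def pick(torch_name, slurm_name, default):
        # Prefer the torchrun variable, then the Slurm one, then the default.
        return int(env.get(torch_name, env.get(slurm_name, default)))

    return {
        "rank": pick("RANK", "SLURM_PROCID", 0),
        "local_rank": pick("LOCAL_RANK", "SLURM_LOCALID", 0),
        "world_size": pick("WORLD_SIZE", "SLURM_NTASKS", 1),
    }


# Reads the current process environment; outside a launcher this yields
# whatever variables happen to be set, or the single-process defaults.
info = distributed_env()
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without a cluster.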

It's not possible for now but should be easy enough to add if need be. What's your use case though? In case it can be dealt with in any other...

Ok, more questions: Would you need it for all executors or for slurm only? (`setup` is slurm only) Could that be performed in Python or does this need to ba...

Typically [ContextDecorator](https://docs.python.org/3/library/contextlib.html#contextlib.ContextDecorator) would seem to be a good fit. What do your logs look like? If you have something robust maybe we can include it as a helper in submitit...

This looks like it can most definitely be framed as a contextmanager decorator indeed. How standard are dstat and dcgmi? I am not familiar with this. I heard about python...
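As a rough illustration of the `ContextDecorator` idea, here is a minimal sketch of a class that works both as a `with` block and as a function decorator. The `timed` class and the `train` function are hypothetical names, and a wall-clock timer stands in for the actual monitoring: starting and stopping tools such as dstat or dcgmi would go in `__enter__` and `__exit__`, but their invocation is not shown because it is not described in the thread.

```python
import time
from contextlib import ContextDecorator


class timed(ContextDecorator):
    """Hypothetical helper: measures wall-clock time of a block or a function."""

    def __init__(self, label):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        # A real monitor would start its background process here.
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real monitor would stop its process and flush its logs here.
        self.elapsed = time.perf_counter() - self._start
        print(f"[{self.label}] took {self.elapsed:.3f}s")
        return False  # never swallow exceptions raised by the wrapped code


@timed("train")  # used as a decorator
def train(steps):
    return sum(range(steps))


monitor = timed("block")
with monitor:  # used as a context manager
    total = train(1000)
```

Because the same class covers both usages, a function passed to an executor could be decorated once without changing its call sites.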