Jacob Danovitch

Results: 32 comments by Jacob Danovitch

Took a holiday break from this while our cluster was down for maintenance for a bit. Turns out that checkpointing/barrier issue might be more complicated than I thought, but not...

Ah I think I see the real issue here. It's not the logging itself hanging. 1. (All ranks) My trainer tells my checkpointer to save **if** it's the master process...
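
The comment above is cut off, so the remaining steps are not shown here. As a rough illustration of the general failure mode it appears to describe (a save guarded by an "is master" check while the save itself waits on every rank), here is a minimal sketch that simulates ranks with threads. `WORLD_SIZE`, `save_checkpoint`, and `run` are hypothetical names chosen for illustration, and `threading.Barrier` stands in for a real distributed barrier; this is not AllenNLP or DeepSpeed code.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of ranks, simulated with threads


def save_checkpoint(barrier, timeout=0.5):
    """Hypothetical stand-in for a collective save: every rank must arrive."""
    try:
        barrier.wait(timeout=timeout)
        return "saved"
    except threading.BrokenBarrierError:
        # A real distributed barrier would block forever; the timeout only
        # lets this sketch terminate and report the hang.
        return "hung"


def run(guard_on_master):
    """Run one save step on every simulated rank and collect the outcomes."""
    barrier = threading.Barrier(WORLD_SIZE)
    results = {}

    def worker(rank):
        if guard_on_master and rank != 0:
            # Non-master ranks skip the save, so they never reach the barrier.
            results[rank] = "skipped"
            return
        results[rank] = save_checkpoint(barrier)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run(guard_on_master=True))   # rank 0 waits alone at the barrier
    print(run(guard_on_master=False))  # all ranks arrive, the save completes
```

In this sketch the guarded version leaves rank 0 waiting alone, while calling the save from every rank lets it finish. Whether that matches the exact sequence in the truncated comment is an assumption.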

> Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right? Their typical training loop is something like ([source](https://www.deepspeed.ai/getting-started/#model-checkpointing)):...

Yeah that should work perfectly, I'll give it a try.

> I think this would be way easier if we would just return the edge indices in `random_walk` :( I guess your approach works, but will be indeed inefficient.

Yeah,...

> If you are interested, we can add support for this in [`pyg-lib`](https://pyg-lib.readthedocs.io/en/latest/modules/sampler.html#pyg_lib.sampler.random_walk), which should be straightforward to add. It also supports nightly builds so it should be ready to...

> @jacobdanovitch Your example code will now simplify as follows:

Works perfectly, thanks so much @pbielak!

> It's not public yet. But I have just provided you read access. you may take a look, thanks

Access to this would be greatly appreciated as well, thank you!

Can vouch for using `dask-mpi` within AML jobs, it works flawlessly. Haven't had a single issue so far.

@jacobtomlinson For sure, what kind of example would be helpful? There wasn't anything extra I had to do to make it work, I just used the mpi distribution in AML...