Alex Morehead comments

Results: 88 comments by Alex Morehead

Add optional `start_method` argument to `_cli.py`

I've also opened a PR for [torchtnt](https://github.com/pytorch/tnt/pull/1030) that will make multi-GPU (local) training work when `scheduler.start_method=fork`! This allows one to train on a non-SLURM cluster with multiple GPUs and multiple...

Add optional `start_method` argument to `_cli.py`

I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: `malloc(): invalid next size (unsorted)`. This...

Add optional `start_method` argument to `_cli.py`

@rayg1234, I just discovered that I can fix the segfault error (with `cluster.mode=SLURM`) I mentioned before by `pip install`ing `fairchem==2.4.0` (versus the latest commit in `main`) which installs `lmdb==1.7.0` vs....

Add optional `start_method` argument to `_cli.py`

Thanks for the update, @misko!

Step counter reset upon resuming training job

@arunraja-hub, would you be able to describe what you originally saw regarding this issue? I've seen it myself every time I try to point `fairchem -c` to a (timed-out) training...

FusedAttention Bug (at backward with causal=False)

I'm pretty sure this is because `backward` (regardless of the value of the `causal` argument) is skipping the upper-right (triangular) tokens, effectively assuming the upper-right pairwise token features do not...
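To illustrate why skipping the upper-right entries only makes sense under causal masking, here is a toy sketch in pure Python. It is not the FusedAttention kernel: it assumes a single head, scalar values per token, and the loss `L = sum_i sum_j P[i][j] * values[j]`, and the function names are made up for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def score_grads(scores, values, causal):
    """Gradient of L = sum_i sum_j P[i][j] * values[j] w.r.t. the raw scores.

    For one row, dL/ds_k = p_k * (v_k - sum_j p_j * v_j).
    """
    grads = []
    for i, raw in enumerate(scores):
        # Causal masking sets scores above the diagonal to -inf.
        row = [s if (not causal or j <= i) else float("-inf")
               for j, s in enumerate(raw)]
        p = softmax(row)
        out = sum(pj * vj for pj, vj in zip(p, values))
        grads.append([pj * (vj - out) for pj, vj in zip(p, values)])
    return grads


def upper_triangle(matrix):
    """Entries strictly above the diagonal."""
    return [matrix[i][j] for i in range(len(matrix))
            for j in range(i + 1, len(matrix))]


scores = [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2], [0.3, -0.1, 0.4]]
values = [1.0, 2.0, 4.0]
causal_upper = upper_triangle(score_grads(scores, values, causal=True))
full_upper = upper_triangle(score_grads(scores, values, causal=False))
```

In this toy setting every entry of `causal_upper` is zero while the entries of `full_upper` are not, so a backward pass that skips the upper triangle is only consistent with `causal=True`.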

FusedAttention Bug (at backward with causal=False)

Hi, @t0278611. The reason I noticed this discrepancy is that I recently implemented my own version of the forward and backward pass, now with support for arbitrary `attn_mask` arguments. Feel...

FusedAttention Bug (at backward with causal=False)

Hi, @t0278611. Since the changes I made to the Flash Attention kernel (to get JVP/HVP support) were quite substantial, I think it warrants having the code in a separate (maintained)...