Alex Morehead
I've also opened a PR for [torchtnt](https://github.com/pytorch/tnt/pull/1030) that will make multi-GPU (local) training work when `scheduler.start_method=fork`! This allows one to train on a non-SLURM cluster with multiple GPUs and multiple...
I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: `malloc(): invalid next size (unsorted)`. This...
@rayg1234, I just discovered that I can fix the segfault error (with `cluster.mode=SLURM`) I mentioned before by `pip install`ing `fairchem==2.4.0` (versus the latest commit in `main`), which installs `lmdb==1.7.0` vs....
Thanks for the update, @misko!
@arunraja-hub, would you be able to describe what you originally saw regarding this issue? I've seen it myself every time I try to point `fairchem -c` to a (timed-out) training...
I'm pretty sure this is because `backward` (regardless of the value of the `causal` argument) is skipping the upper-right (triangular) tokens, effectively assuming the upper-right pairwise token features do not...
Hi, @t0278611. The reason I noticed this discrepancy is that I recently implemented my own version of the forward and backward pass, now with support for arbitrary `attn_mask` arguments. Feel...
Hi, @t0278611. Since the changes I made to the Flash Attention kernel (to get JVP/HVP support) were quite substantial, I think they warrant having the code in a separate (maintained)...