Less Wright

44 issues by Less Wright

We have a test in another repo; move it and integrate it into torchtrain to ensure future changes don't introduce issues.

better_engineering

Adding this as a tracking issue to unblock https://github.com/pytorch/torchtrain/pull/181 from landing. Per @wanchaol: "IMO we should also register the fwd/bwd rmsnorm kernel as a PyTorch op, this is so that:..."

bug
enhancement
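A minimal sketch of what registering an rmsnorm kernel as a PyTorch op could look like via torch.library; the "torchtrain" namespace, op schema, and eager fallback below are assumptions for illustration, not the actual PR's code:

```
import torch

# Sketch only: namespace, schema, and eager fallback are hypothetical.
lib = torch.library.Library("torchtrain", "DEF")
lib.define("rmsnorm_fwd(Tensor x, Tensor weight, float eps) -> Tensor")

def rmsnorm_fwd_impl(x, weight, eps):
    # Reference eager implementation; the real version would register a
    # fused Triton kernel for the CUDA dispatch key instead.
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight

lib.impl("rmsnorm_fwd", rmsnorm_fwd_impl, "CompositeExplicitAutograd")

# Now callable as a first-class op that autograd/compile machinery can see:
# torch.ops.torchtrain.rmsnorm_fwd(x, w, 1e-6)
```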

As part of e2e training, we encountered wild loss-curve spikes. After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially, so to...

enhancement
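One common remedy for sequential reads over a streaming dataset is an approximate shuffle through a bounded buffer; a sketch (the helper name and buffer size are hypothetical, not torchtrain's actual fix):

```
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(stream: Iterable[T], buffer_size: int = 10_000,
                     seed: int = 0) -> Iterator[T]:
    # Fill a buffer from the sequential stream, then emit random elements
    # from it, so consecutive samples are no longer adjacent in the source.
    rng = random.Random(seed)
    buf: list[T] = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf
```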

to help monitor training stability.

enhancement

For the price of 4 additional tokens (the first four), we can enable streaming window attention and extremely long context lengths (1-4M?). "we introduce StreamingLLM, an efficient framework that enables...

enhancement
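A rough sketch of the StreamingLLM-style cache policy described above: keep the first few "attention sink" tokens plus a sliding window of recent tokens, evicting everything in between. The helper name and cache shapes are assumptions:

```
import torch

def evict_kv_cache(keys, values, window: int, n_sinks: int = 4):
    # keys/values assumed shaped [batch, heads, seq_len, head_dim].
    # Keep the first n_sinks tokens (attention sinks) plus the most
    # recent `window` tokens; drop the middle of the cache.
    seq_len = keys.size(2)
    if seq_len <= n_sinks + window:
        return keys, values
    keep = torch.cat(
        [torch.arange(n_sinks), torch.arange(seq_len - window, seq_len)]
    )
    return keys[:, :, keep], values[:, :, keep]
```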

Would like to discuss adding a 'dry_run' flag as a general option. What it does: if the user adds --dry_run, then the config file specified is still run with everything the...

enhancement
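A minimal sketch of how such a flag might be wired in, assuming plain argparse rather than torchtrain's actual config system:

```
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dry_run",
    action="store_true",
    help="parse and validate the full config, then exit before training",
)
args, _ = parser.parse_known_args()

# ... config file is loaded and validated here, exactly as in a real run ...

if args.dry_run:
    print("dry run complete: config parsed and validated; exiting.")
    sys.exit(0)
```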

In parallelism/__init__.py, build_mesh uses zip with the kwarg strict=True:
~~~
for d, name in zip(
    [self.dp, self.sp, self.pp],
    ["dp", "sp", "pp"],
    strict=True,
):
~~~
This is apparently a 3.10+ keyword,...

documentation
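If older Pythons need to be supported, a small polyfill could stand in for the keyword; a sketch (the zip_strict helper is made up, and the example values stand in for self.dp/self.sp/self.pp):

```
def zip_strict(*iterables):
    # Hypothetical stand-in for zip(..., strict=True) on Python < 3.10:
    # materialize the inputs and verify equal lengths before zipping.
    seqs = [list(it) for it in iterables]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("zip_strict() arguments have unequal lengths")
    return zip(*seqs)

for d, name in zip_strict([8, 1, 1], ["dp", "sp", "pp"]):
    print(name, d)
```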

Currently, torch compiling the default llama model will generate a warning re: being unable to lower complex numbers:
```
torch/_inductor/lowering.py:1639: UserWarning: Torchinductor does not support code generation for complex operators....
```

bug
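The complex ops typically come from the rotary embeddings; one common workaround is to express them with real-valued cos/sin math that inductor can lower. A sketch under that assumption (function name and shapes are illustrative, not the actual torchtrain change):

```
import torch

def apply_rope_real(x: torch.Tensor, cos: torch.Tensor,
                    sin: torch.Tensor) -> torch.Tensor:
    # x: [..., seq, dim] with dim even; cos/sin: [seq, dim // 2].
    # Rotates interleaved (even, odd) feature pairs with real-valued
    # cos/sin instead of complex multiplication.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```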

Showcase context parallelism here once that feature is ready.

enhancement

I'm hitting an issue, though, in using/testing this, as the code seems to assume no parameter groups? (from utils.py)
```
def get_grad_list(params):
    return [p.grad for p in params]
```
This fails b/c p.grad...
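A sketch of one possible fix, flattening optimizer-style parameter groups before collecting grads; this is a guess at the intended behavior, not the repo's actual patch:

```
def get_grad_list(params):
    # Accept either a flat iterable of parameters or optimizer-style
    # parameter groups (dicts with a "params" key), flattening the latter.
    flat = []
    for p in params:
        if isinstance(p, dict):
            flat.extend(p["params"])
        else:
            flat.append(p)
    return [p.grad for p in flat]
```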