Less Wright

44 issues by Less Wright

We have a test in another repo; move it and integrate it into torchtrain to ensure future changes don't introduce issues.

better_engineering

Adding this as a tracking issue to unblock https://github.com/pytorch/torchtrain/pull/181 from landing. Per @wanchaol: "IMO we should also register the fwd/bwd rmsnorm kernel as a PyTorch op, this is so that:..."

bug
enhancement
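A minimal sketch of what registering an rmsnorm kernel as a PyTorch op could look like via torch.library; the "torchtrain" namespace, op schema, and eager fallback below are assumptions for illustration, not the actual PR's code:

```
import torch

# Sketch only: namespace, schema, and eager fallback are hypothetical.
lib = torch.library.Library("torchtrain", "DEF")
lib.define("rmsnorm_fwd(Tensor x, Tensor weight, float eps) -> Tensor")

def rmsnorm_fwd_impl(x, weight, eps):
    # Reference eager implementation; the real version would register a
    # fused Triton kernel for the CUDA dispatch key instead.
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight

lib.impl("rmsnorm_fwd", rmsnorm_fwd_impl, "CompositeExplicitAutograd")

# Now callable as a first-class op that autograd/compile machinery can see:
# torch.ops.torchtrain.rmsnorm_fwd(x, w, 1e-6)
```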

As part of e2e training, we encountered wild loss-curve spikes. After additional hyperparameter tuning and further investigation, the root cause is that we are reading the dataset sequentially, so to...

enhancement
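One common remedy for sequential reads over a streaming dataset is an approximate shuffle through a bounded buffer; a sketch (the helper name and buffer size are hypothetical, not torchtrain's actual fix):

```
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(stream: Iterable[T], buffer_size: int = 10_000,
                     seed: int = 0) -> Iterator[T]:
    # Fill a buffer from the sequential stream, then emit random elements
    # from it, so consecutive samples are no longer adjacent in the source.
    rng = random.Random(seed)
    buf: list[T] = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf
```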

to help monitor training stability.

enhancement

For the price of 4 additional tokens (the first four), we can enable streaming window attention and extremely long context lengths (1-4M?). "we introduce StreamingLLM, an efficient framework that enables...

enhancement
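A rough sketch of the StreamingLLM-style cache policy described above: keep the first few "attention sink" tokens plus a sliding window of recent tokens, evicting everything in between. The helper name and cache shapes are assumptions:

```
import torch

def evict_kv_cache(keys, values, window: int, n_sinks: int = 4):
    # keys/values assumed shaped [batch, heads, seq_len, head_dim].
    # Keep the first n_sinks tokens (attention sinks) plus the most
    # recent `window` tokens; drop the middle of the cache.
    seq_len = keys.size(2)
    if seq_len <= n_sinks + window:
        return keys, values
    keep = torch.cat(
        [torch.arange(n_sinks), torch.arange(seq_len - window, seq_len)]
    )
    return keys[:, :, keep], values[:, :, keep]
```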

Would like to discuss adding a 'dry_run' flag as a general option. What it does: if the user adds --dry_run, then the config file specified is still run with everything the...

enhancement
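A minimal sketch of how such a flag might be wired in, assuming plain argparse rather than torchtrain's actual config system:

```
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dry_run",
    action="store_true",
    help="parse and validate the full config, then exit before training",
)
args, _ = parser.parse_known_args()

# ... config file is loaded and validated here, exactly as in a real run ...

if args.dry_run:
    print("dry run complete: config parsed and validated; exiting.")
    sys.exit(0)
```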

In parallelism/__init__.py, build_mesh uses zip with the kwarg strict=True:
~~~
for d, name in zip(
    [self.dp, self.sp, self.pp],
    ["dp", "sp", "pp"],
    strict=True,
):
~~~
This is apparently a 3.10+ keyword,...

documentation
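If older Pythons need to be supported, a small polyfill could stand in for the keyword; a sketch (the zip_strict helper is made up, and the example values stand in for self.dp/self.sp/self.pp):

```
def zip_strict(*iterables):
    # Hypothetical stand-in for zip(..., strict=True) on Python < 3.10:
    # materialize the inputs and verify equal lengths before zipping.
    seqs = [list(it) for it in iterables]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("zip_strict() arguments have unequal lengths")
    return zip(*seqs)

for d, name in zip_strict([8, 1, 1], ["dp", "sp", "pp"]):
    print(name, d)
```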

Currently, torch compiling the default llama model will generate a warning re: being unable to lower complex numbers:
```
torch/_inductor/lowering.py:1639: UserWarning: Torchinductor does not support code generation for complex operators....
```

bug
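The complex ops typically come from the rotary embeddings; one common workaround is to express them with real-valued cos/sin math that inductor can lower. A sketch under that assumption (function name and shapes are illustrative, not the actual torchtrain change):

```
import torch

def apply_rope_real(x: torch.Tensor, cos: torch.Tensor,
                    sin: torch.Tensor) -> torch.Tensor:
    # x: [..., seq, dim] with dim even; cos/sin: [seq, dim // 2].
    # Rotates interleaved (even, odd) feature pairs with real-valued
    # cos/sin instead of complex multiplication.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```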

Showcase context parallelism here once that feature is ready.

enhancement

I'm hitting an issue, though, in using/testing this, as the code seems to assume no parameter groups? (from utils.py)
```
def get_grad_list(params):
    return [p.grad for p in params]
```
This fails b/c p.grad...
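A sketch of one possible fix, flattening optimizer-style parameter groups before collecting grads; this is a guess at the intended behavior, not the repo's actual patch:

```
def get_grad_list(params):
    # Accept either a flat iterable of parameters or optimizer-style
    # parameter groups (dicts with a "params" key), flattening the latter.
    flat = []
    for p in params:
        if isinstance(p, dict):
            flat.extend(p["params"])
        else:
            flat.append(p)
    return [p.grad for p in flat]
```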