Less Wright

Results 44 issues of Less Wright

After PR https://github.com/pytorch/tau/pull/631 lands, add unit testing. Simple tests would involve fusion based on a set policy, then verifying output gradients and inspecting the graph.

Per Alisson - we can reduce memory overhead by creating the global buffer at first use (i.e. just before the first fusion) rather than the current instantiation at the...
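The create-at-first-use idea can be sketched in plain Python. This is only an illustration of lazy allocation, not the project's code: the `LazyBuffer` class, its `get` method, and the 1024-byte size are hypothetical names and values chosen for the example, and a `bytearray` stands in for the real global buffer.

```python
class LazyBuffer:
    """Hypothetical stand-in for a global fusion buffer, allocated on first use."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self._storage = None  # nothing is allocated at construction time

    @property
    def allocated(self) -> bool:
        return self._storage is not None

    def get(self) -> bytearray:
        # Allocate only when a caller first needs the buffer
        # (e.g. just before the first fusion), then reuse it.
        if self._storage is None:
            self._storage = bytearray(self.size_bytes)
        return self._storage


buffer = LazyBuffer(size_bytes=1024)  # assumed size, for illustration only
before = buffer.allocated             # no memory held yet
storage = buffer.get()                # first use triggers the allocation
after = buffer.allocated
```

With this shape, code paths that never perform a fusion never pay for the buffer, and later calls to `get` return the same storage.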

This is to track/investigate the issue reported by Rich Zhu, where using permute to generate a transposed tensor for nn.Linear results in an incorrect aten.expand call. I've found two potential...

After expansion of DTensor communication operations, fx inserts a clone operation to clone the gradient tensor. This operation will slow performance and increase memory use, but is technically...

While running the pytests for a recent PR, I was allocated a 3-GPU server rather than a 4-GPU one (presumably a bad GPU on a 4-GPU server, but unclear...

Implement a graph using torch.cat and convert it via SPMD. Received: raise NotImplementedError( NotImplementedError: Operator aten.cat.default does not have a DistributedTensor rule registered.) Code location: File "/home/ubuntu/graph/spmd/api.py", line 110, in _get_dtensor_dispatch_graph...

Hi - We're trying to consolidate on using TensorBoard but are sporadically hitting an issue where reports won't load and we instead just get a blank white screen. This occurs on multiple...

Per user request: we don't currently have any info on how to do this (basically, extend the hf_dataset file).

documentation
enhancement

This PR adds the option to compile only the norm layers, and is mainly targeted at RMSNorm. By compiling just the norm layers when using RMSNorm, we get...

CLA Signed

Tried to run the grouped GEMM tutorial on Hopper/H100: https://github.com/openai/triton/blob/main/python/tutorials/11-grouped-gemm.py I realize this is an experimental tutorial, but I hit this same error while working on another kernel and meant...