Olivia Lee
This is the training loss of the last step.
I think some ops should propagate their result value, rather than just the shape of their result, so that subsequent ops can work correctly during shape inference; for example, consider the...
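For illustration, here is a minimal, hypothetical sketch (not this PR's code) of why value propagation matters: if a `Shape` op only reports the shape of its output, a downstream `Reshape` that consumes that output cannot be inferred. `Node`, `infer_shapes`, and the op names are all illustrative.

```python
from collections import namedtuple

# Hypothetical IR node: a name, an op kind, and argument names.
Node = namedtuple("Node", ["name", "op", "args"])

def infer_shapes(graph, input_shape):
    shapes = {"input": input_shape}   # inferred shape per node
    values = {}                       # concrete value per node, when known
    for node in graph:
        if node.op == "Shape":
            src = shapes[node.args[0]]
            shapes[node.name] = (len(src),)  # the output is a 1-D int tensor
            values[node.name] = src          # propagate the value, not just its shape
        elif node.op == "Reshape":
            target = values.get(node.args[1])
            if target is None:
                raise ValueError(f"cannot infer Reshape: value of {node.args[1]} unknown")
            shapes[node.name] = target
    return shapes

# Shape feeds Reshape: inference succeeds only because Shape's *value* was kept.
graph = [Node("s", "Shape", ["input"]), Node("r", "Reshape", ["input", "s"])]
print(infer_shapes(graph, (2, 3)))  # {'input': (2, 3), 's': (2,), 'r': (2, 3)}
```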
As the title suggests, this PR attempts to add torch.compile support for Mistral. It is not ready to merge; it tries to replicate what has been done for Llama...
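As a hedged usage sketch of what such support enables (the model id is illustrative, and the static-cache setting follows the pattern used for Llama in transformers, which may differ from this PR's final API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# A static KV cache keeps tensor shapes fixed so the compiled graph can be reused.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0]))
```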
Hey, I am using model.save as you mentioned so that I could get a .pb file, but it turns out that I only get a file without any suffix, which is...
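For context, a minimal sketch of the likely situation, assuming TF 2.x with legacy Keras: `model.save` with a bare path writes the SavedModel format, which is a directory whose graph is stored in a `saved_model.pb` file inside it, rather than a single `.pb` file.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build(input_shape=(None, 4))

# A bare path produces a SavedModel *directory*; the protobuf graph is the
# saved_model.pb file inside it, not a standalone .pb at the given path.
model.save("exported_model")
# exported_model/
#   saved_model.pb
#   variables/...
```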
This PR is a work in progress; it tries to add torch.compile support for Mixtral. It currently also contains changes from #30642 because there is some common ground shared...
This PR fixes a scenario where we want to use dynamo tracing in training mode: the current attention-mask-ignore logic creates a problem, because the data-dependent branch condition `torch.all(attn_mask==1)` will...
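A minimal sketch of the problem pattern and one trace-friendly rewrite (illustrative, not the exact PR code):

```python
import torch

# Problematic pattern: the Python `if` needs a concrete bool, so dynamo must
# evaluate a tensor value at trace time (a graph break, or an error under fullgraph=True).
def attention(x, attn_mask):
    if torch.all(attn_mask == 1):   # data-dependent branch
        attn_mask = None            # mask-ignore fast path
    return x if attn_mask is None else x.masked_fill(attn_mask == 0, 0.0)

# One trace-friendly rewrite: apply the mask unconditionally, so the traced
# graph contains no branch that depends on tensor *values*.
def attention_traceable(x, attn_mask):
    return x.masked_fill(attn_mask == 0, 0.0)

compiled = torch.compile(attention_traceable, fullgraph=True)
x = torch.randn(2, 4)
mask = torch.ones(2, 4, dtype=torch.long)
print(compiled(x, mask))
```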
The parameter cache instance is needed to handle recompilation, where we need to make sure the parameters created in the first run are reused; currently the use case does...
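A minimal, hypothetical sketch of such a cache (the names are illustrative, not this codebase's API): parameters are looked up by key instead of being re-created, so a recompilation keeps training the tensors from the first run.

```python
import torch

class ParamCache:
    def __init__(self):
        self._params = {}

    def get_or_create(self, key, factory):
        # Reuse the parameter created on the first compilation pass, if any.
        if key not in self._params:
            self._params[key] = factory()
        return self._params[key]

cache = ParamCache()
w1 = cache.get_or_create("layer0.weight", lambda: torch.nn.Parameter(torch.randn(4, 4)))
w2 = cache.get_or_create("layer0.weight", lambda: torch.nn.Parameter(torch.randn(4, 4)))
assert w1 is w2  # the second "compilation" reuses the first run's parameter
```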
As per the title, this PR tries a more general approach rather than relying purely on human heuristics; basically, it uses the following steps to search for a possible parallelization strategy for...
# What does this PR do?
- [x] add backend abstraction
- [x] refactor the original pipeline flow to accommodate the potential needs of different backends
- [x] modify API so...