optimum
Modify Parallelization Strategy to Make it More General
As the title says, this PR takes a more general approach instead of relying purely on hand-written heuristics. It searches for a feasible parallelization strategy for a transformer model in the following steps:
- Use dynamo for graph tracing so that we get the graph to operate on
- Decompose and functionalize the traced graph so that we get a smaller op set to work with
- Apply parallel axis analysis and run a constrained backtracking search over the whole graph to find a feasible solution (not necessarily optimal)
- Replace ops in the original traced graph with their parallelized versions (Linear -> ColumnLinear/RowLinear)
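
To illustrate the search step, here is a minimal sketch of a constrained backtracking search over a toy chain of ops. It is not the code in this PR: the layout names (`replicated`, `sharded`), the `STRATEGIES` table, and the `search` function are hypothetical, and a real graph has branching data flow rather than a simple chain. The only constraints modeled are that each strategy's required input layout must match what the previous op produced, and that the final output must be replicated.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical strategy table: op kind -> strategy -> (required input, produced output).
# Candidates are tried in insertion order, so "column" is preferred for linear ops.
STRATEGIES: Dict[str, Dict[str, Tuple[str, str]]] = {
    "linear": {
        "column": ("replicated", "sharded"),        # split weight by output features
        "row": ("sharded", "replicated"),           # split by input features, then all-reduce
        "replicate": ("replicated", "replicated"),  # leave the op unparallelized
    },
    "elementwise": {
        "keep_sharded": ("sharded", "sharded"),
        "keep_replicated": ("replicated", "replicated"),
    },
}

Node = Tuple[str, str]  # (name, kind); each node consumes the previous node's output


def search(
    nodes: List[Node],
    index: int = 0,
    layout: str = "replicated",
    chosen: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return the first assignment that satisfies all constraints, or None."""
    chosen = {} if chosen is None else chosen
    if index == len(nodes):
        # Output constraint: the result of the whole chain must be replicated.
        return dict(chosen) if layout == "replicated" else None
    name, kind = nodes[index]
    for strategy, (needs, produces) in STRATEGIES[kind].items():
        if needs != layout:
            continue  # input layout constraint violated, try the next candidate
        chosen[name] = strategy
        result = search(nodes, index + 1, produces, chosen)
        if result is not None:
            return result
        del chosen[name]  # dead end further down the chain: backtrack
    return None


# A toy MLP followed by an extra projection.
mlp: List[Node] = [
    ("fc1", "linear"),
    ("act", "elementwise"),
    ("fc2", "linear"),
    ("head", "linear"),
]
plan = search(mlp)
print(plan)
# {'fc1': 'column', 'act': 'keep_sharded', 'fc2': 'row', 'head': 'replicate'}
```

In this toy run, `head` first tries `column`, which leaves the final output sharded and violates the output constraint, so the search backtracks and settles on `replicate`. The result is feasible but not necessarily optimal, which matches the "first solution found" behavior described above.
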
As for the API design, we drop support for passing custom modules and focus only on models in transformers, because supporting custom models is not a priority for now.