
[FEATURE]: tensor parallel microbenchmark changes to support microbenchmarking large models

Open · MEllis-github opened this issue 2 years ago · 0 comments

Describe the feature

Problem: The intrahost microbenchmarking CLI tool executes the "None" (DDP) strategy first; when that strategy runs out of GPU memory (OOMs), the microbenchmark aborts instead of proceeding to the tensor parallel strategies.

Desired solution/support: Intrahost tensor parallelism matters most for models that are large relative to available memory, so the tensor parallel microbenchmark should support benchmarking model sizes that do not fit in a single GPU's memory.

Potential fixes (not mutually exclusive)

  1. Wrap each strategy in a try-catch clause so that all strategies are attempted even if some fail with errors (e.g. OOM).
  2. Change the order in which the strategies are executed.
  3. Parameterize the strategy selection, and potentially also the sequence in which strategies are executed.
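A minimal sketch of how fixes 1 and 3 could combine: run a caller-supplied (and therefore reorderable) list of strategies, catching memory errors per strategy so one OOM does not abort the rest. All function names below are hypothetical stand-ins, not the actual microbenchmark API; the DDP stand-in simulates an OOM with a plain `MemoryError` for illustration.

```python
# Hypothetical stand-ins for the real per-strategy benchmark entry points.
def run_ddp(model_size_gb):
    # Simulated: pretend the full replica OOMs above 40 GB on one GPU.
    if model_size_gb > 40:
        raise MemoryError("CUDA out of memory (simulated)")
    return {"strategy": "ddp", "ok": True}

def run_tensor_parallel(model_size_gb):
    # Simulated: tensor parallelism shards the model, so it fits.
    return {"strategy": "tp", "ok": True}

def run_all(strategies, model_size_gb):
    """Run each (name, fn) strategy in the given order, recording failures
    instead of letting one OOM abort the remaining strategies."""
    results = {}
    for name, fn in strategies:
        try:
            results[name] = fn(model_size_gb)
        except MemoryError as exc:
            results[name] = {"strategy": name, "ok": False, "error": str(exc)}
    return results

# The strategy list itself is the parameterization point (fix 3):
# callers choose which strategies run and in what order (fix 2).
results = run_all(
    [("ddp", run_ddp), ("tp", run_tensor_parallel)],
    model_size_gb=80,
)
```

With this shape, the DDP OOM is recorded as a failed result while the tensor parallel strategy still runs to completion.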

MEllis-github · Mar 10 '23 19:03