NeMo icon indicating copy to clipboard operation
NeMo copied to clipboard

NeVa::forward - remove device syncs (torch.where) and vectorize over batch dimensions

Open nikitaved opened this issue 1 year ago • 11 comments

The following patch modifies the forward of the NeVa model (the replace_media_embeddings method) by introducing the following improvements:

  • Elimination of torch.where with a single argument. That call introduces device synchronizations when run on GPUs.
  • Media index selection manipulation in the construction of padded_media_indices can be done in a single vectorized call.
  • Reduction in the overall number of kernel launches.
  • Elimination of calls to all. This also causes device syncs. We can leave it, or run it in debug mode only. It is best to have a dedicated test instead of that assert, imho.

nikitaved avatar Jul 11 '24 12:07 nikitaved

@yaoyu-33 , could you please have another look and let the workflow be run?

nikitaved avatar Jul 19 '24 15:07 nikitaved

FWIW, I ran our little test (a neva_pretrain.py invocation) with NEMO_TESTING=1 and things ran fine there.

tfogal avatar Jul 19 '24 21:07 tfogal

@yaoyu-33 , @ericharper , could you please run the workflow again? Thank you!

nikitaved avatar Aug 06 '24 11:08 nikitaved

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Aug 22 '24 01:08 github-actions[bot]

bump to kill 'stale'. i checked that i have access to an internal cluster today, will check convergence soon

tfogal avatar Aug 22 '24 04:08 tfogal

Thanks you, @tfogal ! FWIW, the very last CI run turned out to be green

nikitaved avatar Aug 22 '24 10:08 nikitaved

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Sep 06 '24 01:09 github-actions[bot]

Is it possible to move it forward? Commenting to remove the current stale status.

nikitaved avatar Sep 12 '24 09:09 nikitaved

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Oct 09 '24 01:10 github-actions[bot]

Remove stale label or comment or update or this will be closed in 7 days.

comment

tfogal avatar Oct 09 '24 23:10 tfogal

@yaoyu-33 , could you please enable the workflow so that we could merge it once the tests are green?

nikitaved avatar Oct 17 '24 14:10 nikitaved

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Nov 01 '24 02:11 github-actions[bot]

" Wait it out Gonna wait it out Wait it out (be patient) " The Patient, Tool

nikitaved avatar Nov 05 '24 09:11 nikitaved

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Nov 21 '24 02:11 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Nov 28 '24 02:11 github-actions[bot]

Is there an interest in reviving this PR? I assume it is rather low impact with perf benefits assuming the tests are green. And green they were the last time CIs were run.

nikitaved avatar Nov 28 '24 10:11 nikitaved