NeVa::forward - remove device syncs (torch.where) and vectorize over batch dimensions
The following patch modifies the forward of the NeVa model (the replace_media_embeddings method) by introducing the following improvements:
- Elimination of `torch.where` with a single argument. That call introduces device synchronizations when run on GPUs.
- Media index selection manipulation in the construction of `padded_media_indices` can be done in a single vectorized call.
- Reduction in the overall number of kernel launches.
- Elimination of calls to `all`. This also causes device syncs. We can leave it, or run it in debug mode only. It is best to have a dedicated test instead of that assert, imho.
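To illustrate the first two points, here is a minimal sketch, not the code from this patch. The function names, `media_token_id`, `max_media`, and `pad_value` are hypothetical and chosen only for the example. Single-argument `torch.where` behaves like `torch.nonzero(..., as_tuple=True)`, so its output size depends on the data, which is why it needs a host sync on CUDA. The vectorized variant below uses only fixed-shape ops and handles the whole batch at once; it assumes `max_media <= seq_len`.

```python
import torch


def media_indices_loop(input_ids, media_token_id, max_media, pad_value):
    """Reference version: one single-argument torch.where per batch row."""
    rows = []
    for row in input_ids:
        # The result length depends on the data (this is the sync-prone call).
        idx = torch.where(row == media_token_id)[0][:max_media]
        pad = torch.full(
            (max_media - idx.numel(),), pad_value,
            dtype=idx.dtype, device=idx.device,
        )
        rows.append(torch.cat([idx, pad]))
    return torch.stack(rows)


def media_indices_vectorized(input_ids, media_token_id, max_media, pad_value):
    """Fixed-shape version: no data-dependent output sizes, no Python loop."""
    batch, seq_len = input_ids.shape
    mask = input_ids == media_token_id
    pos = torch.arange(seq_len, device=input_ids.device).expand(batch, seq_len)
    # Non-media positions get the sentinel seq_len, so they sort to the end.
    keys = pos.masked_fill(~mask, seq_len)
    first = torch.sort(keys, dim=1).values[:, :max_media]
    # Replace the sentinel with the requested padding value.
    return first.masked_fill(first == seq_len, pad_value)
```

Both functions return a `(batch, max_media)` tensor of positions, padded with `pad_value` where a row has fewer media tokens. The sort-based trick is just one way to get a fixed-shape result; the patch itself may do this differently.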
@yaoyu-33 , could you please have another look and let the workflow be run?
FWIW, I ran our little test (a neva_pretrain.py invocation) with NEMO_TESTING=1 and things ran fine there.
@yaoyu-33 , @ericharper , could you please run the workflow again? Thank you!
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
bump to kill 'stale'. i checked that i have access to an internal cluster today, will check convergence soon
Thank you, @tfogal! FWIW, the very last CI run turned out to be green.
Is it possible to move it forward? Commenting to remove the current stale status.
Remove stale label or comment or update or this will be closed in 7 days.
comment
@yaoyu-33 , could you please enable the workflow so that we could merge it once the tests are green?
" Wait it out Gonna wait it out Wait it out (be patient) " The Patient, Tool
This PR was closed because it has been inactive for 7 days since being marked as stale.
Is there any interest in reviving this PR? I believe it is rather low impact and brings perf benefits, provided the tests are green. And green they were the last time CI was run.