
Add SpeechLM to main

Open stevehuang52 opened this issue 11 months ago • 10 comments

What does this PR do ?

Add SpeechLLM training/inference scripts to NeMo, along with the dataset and model classes, examples, and tests.

Main features

  • Model class for the SALM-style architecture, which supports SFT & PEFT (see the conceptual sketch below).
  • Auxiliary modules to support multi-layer feature extraction and multiple audio encoders.
  • Dataset class for audio-text question-answering tasks (generalized to any audio-to-text task).
  • Detailed examples of training and evaluating SpeechLLMs.
  • Minor updates to Megatron code to work with SpeechLLM, removing some hard assumptions (e.g., assert, strict=True), plus minor updates to data utils that move dict data to CUDA and split it into micro-batches.
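
For reviewers unfamiliar with the SALM-style setup, here is a minimal conceptual sketch of how such a model combines an audio encoder with an LLM. The class and parameter names are illustrative only and do not reflect the actual NeMo API:

```python
import torch
import torch.nn as nn


class ToySpeechLM(nn.Module):
    """Toy illustration: audio encoder -> modality adapter -> decoder-only LLM."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module, audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder                     # e.g., a pretrained ASR encoder
        self.modality_adapter = nn.Linear(audio_dim, llm_dim)  # projects audio features into the LLM space
        self.llm = llm                                         # LLM trained with SFT or PEFT adapters

    def forward(self, audio_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Project encoded audio into the LLM embedding space, then prepend it to
        # the embedded text prompt so the LLM attends over both modalities.
        audio_embeds = self.modality_adapter(self.audio_encoder(audio_features))
        return self.llm(torch.cat([audio_embeds, text_embeddings], dim=1))
```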

Collection: [common,nlp,multimodal]

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

stevehuang52 avatar Mar 25 '24 15:03 stevehuang52

@titu1994 @nithinraok could you please take another look to see if your comments have been addressed? Thanks~

stevehuang52 avatar Apr 10 '24 22:04 stevehuang52

Steve, can you look at the CodeQL comments?

nithinraok avatar Apr 11 '24 16:04 nithinraok

@titu1994 @zhehuaichen I've refactored the dataset so that the input and output keys can be configured dynamically by setting context_key and answer_key for the dataset. For example, to use input_text and output_text as the text input and output keys in the manifest, set context_key='input_text' and answer_key='output_text'. The defaults are context and answer, and I also added a backward-compatibility check for the old question field.
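
To make the key handling concrete, here is a hedged sketch of the lookup logic. The helper name get_context_and_answer is hypothetical; only the field names (context_key, answer_key, and the legacy question field) come from the comment above:

```python
from typing import Dict, Tuple


def get_context_and_answer(
    sample: Dict[str, str],
    context_key: str = "context",
    answer_key: str = "answer",
) -> Tuple[str, str]:
    """Pick the text input/output fields from a single manifest entry."""
    if context_key in sample:
        context = sample[context_key]
    elif "question" in sample:
        # Backward compatibility with older manifests that used the `question` field.
        context = sample["question"]
    else:
        raise KeyError(f"Manifest entry has neither '{context_key}' nor 'question'")
    return context, sample[answer_key]


# Example manifest entry that stores text under `input_text` / `output_text`.
entry = {"audio_filepath": "utt1.wav", "input_text": "Transcribe the audio.", "output_text": "hello world"}
print(get_context_and_answer(entry, context_key="input_text", answer_key="output_text"))
```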

stevehuang52 avatar Apr 17 '24 18:04 stevehuang52

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

stevehuang52 avatar Apr 17 '24 18:04 stevehuang52

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

Sounds good on removing it, but is it possible to still include that part of the training in the release ckpt?

zhehuaichen avatar Apr 17 '24 23:04 zhehuaichen

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

Sounds good on removing it, but is it possible to still include that part of the training in the release ckpt?

@zhehuaichen Yes, the checkpoint trained for release has that included.

stevehuang52 avatar Apr 18 '24 13:04 stevehuang52

jenkins

stevehuang52 avatar May 06 '24 01:05 stevehuang52

@titu1994 how do I invoke the CI tests? I tried jenkins but it didn't seem to work...

stevehuang52 avatar May 06 '24 09:05 stevehuang52

@aklife97 Could you please review the changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors (see the sketch below).
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.
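
Here is a hedged sketch of both changes; the helper names split_batch_k and load_adapter_state are hypothetical stand-ins, and the real NeMo/Megatron code may differ in detail:

```python
import logging
from typing import Dict, Iterator, List, Union

import torch


def split_batch_k(batch: Dict[str, Union[torch.Tensor, List]], k: int) -> Iterator[Dict]:
    """Split every entry of `batch` into k micro-batches.

    Tensors are split along dim 0 as before; non-tensor sequences
    (e.g., lists of strings) are sliced into k chunks.
    """
    split = {}
    for key, value in batch.items():
        if isinstance(value, torch.Tensor):
            split[key] = torch.tensor_split(value, k, dim=0)
        else:
            n = len(value) // k
            split[key] = [value[i * n:(i + 1) * n] for i in range(k)]
    return (dict(zip(split.keys(), values)) for values in zip(*split.values()))


def load_adapter_state(module: torch.nn.Module, state_dict: Dict[str, torch.Tensor]) -> None:
    """Warn (instead of asserting) when the checkpoint carries extra keys,
    e.g., an ASR encoder stored alongside the GPT adapter weights."""
    missing, unexpected = module.load_state_dict(state_dict, strict=False)
    if missing or unexpected:
        logging.warning(f"Adapter state mismatch: missing={missing}, unexpected={unexpected}")
```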

Please let me know if you have any questions, thanks~!

stevehuang52 avatar May 07 '24 00:05 stevehuang52

Hi @ericharper, could you please help review (or assign someone else available to review) the small changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors.
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.

Please let me know if you have any questions, thanks~!

stevehuang52 avatar May 09 '24 01:05 stevehuang52

@arendu, as suggested by Abhinav, could you please help review the small changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors.
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.

Please let us know if you have any questions, thanks~!

zhehuaichen avatar May 10 '24 05:05 zhehuaichen