
Add SpeechLM to main

Open stevehuang52 opened this issue 11 months ago • 10 comments

What does this PR do ?

Add SpeechLLM training/inference scripts to NeMo, along with the dataset and model classes, examples, and tests.

Main features

  • Model class for the SALM-style architecture, which supports SFT & PEFT (see the conceptual sketch below).
  • Auxiliary modules to support multi-layer feature extraction and multiple audio encoders.
  • Dataset class for audio-text question-answering tasks (generalized to any audio-to-text task).
  • Detailed examples of training and evaluating SpeechLLMs.
  • Minor updates to Megatron code to work with SpeechLLM, removing some hard assumptions (e.g., assert, strict=True), plus minor updates to data utils that move dict data to CUDA and split it into micro-batches.
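
For reviewers unfamiliar with the SALM-style setup, here is a minimal conceptual sketch of how such a model combines an audio encoder with an LLM. The class and parameter names are illustrative only and do not reflect the actual NeMo API:

```python
import torch
import torch.nn as nn


class ToySpeechLM(nn.Module):
    """Toy illustration: audio encoder -> modality adapter -> decoder-only LLM."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module, audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder                     # e.g., a pretrained ASR encoder
        self.modality_adapter = nn.Linear(audio_dim, llm_dim)  # projects audio features into the LLM space
        self.llm = llm                                         # LLM trained with SFT or PEFT adapters

    def forward(self, audio_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Project encoded audio into the LLM embedding space, then prepend it to
        # the embedded text prompt so the LLM attends over both modalities.
        audio_embeds = self.modality_adapter(self.audio_encoder(audio_features))
        return self.llm(torch.cat([audio_embeds, text_embeddings], dim=1))
```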

Collection: [common,nlp,multimodal]

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

stevehuang52 avatar Mar 25 '24 15:03 stevehuang52

@titu1994 @nithinraok could you please take another look to see if your comments have been addressed? Thanks~

stevehuang52 avatar Apr 10 '24 22:04 stevehuang52

Steve, can you look at the CodeQL comments?

nithinraok avatar Apr 11 '24 16:04 nithinraok

@titu1994 @zhehuaichen I've refactored the dataset so that the input and output keys can be configured dynamically by setting context_key and answer_key for the dataset. For example, to use input_text and output_text as the text input and output keys in the manifest, set context_key='input_text' and answer_key='output_text'. The defaults are context and answer, and I also added a backward-compatibility check for the old question field.
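
To make the key handling concrete, here is a hedged sketch of the lookup logic. The helper name get_context_and_answer is hypothetical; only the field names (context_key, answer_key, and the legacy question field) come from the comment above:

```python
from typing import Dict, Tuple


def get_context_and_answer(
    sample: Dict[str, str],
    context_key: str = "context",
    answer_key: str = "answer",
) -> Tuple[str, str]:
    """Pick the text input/output fields from a single manifest entry."""
    if context_key in sample:
        context = sample[context_key]
    elif "question" in sample:
        # Backward compatibility with older manifests that used the `question` field.
        context = sample["question"]
    else:
        raise KeyError(f"Manifest entry has neither '{context_key}' nor 'question'")
    return context, sample[answer_key]


# Example manifest entry that stores text under `input_text` / `output_text`.
entry = {"audio_filepath": "utt1.wav", "input_text": "Transcribe the audio.", "output_text": "hello world"}
print(get_context_and_answer(entry, context_key="input_text", answer_key="output_text"))
```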

stevehuang52 avatar Apr 17 '24 18:04 stevehuang52

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

stevehuang52 avatar Apr 17 '24 18:04 stevehuang52

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

Sounds good on removing it, but is it possible to still include that part of the training in the release ckpt?

zhehuaichen avatar Apr 17 '24 23:04 zhehuaichen

@zhehuaichen FYI, I removed the random context training trick from the dataset, since it only makes sense for word-boosting and not for other tasks. It's better to actually generate those word-boosting manifests than to use a trick that may hurt other tasks.

Sounds good on removing it, but is it possible to still include that part of the training in the release ckpt?

@zhehuaichen Yes, the checkpoint trained for release has that included.

stevehuang52 avatar Apr 18 '24 13:04 stevehuang52

jenkins

stevehuang52 avatar May 06 '24 01:05 stevehuang52

@titu1994 how do I invoke the CI tests? I tried jenkins but it didn't seem to work...

stevehuang52 avatar May 06 '24 09:05 stevehuang52

@aklife97 Could you please review the changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors (see the sketch below).
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.
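
Here is a hedged sketch of both changes; the helper names split_batch_k and load_adapter_state are hypothetical stand-ins, and the real NeMo/Megatron code may differ in detail:

```python
import logging
from typing import Dict, Iterator, List, Union

import torch


def split_batch_k(batch: Dict[str, Union[torch.Tensor, List]], k: int) -> Iterator[Dict]:
    """Split every entry of `batch` into k micro-batches.

    Tensors are split along dim 0 as before; non-tensor sequences
    (e.g., lists of strings) are sliced into k chunks.
    """
    split = {}
    for key, value in batch.items():
        if isinstance(value, torch.Tensor):
            split[key] = torch.tensor_split(value, k, dim=0)
        else:
            n = len(value) // k
            split[key] = [value[i * n:(i + 1) * n] for i in range(k)]
    return (dict(zip(split.keys(), values)) for values in zip(*split.values()))


def load_adapter_state(module: torch.nn.Module, state_dict: Dict[str, torch.Tensor]) -> None:
    """Warn (instead of asserting) when the checkpoint carries extra keys,
    e.g., an ASR encoder stored alongside the GPT adapter weights."""
    missing, unexpected = module.load_state_dict(state_dict, strict=False)
    if missing or unexpected:
        logging.warning(f"Adapter state mismatch: missing={missing}, unexpected={unexpected}")
```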

Please let me know if you have any questions, thanks~!

stevehuang52 avatar May 07 '24 00:05 stevehuang52

Hi @ericharper, could you please help review (or assign someone else available to review) the small changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors.
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.

Please let me know if you have any questions, thanks~!

stevehuang52 avatar May 09 '24 01:05 stevehuang52

@arendu, as suggested by Abhinav, could you please help review the small changes to the NLP collection? There are mainly two changes:

  • Modifying get_iterator_k_split to support splitting non-tensor objects (e.g., lists); the behavior is the same as before if the batch contains only tensors.
  • Changing a hard assert to a warning when loading adapters whose params differ from the actual adapters in the LLM. This is needed because we store the ASR encoder in the same checkpoint as the GPT adapters, which leads to extra params when loading the adapter checkpoint for the GPT adapter.

Please let us know if you have any questions, thanks~!

zhehuaichen avatar May 10 '24 05:05 zhehuaichen