TensorFlowASR
Support for Streaming Conformer Transducer
This PR is an attempt at adding support for the Streaming Conformer Transducer network.
The changes that have been identified are:
1. `DepthwiseConv2D` inside `ConvModule` needs to have `padding='causal'`.
2. The MHSA layer must receive a mask that indicates which chunks to use for each timestep.
   2.1. A parameter for `history_window_size` needs to be added to the config and dataset preprocessing (a minimal mask sketch follows below).
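To make item 2 / 2.1 concrete, here is a minimal sketch of the kind of chunk/history mask the dataset could pre-compute. Argument names such as `chunk_size` are illustrative assumptions, not the PR's actual API:

```python
import tensorflow as tf


def create_chunk_history_mask(num_frames, chunk_size, history_window_size):
    """Boolean [num_frames, num_frames] mask: each frame may attend to every frame in
    its own chunk plus the previous `history_window_size` chunks, and nothing ahead."""
    chunk_ids = tf.range(num_frames) // chunk_size          # chunk index of every frame
    query_chunks = chunk_ids[:, tf.newaxis]                 # [T, 1]
    key_chunks = chunk_ids[tf.newaxis, :]                   # [1, T]
    not_future = key_chunks <= query_chunks                 # no look-ahead past the current chunk
    within_history = key_chunks >= query_chunks - history_window_size
    return tf.logical_and(not_future, within_history)       # [T, T] boolean mask


# e.g. 8 frames, chunks of 2 frames, 1 chunk of history
print(create_chunk_history_mask(8, chunk_size=2, history_window_size=1))
```

The mask would then be shipped alongside the features so the MHSA layer only attends to the current chunk plus a fixed number of past chunks.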
- [x] Create `StreamingConformer` class
- [x] Set `ConvModule` -> `DepthwiseConv2D` `padding` to `causal` when `streaming=True`
- [x] Create `ASRMaskedDataset`. It must compute the required mask using `history_window_size`.
- [x] Pass `mask` to MHSA layer (Create wrapper `StreamingConformerEncoder` for this?).
- [x] Create custom `DepthwiseConv1D` with support for `padding='causal'`
- [x] Create `ASRMaskedTFRecordDataset` for working with TFRecords.
- [x] Load `time_reduction_factor` into `ASRMaskedDataset` dynamically. (currently hardcoded)
- [x] Add pure TF mask creation function with optional pre-compute.
- [ ] Clean up `StreamingConformer` class. Remove unnecessary methods copied from `StreamingTransducer`.
- [ ] Use correct optimizer and hyperparameters
- [ ] Inference???
- [ ] Cleanup

Deferred:
- [ ] Create `MaskedTransducerTrainerGA` for working with gradient accumulation. (GA is no longer supported)
All comments and edits are welcome.
This PR is aimed at advancing the TODO list in #14.
@usimarit Hello! I've run into a problem. The Streaming Conformer Transducer (SCT) paper states that we need to convert the depthwise convolution inside the "ConvModule" to a "causal" depthwise convolution. However, this requires it to be a Conv1D, not a Conv2D. Take a look at: tensorflow_asr/models/conformer.py#L158
My question is... why is a Conv2D being used? I've double-checked with the original Conformer paper and it's supposed to be a Conv1D. Maybe there's something that I'm missing.
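(For reference, a `DepthwiseConv2D` with a `(k, 1)` kernel applied to an input expanded to `[batch, time, 1, channels]` does behave like a depthwise 1D convolution over time, so maybe that's the reason it works at all. A quick check, with assumed shapes that are not the repo's actual tensors:)

```python
import tensorflow as tf

x = tf.random.normal([2, 100, 144])                       # [batch, time, channels]
x4d = tf.expand_dims(x, axis=2)                           # [batch, time, 1, channels]

# The (32, 1) kernel slides only along the time axis, filtering each channel
# independently -- i.e. it acts as a depthwise 1D convolution over time.
dw2d = tf.keras.layers.DepthwiseConv2D(kernel_size=(32, 1), padding="same")
y = tf.squeeze(dw2d(x4d), axis=2)                         # back to [batch, time, channels]
print(y.shape)                                            # (2, 100, 144)
```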
I noticed you're using a `DepthwiseConv2D` with `kernel_size=(32, 1)`.
Would you consider replacing this by a `SeparableConv1D` with `kernel_size=(32)`?
We would also need to specify a value for `filters`; I'd guess `filters=input_dim` would be good enough, no? :man_shrugging:
Here's what that would look like:
```python
self.dw_conv = tf.keras.layers.SeparableConv1D(
    filters=input_dim,
    kernel_size=kernel_size,
    strides=1,
    padding="same" if not streaming else "causal",
    name=f"{name}_dw_conv",
    depth_multiplier=depth_multiplier,
    depthwise_regularizer=kernel_regularizer,
    bias_regularizer=bias_regularizer,
)
```
Good news @usimarit, most of the initial work is done. The model is now trainable, though a few things are still missing. Please take a look and let me know what you think :smiley:
I had to modify the base `Conformer` class, but the changes should not affect anything else.
> I noticed you're using a `DepthwiseConv2D` with `kernel_size=(32,1)`. Would you consider replacing this by a `SeparableConv1D` with `kernel_size=(32)`? We would also need to specify a value for `filters`, I'd guess `filters=input_dim` would be good enough, no? 🤷♂️ […]
Sorry for the late reply. `SeparableConv1D` is a `DepthwiseConv2D` combined with a `Conv1D` after it, so the architecture would be wrong if you apply `SeparableConv1D`.
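A quick way to see the difference (illustrative dimensions, not the repo's code): inspecting the weights of a `SeparableConv1D` shows the extra pointwise kernel that a depthwise-only convolution does not have.

```python
import tensorflow as tf

# SeparableConv1D = depthwise conv + an extra pointwise (1x1) conv that mixes channels,
# so it is not a drop-in replacement for a depthwise-only convolution.
sep = tf.keras.layers.SeparableConv1D(filters=144, kernel_size=32, padding="same")
_ = sep(tf.random.normal([2, 100, 144]))
for w in sep.weights:
    print(w.name, w.shape)
# depthwise_kernel: (32, 144, 1)   -- per-channel temporal filter
# pointwise_kernel: (1, 144, 144)  -- additional channel-mixing 1x1 convolution
# bias:             (144,)
```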
@andreselizondo-adestech We will have a big change in the repo structure, as in the PR https://github.com/TensorSpeech/TensorFlowASR/pull/177. Please be aware of that 😄 These changes will split the conformer file into an encoder file and a model file, like this. I'm about to finish that PR, so you will have to pull main, create a new branch, and cherry-pick what you've done into the new structure 😄
Understood, I'll look into the new format :)
Regarding the SeparableConv1D: I now see what you mean. It seems odd to me that DepthwiseConv1D only exists as part of that combined layer. This means the implementation is supported internally, it's just not exposed for us to use.
I found this issue/PR (https://github.com/tensorflow/tensorflow/issues/48557) on the TensorFlow repository. They intend to add support for the layer we need. However, the issue was opened less than 24 hrs ago, so we'll have to wait and see how long it takes to be released into tf-nightly.
@andreselizondo-adestech We can build our own `DepthwiseConv1D` 😄 no need to wait until TensorFlow supports it.
@andreselizondo-adestech The refactor PR is merged 😄
@usimarit I'm merging my changes into the refactored code; however, there appears to be an issue using `SentencePiece` for training.
Specifically at tensorflow_asr/featurizers/text_featurizers.py#L342: this function seems to be nowhere to be found, and at the same time the default value for `model` in that function is `None`.
So when it is called from examples/conformer/train.py#L68, `model` is not specified and is therefore `None`.
@andreselizondo-adestech Ah yeah, I missed that part, I'll update it.
@usimarit I've adapted my changes to the refactored repo and everything seems to be working. The next step is to create our own implementation of `DepthwiseConv1D`.
I've been digging into how TF does `SeparableConv1D`, but it just calls `SeparableConv2D` (similar to how you did it).
So I looked into `SeparableConv2D` and `DepthwiseConv2D`, but I couldn't find the implementation for this TF operation.
Could you help me out with this?
@usimarit I took an implementation of `DepthwiseConv1D` I found online and made slight modifications to support causal padding. (Check commit: 50def47)
The model compiles and trains correctly, but we still need to check if it's mathematically equivalent.
This implementation also uses `DepthwiseConv2D` internally, but it first adds padding to the inputs to make them causal.
```python
if self.padding == 'causal':
    inputs = array_ops.pad(inputs, self._compute_causal_padding(inputs))
```
Seems to me like this should work just fine.
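For reference, here is a minimal sketch of the general idea (illustrative only, not the actual code in 50def47): left-pad the time axis by `(kernel_size - 1) * dilation_rate` and run the depthwise convolution with `padding='valid'`, so no frame sees future context.

```python
import tensorflow as tf


def causal_depthwise_conv1d(inputs, kernel_size=32, dilation_rate=1):
    """Left-pad the time axis so the convolution never sees future frames, then run a
    DepthwiseConv2D with padding='valid' on the input expanded to [B, T, 1, C]."""
    left_pad = dilation_rate * (kernel_size - 1)
    padded = tf.pad(inputs, [[0, 0], [left_pad, 0], [0, 0]])    # pad the past, not the future
    padded = tf.expand_dims(padded, axis=2)                     # [batch, time + pad, 1, channels]
    dw = tf.keras.layers.DepthwiseConv2D((kernel_size, 1), padding="valid")
    return tf.squeeze(dw(padded), axis=2)                       # [batch, time, channels]


print(causal_depthwise_conv1d(tf.random.normal([2, 100, 144])).shape)  # (2, 100, 144)
```

In a real layer the `DepthwiseConv2D` would of course be created once in `__init__` rather than inside the call; the padding arithmetic is the part that matters here.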
> @usimarit I've adapted my changes to the refactored repo and everything seems to be working. The next step is to create our own implementation of `DepthwiseConv1D`. I've been digging into how TF does `SeparableConv1D`, but it just calls `SeparableConv2D` (similar to how you did it). So I looked into `SeparableConv2D` and `DepthwiseConv2D`, but I couldn't find the implementation for this TF operation. Could you help me out with this?
Seems like it's from the TF C/C++ library 😄
I'm currently running a test on two VMs: Regular Conformer vs. DepthwiseConv1D Conformer. We'll see the results in maybe ~30 hrs. (I am training on the CommonVoice2 dataset though, so WER results won't be directly comparable to the paper.)
@usimarit In the meantime, I'm not sure how inference should work for the Streaming Conformer. Can you guide me? Do you see something that's missing in the PR?
@usimarit Good news! The two Conformer models converge to the same CER, meaning performance was not impacted negatively by the custom DepthwiseConv1D layer. I trained on chars and the best CER I got was ~5.2. I'll be training on subwords shortly.
In the meantime, I think we should look at how to do streaming inference on the Streaming Conformer Transducer.
@usimarit The next step is looking at the `StreamingConformer` class.
I based the class on the `StreamingTransducer` class, so I don't know if there are methods that should be different.
Mind requesting any changes necessary? Or you could also explain to me how it should work.
> @usimarit The next step is looking at the `StreamingConformer` class. I based the class on the `StreamingTransducer` class, so I don't know if there are methods that should be different. Mind requesting any changes necessary? Or you could also explain to me how it should work.
I haven't had time to dive into how the StreamingConformer works in inference mode, but I think it's quite different from the `RnnTransducer` (previously `StreamingTransducer`). I'll try to make time for this.
But anyway, we should complete the whole pipeline (training, inference, testing, tflite) before merging 😄
@usimarit Hey there, this is just a gentle ping. Do you have anything that might guide me on implementing inference for Streaming Conformer? :)
@andreselizondo-adestech Sorry, I'm currently a bit busy until the end of July. So after that, I can go back to support this feature 😄
Hello @usimarit Are you still interested in implementing this? Let me know if you need my help :smile:
@andreselizondo-adestech Of course, I'll find some free time to help implement the inference for this.
In the meantime, can you help me resolve the conflicts? They're just conflicts about code format and imports: I changed from `autopep8` to `black` and switched to absolute imports instead of relative imports (which is recommended).
@andreselizondo-adestech Hi, are you still working on this? I think we should compute the attention mask in the MHA layer instead of in the dataset, which is kind of similar to the causal attention masking in the v2.x version of tfasr, but with history truncation and a limited future window.
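Something along these lines, maybe (a rough sketch only: it uses the stock `tf.keras.layers.MultiHeadAttention` for illustration rather than the repo's relative-position MHSA, and the window parameters are assumptions):

```python
import tensorflow as tf


def history_limited_mask(num_frames, history_window_size, future_window_size=0):
    """Boolean [T, T] mask: each frame attends to at most `history_window_size` past
    frames and `future_window_size` future frames."""
    q = tf.range(num_frames)[:, tf.newaxis]
    k = tf.range(num_frames)[tf.newaxis, :]
    return tf.logical_and(k >= q - history_window_size, k <= q + future_window_size)


class HistoryLimitedMHSA(tf.keras.layers.Layer):
    """Hypothetical wrapper: the mask is built per call from the runtime sequence
    length instead of being pre-computed in the dataset."""

    def __init__(self, num_heads, key_dim, history_window_size, future_window_size=0, **kwargs):
        super().__init__(**kwargs)
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
        self.history_window_size = history_window_size
        self.future_window_size = future_window_size

    def call(self, inputs, training=False):
        mask = history_limited_mask(
            tf.shape(inputs)[1], self.history_window_size, self.future_window_size
        )
        return self.mha(inputs, inputs, attention_mask=mask, training=training)
```

The point is just that the mask depends only on the sequence length and the two window sizes, so it can be built inside the layer at call time instead of being shipped with each example from the dataset.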