TensorFlowASR
Support for Streaming Conformer Transducer
This PR is an attempt at adding support for the Streaming Conformer Transducer network.
The changes that have been identified are:
1. `DepthwiseConv2D` inside `ConvModule` needs to have `padding='causal'`.
2. The MHSA layer must receive a mask that indicates which chunks to use for each timestep.
   2.1. A parameter for `history_window_size` needs to be added to the config and dataset preprocessing (a minimal mask sketch follows below).
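To make item 2 / 2.1 concrete, here is a minimal sketch of the kind of chunk/history mask the dataset could pre-compute. Argument names such as `chunk_size` are illustrative assumptions, not the PR's actual API:

```python
import tensorflow as tf


def create_chunk_history_mask(num_frames, chunk_size, history_window_size):
    """Boolean [num_frames, num_frames] mask: each frame may attend to every frame in
    its own chunk plus the previous `history_window_size` chunks, and nothing ahead."""
    chunk_ids = tf.range(num_frames) // chunk_size          # chunk index of every frame
    query_chunks = chunk_ids[:, tf.newaxis]                 # [T, 1]
    key_chunks = chunk_ids[tf.newaxis, :]                   # [1, T]
    not_future = key_chunks <= query_chunks                 # no look-ahead past the current chunk
    within_history = key_chunks >= query_chunks - history_window_size
    return tf.logical_and(not_future, within_history)       # [T, T] boolean mask


# e.g. 8 frames, chunks of 2 frames, 1 chunk of history
print(create_chunk_history_mask(8, chunk_size=2, history_window_size=1))
```

The mask would then be shipped alongside the features so the MHSA layer only attends to the current chunk plus a fixed number of past chunks.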
- [x] Create `StreamingConformer` class
- [x] Set `ConvModule` -> `DepthwiseConv2D` `padding` to `causal` when `streaming=True`
- [x] Create `ASRMaskedDataset`. It must compute the required mask using `history_window_size`.
- [x] Pass `mask` to MHSA layer (Create wrapper `StreamingConformerEncoder` for this?).
- [x] Create custom `DepthwiseConv1D` with support for `padding='causal'`
- [x] Create `ASRMaskedTFRecordDataset` for working with TFRecords.
- [x] Load `time_reduction_factor` into `ASRMaskedDataset` dynamically. (currently hardcoded)
- [x] Add pure TF mask creation function with optional pre-compute.
- [ ] Clean up `StreamingConformer` class. Remove unnecessary methods copied from `StreamingTransducer`.
- [ ] Use correct optimizer and hyperparameters
- [ ] Inference???
- [ ] Cleanup

Deferred:
- [ ] Create `MaskedTransducerTrainerGA` for working with gradient accumulation. (GA is no longer supported)
All comments and edits are welcome.
This PR is aimed at advancing the TODO list in #14.
@usimarit Hello! I've run into a problem. The Streaming Conformer Transducer (SCT) paper states that we need to convert the depthwise convolution inside the "ConvModule" to a "causal" depthwise convolution. However, this requires it to be a Conv1D, not a Conv2D. Take a look at: tensorflow_asr/models/conformer.py#L158
My question is... why is a Conv2D being used? I've double-checked with the original Conformer paper and it's supposed to be a Conv1D. Maybe there's something that I'm missing.
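(For reference, a `DepthwiseConv2D` with a `(k, 1)` kernel applied to an input expanded to `[batch, time, 1, channels]` does behave like a depthwise 1D convolution over time, so maybe that's the reason it works at all. A quick check, with assumed shapes that are not the repo's actual tensors:)

```python
import tensorflow as tf

x = tf.random.normal([2, 100, 144])                       # [batch, time, channels]
x4d = tf.expand_dims(x, axis=2)                           # [batch, time, 1, channels]

# The (32, 1) kernel slides only along the time axis, filtering each channel
# independently -- i.e. it acts as a depthwise 1D convolution over time.
dw2d = tf.keras.layers.DepthwiseConv2D(kernel_size=(32, 1), padding="same")
y = tf.squeeze(dw2d(x4d), axis=2)                         # back to [batch, time, channels]
print(y.shape)                                            # (2, 100, 144)
```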
I noticed you're using a `DepthwiseConv2D` with `kernel_size=(32, 1)`.
Would you consider replacing this by a `SeparableConv1D` with `kernel_size=(32)`?
We would also need to specify a value for `filters`; I'd guess `filters=input_dim` would be good enough, no? :man_shrugging:
Here's what that would look like:
```python
self.dw_conv = tf.keras.layers.SeparableConv1D(
    filters=input_dim,
    kernel_size=kernel_size,
    strides=1,
    padding="same" if not streaming else "causal",
    name=f"{name}_dw_conv",
    depth_multiplier=depth_multiplier,
    depthwise_regularizer=kernel_regularizer,
    bias_regularizer=bias_regularizer,
)
```
Good news @usimarit, most of the initial work is done. The model is now trainable, though a few things are still missing. Please take a look and let me know what you think :smiley:
I had to modify the base `Conformer` class, but the changes should not affect anything else.
> I noticed you're using a `DepthwiseConv2D` with `kernel_size=(32,1)`. Would you consider replacing this by a `SeparableConv1D` with `kernel_size=(32)`? We would also need to specify a value for `filters`, I'd guess `filters=input_dim` would be good enough, no? 🤷♂️ […]
Sorry for the late reply. `SeparableConv1D` is a `DepthwiseConv2D` combined with a `Conv1D` after it, so the architecture would be wrong if you apply `SeparableConv1D`.
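A quick way to see the difference (illustrative dimensions, not the repo's code): inspecting the weights of a `SeparableConv1D` shows the extra pointwise kernel that a depthwise-only convolution does not have.

```python
import tensorflow as tf

# SeparableConv1D = depthwise conv + an extra pointwise (1x1) conv that mixes channels,
# so it is not a drop-in replacement for a depthwise-only convolution.
sep = tf.keras.layers.SeparableConv1D(filters=144, kernel_size=32, padding="same")
_ = sep(tf.random.normal([2, 100, 144]))
for w in sep.weights:
    print(w.name, w.shape)
# depthwise_kernel: (32, 144, 1)   -- per-channel temporal filter
# pointwise_kernel: (1, 144, 144)  -- additional channel-mixing 1x1 convolution
# bias:             (144,)
```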
@andreselizondo-adestech We will have a big change in the repo structure, as in the PR https://github.com/TensorSpeech/TensorFlowASR/pull/177. Please be aware of that 😄 These changes will split the conformer file into an encoder file and a model file, like this. I'm about to finish that PR, so you will have to pull main, create a new branch, and cherry-pick what you've done into the new structure 😄
Understood, I'll look into the new format :)
Regarding the SeparableConv1D: I now see what you mean. It seems odd to me that DepthwiseConv1D only exists as part of that combined layer. This means the implementation is supported internally, it's just not exposed for us to use.
I found this issue/PR (https://github.com/tensorflow/tensorflow/issues/48557) on the TensorFlow repository. They intend to add support for the layer we need. However, the issue was opened less than 24 hrs ago, so we'll have to wait and see how long it takes to be released into tf-nightly.
@andreselizondo-adestech We can build our own `DepthwiseConv1D` 😄 no need to wait until TensorFlow supports it.
@andreselizondo-adestech The refactor PR is merged 😄
@usimarit I'm merging my changes into the refactored code; however, there appears to be an issue using `SentencePiece` for training.
Specifically at tensorflow_asr/featurizers/text_featurizers.py#L342: this function seems to be nowhere to be found, and at the same time the default value for `model` in that function is `None`.
So when it is called from examples/conformer/train.py#L68, `model` is not specified and is therefore `None`.
@andreselizondo-adestech Ah yeah, I missed that part, I'll update it.
@usimarit I've adapted my changes to the refactored repo and everything seems to be working. The next step is to create our own implementation of `DepthwiseConv1D`.
I've been digging into how TF does `SeparableConv1D`, but it just calls `SeparableConv2D` (similar to how you did it).
So I looked into `SeparableConv2D` and `DepthwiseConv2D`, but I couldn't find the implementation for this TF operation.
Could you help me out with this?
@usimarit I took an implementation of `DepthwiseConv1D` I found online and made slight modifications to support causal padding. (Check commit: 50def47)
The model compiles and trains correctly, but we still need to check if it's mathematically equivalent.
This implementation also uses `DepthwiseConv2D` internally, but it first adds padding to the inputs to make them causal.
```python
if self.padding == 'causal':
    inputs = array_ops.pad(inputs, self._compute_causal_padding(inputs))
```
Seems to me like this should work just fine.
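For reference, here is a minimal sketch of the general idea (illustrative only, not the actual code in 50def47): left-pad the time axis by `(kernel_size - 1) * dilation_rate` and run the depthwise convolution with `padding='valid'`, so no frame sees future context.

```python
import tensorflow as tf


def causal_depthwise_conv1d(inputs, kernel_size=32, dilation_rate=1):
    """Left-pad the time axis so the convolution never sees future frames, then run a
    DepthwiseConv2D with padding='valid' on the input expanded to [B, T, 1, C]."""
    left_pad = dilation_rate * (kernel_size - 1)
    padded = tf.pad(inputs, [[0, 0], [left_pad, 0], [0, 0]])    # pad the past, not the future
    padded = tf.expand_dims(padded, axis=2)                     # [batch, time + pad, 1, channels]
    dw = tf.keras.layers.DepthwiseConv2D((kernel_size, 1), padding="valid")
    return tf.squeeze(dw(padded), axis=2)                       # [batch, time, channels]


print(causal_depthwise_conv1d(tf.random.normal([2, 100, 144])).shape)  # (2, 100, 144)
```

In a real layer the `DepthwiseConv2D` would of course be created once in `__init__` rather than inside the call; the padding arithmetic is the part that matters here.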
> @usimarit I've adapted my changes to the refactored repo and everything seems to be working. The next step is to create our own implementation of `DepthwiseConv1D`. I've been digging into how TF does `SeparableConv1D`, but it just calls `SeparableConv2D` (similar to how you did it). So I looked into `SeparableConv2D` and `DepthwiseConv2D`, but I couldn't find the implementation for this TF operation. Could you help me out with this?
Seems like it's from the TF C/C++ library 😄
I'm currently running a test on two VMs: Regular Conformer vs. DepthwiseConv1D Conformer. We'll see the results in maybe ~30 hrs. (I am training on the CommonVoice2 dataset though, so WER results won't be directly comparable to the paper.)
@usimarit In the meantime, I'm not sure how inference should work for the Streaming Conformer. Can you guide me? Do you see something that's missing in the PR?
@usimarit Good news! The two Conformer models converge to the same CER, meaning performance was not impacted negatively by the custom DepthwiseConv1D layer. I trained on chars and the best CER I got was ~5.2. I'll be training on subwords shortly.
In the meantime, I think we should look at how to do streaming inference on the Streaming Conformer Transducer.
@usimarit The next step is looking at the `StreamingConformer` class.
I based the class on the `StreamingTransducer` class, so I don't know if there are methods that should be different.
Mind requesting any changes necessary? Or you could also explain to me how it should work.
> @usimarit The next step is looking at the `StreamingConformer` class. I based the class on the `StreamingTransducer` class, so I don't know if there are methods that should be different. Mind requesting any changes necessary? Or you could also explain to me how it should work.
I haven't had time to dive into how the StreamingConformer works in inference mode, but I think it's quite different from the `RnnTransducer` (previously `StreamingTransducer`). I'll try to make time for this.
But anyway, we should complete the whole pipeline (training, inference, testing, tflite) before merging 😄
@usimarit Hey there, this is just a gentle ping. Do you have anything that might guide me on implementing inference for Streaming Conformer? :)
@andreselizondo-adestech Sorry, I'm currently a bit busy until the end of July. So after that, I can go back to support this feature 😄
Hello @usimarit Are you still interested in implementing this? Let me know if you need my help :smile:
@andreselizondo-adestech Of course, I'll find some free time to help implement the inference for this.
In the meantime, can you help me resolve the conflicts? They're just conflicts about code format and imports: I changed from `autopep8` to `black` and switched to absolute imports instead of relative imports (which is recommended).
@andreselizondo-adestech Hi, are you still working on this? I think we should compute the attention mask in the MHA layer instead of in the dataset, which is kind of similar to the causal attention masking in the v2.x version of tfasr, but with history truncation and a limited future window.
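Something along these lines, maybe (a rough sketch only: it uses the stock `tf.keras.layers.MultiHeadAttention` for illustration rather than the repo's relative-position MHSA, and the window parameters are assumptions):

```python
import tensorflow as tf


def history_limited_mask(num_frames, history_window_size, future_window_size=0):
    """Boolean [T, T] mask: each frame attends to at most `history_window_size` past
    frames and `future_window_size` future frames."""
    q = tf.range(num_frames)[:, tf.newaxis]
    k = tf.range(num_frames)[tf.newaxis, :]
    return tf.logical_and(k >= q - history_window_size, k <= q + future_window_size)


class HistoryLimitedMHSA(tf.keras.layers.Layer):
    """Hypothetical wrapper: the mask is built per call from the runtime sequence
    length instead of being pre-computed in the dataset."""

    def __init__(self, num_heads, key_dim, history_window_size, future_window_size=0, **kwargs):
        super().__init__(**kwargs)
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
        self.history_window_size = history_window_size
        self.future_window_size = future_window_size

    def call(self, inputs, training=False):
        mask = history_limited_mask(
            tf.shape(inputs)[1], self.history_window_size, self.future_window_size
        )
        return self.mha(inputs, inputs, attention_mask=mask, training=training)
```

The point is just that the mask depends only on the sequence length and the two window sizes, so it can be built inside the layer at call time instead of being shipped with each example from the dataset.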