Deletions and other errors at start of utterance
(Sorry, I'm keeping track of this in an issue on GitHub, but this is not something accessible off our servers right now.) Guys, as you know I've been working on some optimization ideas. Right now I have a version of the model showing unexpectedly poor WER (on test-clean it gives around 11% or 12% WER, vs. around 6.5% in the baseline, at epoch 19 of train-clean-100), while optimization performance on train and valid looked fine. So I had a look at the errors, and the extra errors are mostly at the start of the utterances: /ceph-dan/icefall/egs/librispeech/ASR/pruned_transducer_stateless7/exp_pradam_exp1d/greedy_search/recogs-test-clean-greedy_search-epoch-19-avg-7-context-2-max-sym-per-frame-1-use-averaged-model.txt
(ROBIN->BBIT) AND THE LITTLE TUMBLER BETWEEN THEM TRIED TO FORCE THE SQUIRE TO STAND BACK AND VERY VALIANTLY DID THESE TWO (COMPORT->COMPANT) THEMSELVES
(THE CAT->T) GROWLED SOFTLY PICKED UP THE PRIZE IN HER JAWS AND TROTTED INTO THE BUSHES TO DEVOUR IT
FRIENDS SAID (MONTFICHET->MONCHET) FAINTLY TO THE WRESTLERS (BEAR->BEARS) US ESCORT SO FAR AS THE SHERIFF'S HOUSE
(HE->*) WAS SOFT HEARTED AND IMPETUOUS SAID BETH AND BEING IN LOVE HE DIDN'T STOP TO COUNT THE COST
BUT THE MORE FORGETFULNESS HAD THEN PREVAILED THE MORE POWERFUL WAS THE FORCE OF REMEMBRANCE WHEN SHE AWOKE
(I DID NOT MEAN->TEN) SAID CAPTAIN (BATTLEAX->BATTLEX) TO TOUCH UPON PUBLIC SUBJECTS AT SUCH A MOMENT AS THIS
(IF->F) IT ONLY WERE NOT SO DARK HERE AND SO TERRIBLY LONELY
(WELL->LE) SAID (MADEMOISELLE DE TONNAY CHARENTE->MAUD MOUSELLE DETER) I ALSO THINK A GOOD DEAL BUT I TAKE CARE
(WITHOUT->LE) HIS (SCRAPBOOKS->SCRAP BOOKS) HIS CHEMICALS AND HIS HOMELY (UNTIDINESS->ENTITINESS) HE WAS AN UNCOMFORTABLE MAN
(EITHER->ITHER) HE CALLS MINISTERS THROUGH THE AGENCY OF MEN OR HE CALLS THEM DIRECTLY AS HE CALLED THE PROPHETS (AND APOSTLES->IN APOSTLE)
(OH->H) SHE'S ALWAYS AT THE PIANO SAID (VAN->ANNE) SHE MUST BE THERE NOW SOMEWHERE AND THEN SOMEBODY LAUGHED
A baseline model that was performing well (these are both trained on libri-100) has its output at /ceph-dan/icefall/egs/librispeech/ASR/pruned_transducer_stateless7/exp_merge_stats2_ceil_0.75_4.0_cond_mf2048/greedy_search/errs-test-clean-greedy_search-epoch-19-avg-7-context-2-max-sym-per-frame-1-use-averaged-model.txt and the corresponding section is:
ROBIN (AND->*) THE LITTLE TUMBLER BETWEEN THEM TRIED TO FORCE THE SQUIRE TO STAND BACK AND VERY VALIANTLY DID THESE TWO (COMPORT->COMP) THEMSELVES
THE CAT GROWLED SOFTLY PICKED UP THE PRIZE IN HER JAWS AND TROTTED INTO THE BUSHES TO DEVOUR IT
FRIENDS SAID (MONTFICHET->MONTFIET) FAINTLY TO THE WRESTLERS BEAR US ESCORT SO FAR AS THE SHERIFF'S HOUSE
HE WAS SOFT HEARTED AND IMPETUOUS SAID BETH AND BEING IN LOVE HE DIDN'T STOP TO COUNT THE COST
BUT THE MORE FORGETFULNESS HAD THEN PREVAILED THE MORE POWERFUL WAS THE FORCE OF REMEMBRANCE WHEN SHE AWOKE
I DID NOT MEAN SAID CAPTAIN (BATTLEAX->BATTLE LIKE) TO TOUCH UPON PUBLIC SUBJECTS AT SUCH A MOMENT AS THIS
IF IT ONLY WERE NOT SO DARK HERE AND SO TERRIBLY LONELY
WELL SAID (MADEMOISELLE DE TONNAY CHARENTE->MAUD MORALE DE TERLAND) I ALSO THINK A GOOD DEAL BUT I TAKE CARE
WITHOUT HIS (SCRAPBOOKS->SCRAP BOOKS) HIS CHEMICALS AND HIS HOMELY (UNTIDINESS->AND TIDINESS) HE WAS AN UNCOMFORTABLE MAN
EITHER HE CALLS MINISTERS THROUGH THE AGENCY OF MEN OR HE CALLS THEM DIRECTLY AS HE CALLED THE PROPHETS (AND->IN) APOSTLES
OH SHE'S ALWAYS AT THE PIANO SAID (VAN->ANNE) SHE MUST BE THERE NOW SOMEWHERE AND THEN SOMEBODY LAUGHED
I have verified the following:
- The branches only differ in ways that would affect them at training time, not test time.
- I can reproduce the old (good-performing) model's decoding, and there aren't code differences that would affect the decoding.
- The issue is not one related to model averaging (it exists even without averaging, e.g. /ceph-dan/icefall/egs/librispeech/ASR/pruned_transducer_stateless7/exp_pradam_exp1d/greedy_search/errs-test-other-greedy_search-epoch-19-avg-1-context-2-max-sym-per-frame-1.txt)
- The search method doesn't seem to have a strong influence, e.g. changing greedy_search to modified_beam_search changes WER from 11.83 to 10.72 on test-clean at --epoch 19 --avg 7, but does not affect the preponderance of errors at the start of the utterance.
- The --max-sym-per-frame setting doesn't have a strong influence, e.g. increasing it from 1 to 2 with greedy_search changed the WER from 11.83 to 11.78.
- The models are very similar in terms of validation and training loss.
Now, the first thought would obviously be that it's something to do with the optimizer, i.e. that the optimizer is not good; but the fact that the errors are disproportionately at the start of utterances makes me think this may be a more specific issue, e.g. something to do with the left-context at the start of the utterance: either the acoustic model's left-context or the symbol left-context.
Note, I am not super up-to-date with master code. Actually I don't want to be 100% up-to-date because I have json.gz, not jsonl.gz, manifests on disk and I want things to be comparable with my previously-run experiments so I don't want to change this.
Does anyone have any ideas? Could it be a previously-solved issue?
Note, I am not super up-to-date with master code. Actually I don't want to be 100% up-to-date because I have json.gz, not jsonl.gz, manifests on disk and I want things to be comparable with my previously-run experiments so I don't want to change this.
If you need to update, you can run lhotse copy manifest.json.gz manifest.jsonl.gz to get an identical manifest in a different format.
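For reference, the same conversion can be done from Python; a minimal sketch, assuming lhotse's load_manifest() and the to_file() method (the output format is inferred from the file extension; the paths here are made up):

```python
# Sketch: convert a json.gz manifest to jsonl.gz via lhotse's Python API,
# equivalent to `lhotse copy`. Paths are illustrative only.
from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_test-clean.json.gz")
cuts.to_file("data/fbank/cuts_test-clean.jsonl.gz")
```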
I tried debugging this by extending all utterances with silence on both sides, to match the longest utterance, as in
load_manifest(self.args.manifest_dir / "cuts_test-clean.json.gz").pad(direction='both').fill_supervisions()
(I know this is not ideal.) The WER on test-clean degraded from about 11.7 to 12.3. The deletions at utterance start disappeared, but the word LE was inserted at the start of almost every utterance:
(*->LE) THE COUNT SHOOK HIS HEAD
(LOVE->VE) IS A BABE THEN MIGHT I NOT SAY SO TO GIVE FULL GROWTH TO THAT WHICH STILL DOTH GROW
(*->PE) THANK YOU RACHEL MY COUSIN RACHEL MY ONLY FRIEND (*->I)
(*->AR) THE (DUCHESS->DUCHES) OF SOUTHBRIDGE TO (LORD REGGIE OH REGGIE->LORDY ALREADY) WHAT DID YOU SAY
(THE->*) COLORIST SAYS FIRST OF ALL AS MY DELICIOUS PAROQUET WAS RUBY SO THIS NASTY VIPER SHALL BE BLACK AND THEN IS THE QUESTION CAN I ROUND HIM OFF EVEN THOUGH HE IS BLACK AND MAKE HIM (SLIMY->SLI ME) AND YET (SPRINGY->SPRINGING) AND CLOSE DOWN CLOTTED LIKE A POOL OF BLACK BLOOD ON THE EARTH ALL THE SAME
(*->LE) MISSUS (NEVERBEND->NEVER BEEN) YOU MUST INDEED BE PROUD OF YOUR SON
(*->LE) MISS WOODLEY WAS TOO LITTLE VERSED IN THE SUBJECT TO KNOW THIS WOULD HAVE BEEN NOT TO LOVE AT ALL AT LEAST NOT TO THE EXTENT OF BREAKING THROUGH ENGAGEMENTS AND ALL THE VARIOUS OBSTACLES THAT STILL (MILITATED->MITIGATED) AGAINST THEIR UNION
(*->LE) AMONG OTHER PERSONS OF DISTINCTION WHO UNITED THEMSELVES TO HIM WAS LORD (NAPIER->APIER) OF (MERCHISTON->MURCHISON) SON OF THE FAMOUS INVENTOR OF THE LOGARITHMS THE PERSON TO WHOM THE TITLE OF A GREAT MAN IS MORE JUSTLY DUE THAN TO ANY OTHER WHOM HIS COUNTRY EVER PRODUCED
(*->LE) EXQUISITE SOFT TURF OF THE WOODS THE HAPPINESS WHICH YOUR FRIENDSHIP CONFERS UPON ME
If we subtract (number of sentences / number of ref words) from the WER, i.e. (2600 / 48775) = 5.3%, to correct for the fact that almost every sentence has an insertion at the beginning, this takes the WER from 12.3% to 7%, which is not far from what I was expecting in the first place.
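To spell out that arithmetic (a quick sketch using the counts quoted above):

```python
# One spurious insertion per utterance adds num_utts / num_ref_words to the WER.
num_utts, num_ref_words = 2600, 48775
correction = 100 * num_utts / num_ref_words  # about 5.3% absolute
print(round(12.3 - correction, 1))  # about 7.0
```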
I think we may need to make our models more robust to the presence of exact silence before and after the cut, by sometimes (randomly) inserting exact silence before or after the utterance, e.g. as in the sketch below. But that may be a separate issue from why the model was deleting things at the start of the utterance; I don't see very clearly how that could be caused by a train/test mismatch.
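Something along these lines might work; a rough sketch, assuming lhotse's Cut.pad() and CutSet.map() (the probability and padding durations here are made-up values, not tested settings):

```python
import random

from lhotse import CutSet

def random_silence_pad(cuts: CutSet, prob: float = 0.5, max_pad: float = 0.5) -> CutSet:
    # Randomly extend some cuts with exact silence on the left, right, or both sides.
    def maybe_pad(cut):
        if random.random() < prob:
            cut = cut.pad(
                duration=cut.duration + random.uniform(0.1, max_pad),
                direction=random.choice(["left", "right", "both"]),
                preserve_id=True,
            )
        return cut
    return cuts.map(maybe_pad)
```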
I tried using the training dataloader, including SpecAugment, using
test_clean_dl = librispeech.train_dataloaders(test_clean_cuts)
(this was previously test_dataloaders()), but the same pattern of errors persists, with of course even worse WERs.
I think I have found the issue. In decoder.py https://github.com/k2-fsa/icefall/blob/385645d5333058d728bed5f8845d598d5f34dae0/egs/librispeech/ASR/pruned_transducer_stateless2/decoder.py#L96 there is code like this:
if need_pad is True:
    embedding_out = F.pad(
        embedding_out, pad=(self.context_size - 1, 0)
    )
else:
... this pads the decoder embeddings on the left with 0. IMO this isn't quite right: it would only be guaranteed correct if the embedding for blank were zero. Instead, IMO, we should be padding the labels y with an extra blank-id (since we use blank-id for the SOS context), which of course is also zero. @ngoel17 this might have been your problem as well, assuming your problems were actually at the start of the utterance, i.e. the first words.
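To make the proposed change concrete, here is a toy comparison (a sketch, not the recipe code; the shapes and ids are made up). The current code embeds y and then zero-pads the embedding output on the left; padding the label sequence itself with blank_id gives the same result only when the blank embedding is exactly zero:

```python
import torch
import torch.nn.functional as F

context_size = 2
blank_id = 0
embedding = torch.nn.Embedding(500, 512)
y = torch.tensor([[5, 7, 9]])  # (N, U) label ids, made up

# Current behavior: embed, then zero-pad the U axis on the left.
out_pad_embeddings = F.pad(embedding(y), pad=(0, 0, context_size - 1, 0))

# Proposed: left-pad the labels with blank_id, then embed.
y_padded = F.pad(y, pad=(context_size - 1, 0), value=blank_id)
out_pad_labels = embedding(y_padded)

# Equal only if embedding.weight[blank_id] is exactly zero.
print(torch.allclose(out_pad_embeddings, out_pad_labels))
```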
I am trying a fix like this to beam_search.py:
@@ -381,7 +382,7 @@ def greedy_search_batch(
         dtype=torch.int64,
     )  # (N, context_size)
-    decoder_out = model.decoder(decoder_input, need_pad=False)
+    decoder_out = model.decoder(decoder_input[:, -1:], need_pad=True)
... it helps the WER by about 2% absolute, which is half of what I was hoping for, and, oddly, it does not completely get rid of the deletions at the start of utterances; about half of them are still there:
(HE->*) WAS SOFT HEARTED AND IMPETUOUS SAID BETH AND BEING IN LOVE HE DIDN'T STOP TO COUNT THE COST
HE CRIED IN (*->A) HIGH DUDGEON JUST AS IF HE OWNED THE WHOLE OF THE PEPPERS AND COULD DISPOSE OF THEM ALL TO SUIT HIS FANCY
(BUT THE->*) MORE FORGETFULNESS HAD THEN PREVAILED THE MORE POWERFUL WAS THE FORCE OF REMEMBRANCE WHEN SHE AWOKE
(I DID NOT MEAN->TEN) SAID CAPTAIN (BATTLEAX->BATTLEX) TO TOUCH UPON PUBLIC SUBJECTS AT SUCH A MOMENT AS THIS
THEY POINTEDLY DREW BACK FROM JOHN (JAGO->YAGO) AS HE APPROACHED THE EMPTY CHAIR NEXT TO ME AND MOVED ROUND TO THE OPPOSITE SIDE OF THE TABLE
(IF IT ONLY->FING) WERE NOT SO DARK HERE AND SO TERRIBLY LONELY
(WELL->LE) SAID (MADEMOISELLE DE TONNAY CHARENTE->MAUD MUSELLE TO IN A SHAHALENT) I ALSO THINK A GOOD DEAL BUT I TAKE CARE
WITHOUT HIS (SCRAPBOOKS->SCRAP BOOKS) HIS CHEMICALS AND HIS HOMELY (UNTIDINESS->ENTITINESS) HE WAS AN UNCOMFORTABLE MAN
EITHER HE CALLS MINISTERS THROUGH THE AGENCY OF MEN OR HE CALLS THEM DIRECTLY AS HE CALLED THE PROPHETS (AND->IN) APOSTLES
OH SHE'S ALWAYS AT THE PIANO SAID (VAN->ANNE) SHE MUST BE THERE NOW SOMEWHERE AND THEN SOMEBODY LAUGHED
... acoustic padding on top of this, of the type I mentioned above, does not seem to make any difference once this decoder code change is in place. (Note: I think this change should in principle only fully solve the mismatch if context_size == 2; if context_size > 2, we would need need_pad=True for at least one non-initial position. And I may have figured out why it doesn't fully fix things: if the 1st symbol is blank, we may end up in a loop and have the same problem inside the loop.)
Also, if I make a similar fix to greedy_search(), by providing --max-sym-per-frame 2, which stops decode.py from using greedy_search_batch(), it fully resolves the issue -- likely because greedy_search() doesn't recompute the decoder context inside the loop until it sees a new symbol. WER goes from 11.78% to 6.34%.
index ce8b04a..b49f46e 100644
--- a/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py
+++ b/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py
@@ -282,7 +282,7 @@ def greedy_search(
         [blank_id] * context_size, device=device, dtype=torch.int64
     ).reshape(1, context_size)
-    decoder_out = model.decoder(decoder_input, need_pad=False)
+    decoder_out = model.decoder(decoder_input[:, -1:], need_pad=True)
     decoder_out = model.joiner.decoder_proj(decoder_out)
@csukuangfj I think the right "temporary patch" to fix the decoding issue would be to pass a negative id to the Decoder module at the start of utterances; this would fix the loop issue. We can modify the Decoder module to use that negative id as a mask (we can do .clamp(min=0) before the nn.Embedding forward and then multiply the result by (label >= 0)). Perhaps you or Kangwei could work with Zhehuai on this if he is at work? I'd like to have you showing the others how to do things, so we spread out the knowledge a bit.
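A rough sketch of that masking idea (standalone toy code; the names and sizes are made up):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(500, 512)
y = torch.tensor([[-1, -1, 42]])  # negative ids mark start-of-utterance context

# Clamp so the embedding lookup is valid, then zero out rows that were negative.
emb = embedding(y.clamp(min=0)) * (y >= 0).unsqueeze(-1)
```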
this pads the decoder embeddings on the left with 0. IMO this isn't quite right, it would only be guaranteed correct if the embedding for blank was zero.
Indeed, the embedding for the blank is always 0. https://github.com/k2-fsa/icefall/blob/385645d5333058d728bed5f8845d598d5f34dae0/egs/librispeech/ASR/pruned_transducer_stateless2/decoder.py#L60
Note that we have set padding_idx to the blank_id, which ensures that the embedding for the blank is always 0.
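For anyone following along, this property of nn.Embedding is easy to check in isolation (a minimal sketch, not the recipe code):

```python
import torch.nn as nn

blank_id = 0
embedding = nn.Embedding(num_embeddings=500, embedding_dim=512, padding_idx=blank_id)

# The padding_idx row is initialized to zero and receives no gradient, so
# standard optimizers keep it exactly zero throughout training.
assert embedding.weight[blank_id].abs().sum().item() == 0.0
```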
Ah, OK, I didn't realize (or I forgot) that nn.Embedding has this option. The optimization method I was working with doesn't guarantee that parameters with zero grads will have zero update, hence the issue. [Incidentally, after debugging it, I found that it was due to very slight errors in the SVD output: nonzero terms like 1e-23 that should have been exactly zero were eventually propagating their way into the projections used, causing gradients that should have been zero to become nonzero.]
Anyway, next time we refactor for some other reason, let's do the padding at the symbol level and remove the padding_idx option, just to simplify things slightly and not rely on this property.
Yes - My issue can be described exactly as random occasional deletions that get triggered at the beginning of the utterance. Will check out the fixes.
I see a similar phenomenon (deletions at the beginning or the end of the sentence) in Kaldi too, but only for certain models (not always). We never figured out why it was so concentrated at either the beginning or the end. The DNN's raw outputs seemed to suggest bad training, and changing the training changed things slightly but didn't totally fix the problem.
Hm, perhaps we always have more padding silence during training, at the start of the utterance, and the model isn't expecting the speech to start right away?
My issue turned out to be something quite specific to the optimizer I was using.
In a decode that I did using sherpa, substitutions are 7% and insertions 2%, but deletions are 17%, so I see the deletion issue quite prominently. At the moment it could be sherpa-specific, or it could be specific to the nature of my test data. I am still looking, but clearly, if I can get the deletions under control, things will be much better. If you have suggestions, please advise what to try. The basic recipe is pruned_transducer_stateless3 + fast_beam_search (no HLG).
@ngoel17 Could you describe which decoding method and which file from sherpa are you using? Also, are the deletions at the end of an utterance most of the time?
Do all the decoding methods have such a high deletion rate?
I plan to decode several different ways and report back, but I'm basically focusing on streaming, pruned_transducer_statelessX, and fast_beam_search.
I am looking at the deletion pattern, and it appears to be in bursts, around a meaningful cluster of words, for example, a complete phrase or a completely spelled-out abbreviation.
@csukuangfj Sorry about the slowness on my end. I need to segment the audio files, as they are too long for k2 to decode without segmentation (OOM even on a large CPU). However, it made me think that maybe what is happening (since sherpa is endpointing internally) is that after a certain number of endpoints, one complete segment sees very high deletion, and from the next segment onwards things are normal again. That would probably explain what I am seeing. As a side question, is there a way to make the server handle more throughput? I could push 50 channels on a T4, but if I do more, the CPU is not hitting 100% and the GPU is not hitting 100%, yet some channels start dropping out because of timeouts. Did you make feature extraction truly multithreaded? Where is the parameter that controls the number of feature-extraction threads?
Could you please try the C++ websocket server? I have tested that it can handle hundreds of concurrent connections.
I think the C++ server needs the following improvements:
- Client: add a sampling-rate option; add a packet-size option (preferably send 16-bit integers rather than floats); don't send the entire file in one shot, because for half-hour-long files the packet size is too big.
- Server: start decoding the moment any audio is received; if the client forcefully disconnects, stop decoding even if there is a lot of pending audio (the client is gone) and destroy the decoding task.
@ngoel17 Could you describe which decoding method and which file from sherpa are you using? Also, are the deletions at the end of an utterance most of the time?
Do all the decoding methods have such a high deletion rate?
fast_beam_search has a high deletion rate, and the deletions happen in bursts. modified_beam_search has a much lower deletion rate; the substitutions increase (I guess some deletions become substitutions), but the overall WER is pretty decent. Greedy search has a slightly higher WER than beam search, but only slightly. fast_beam_search is always significantly worse, mainly due to the deletion problem.