Convergence issue with segmental length model
When using the config in https://gist.github.com/robin-p-schmitt/41da5e1274ccb93be22881f2f1fe91ba, I have the problem that the model does not converge at all. Looking at the learning rate file:
1: EpochData(learningRate=0.0001, error={
'dev_error_ctc': 0.9583588315669095,
'dev_error_label_model/label_prob': 0.6887621695767638,
'dev_error_label_model/length_model': 0.9999999998499519,
'dev_score_ctc': 0.0,
'dev_score_label_model/label_prob': 46.05170047498763,
'dev_score_label_model/length_model': float('nan'),
'devtrain_error_ctc': 0.9618647864437795,
'devtrain_error_label_model/label_prob': 0.6914862408234509,
'devtrain_error_label_model/length_model': 1.0000000012823582,
'devtrain_score_ctc': 0.0,
'devtrain_score_label_model/label_prob': 46.051700532973975,
'devtrain_score_label_model/length_model': float('nan'),
'train_error_ctc': 0.9321049111097007,
'train_error_label_model/label_prob': 0.7323609187074502,
'train_error_label_model/length_model': 0.9986981431224986,
'train_score_ctc': 0.07518191292189359,
'train_score_label_model/label_prob': 45.89842947947897,
'train_score_label_model/length_model': float('nan'),
}),
2: EpochData(learningRate=0.00019999999999999998, error={
'dev_error_ctc': 0.9583588315669095,
'dev_error_label_model/label_prob': 0.6887621695767638,
'dev_error_label_model/length_model': 0.9999999998499519,
'dev_score_ctc': 0.0,
'dev_score_label_model/label_prob': 46.05170047498763,
'dev_score_label_model/length_model': float('nan'),
'devtrain_error_ctc': 0.9618647864437795,
'devtrain_error_label_model/label_prob': 0.6914862408234509,
'devtrain_error_label_model/length_model': 1.0000000012823582,
'devtrain_score_ctc': 0.0,
'devtrain_score_label_model/label_prob': 46.051700532973975,
'devtrain_score_label_model/length_model': float('nan'),
'train_error_ctc': 0.9308947519972764,
'train_error_label_model/label_prob': 0.7300633606144286,
'train_error_label_model/length_model': 1.000000004447767,
'train_score_ctc': 0.0,
'train_score_label_model/label_prob': 46.05170046216968,
'train_score_label_model/length_model': float('nan'),
}),
3: EpochData(learningRate=0.0003, error={
'dev_error_ctc': 0.9583588315669095,
'dev_error_label_model/label_prob': 0.6887621695767638,
'dev_error_label_model/length_model': 0.9999999998499519,
'dev_score_ctc': 0.0,
'dev_score_label_model/label_prob': 46.05170047498763,
'dev_score_label_model/length_model': float('nan'),
'devtrain_error_ctc': 0.9618647864437795,
'devtrain_error_label_model/label_prob': 0.6914862408234509,
'devtrain_error_label_model/length_model': 1.0000000012823582,
'devtrain_score_ctc': 0.0,
'devtrain_score_label_model/label_prob': 46.051700532973975,
'devtrain_score_label_model/length_model': float('nan'),
'train_error_ctc': 0.9309530333980968,
'train_error_label_model/label_prob': 0.7295836710735882,
'train_error_label_model/length_model': 1.0000000031188645,
'train_score_ctc': 0.0,
'train_score_label_model/label_prob': 46.051700434792,
'train_score_label_model/length_model': float('nan'),
}),
The scores and errors do not change at all between epochs, and the scores of the length_model are NaN. I already looked at the targets and the output of the length_model layer, and they look correct (the targets are single numbers and the output is a normalized vector with 20 values). However, the problem seems to be caused by the length_model, because the model converges fine when this layer is not present.
I think the relevant layers are the following (a small stand-alone sketch of the segment bookkeeping follows after the config):
"label_model": {
"back_prop": True,
"class": "rec",
"from": "data:label_ground_truth",
"include_eos": True,
"is_output_layer": True,
"name_scope": "output/rec",
"unit": {
"length_model": {
"activation": "softmax",
"class": "linear",
"from": "length_model0",
"is_output_layer": True,
"loss": "ce",
"target": "segment_lens_target",
},
"length_model0": {
"L2": 0.0001,
"class": "rec",
"dropout": 0.3,
"from": ["non_blank_embed_128", "pooled_segment"],
"n_out": 128,
"unit": "nativelstm2",
"unit_opts": {"rec_weight_dropout": 0.3},
},
"non_blank_embed_128": {
"activation": None,
"class": "linear",
"from": "output",
"n_out": 128,
"with_bias": False,
},
"pool_segments": {"class": "copy", "from": "segments"},
"pooled_segment": {
"axes": ["stag:att_t"],
"class": "reduce",
"from": "pool_segments",
"mode": "mean",
},
"segment_lens": {
"axis": "t",
"class": "gather",
"from": "base:data:segment_lens_masked",
"position": ":i",
},
"segment_starts": {
"axis": "t",
"class": "gather",
"from": "base:data:segment_starts_masked",
"position": ":i",
},
"segments": {
"class": "reinterpret_data",
"from": "segments0",
"set_dim_tags": {
"stag:sliced-time:segments": Dim(
kind=Dim.Types.Spatial, description="att_t"
)
},
},
"segments0": {
"class": "slice_nd",
"from": "base:encoder",
"size": "segment_lens",
"start": "segment_starts",
},
},
},
"output": {
"back_prop": True,
"class": "rec",
"from": "encoder",
"include_eos": True,
"size_target": "targetb",
"target": "targetb",
"unit": {
"const1": {"class": "constant", "value": 1},
"output": {
"beam_size": 4,
"cheating": "exclusive",
"class": "choice",
"from": "data",
"initial_output": 1030,
"input_type": "log_prob",
"target": "targetb",
},
"output_emit": {
"class": "compare",
"from": "output",
"initial_output": True,
"kind": "not_equal",
"value": 1031,
},
"segment_lens": {
"class": "combine",
"from": ["segment_lens0", "const1"],
"is_output_layer": True,
"kind": "add",
},
"segment_lens0": {
"class": "combine",
"from": [":i", "segment_starts"],
"kind": "sub",
},
"segment_starts": {
"class": "switch",
"condition": "prev:output_emit",
"false_from": "prev:segment_starts",
"initial_output": 0,
"is_output_layer": True,
"true_from": ":i",
},
},
},
"segment_lens_masked": {
"class": "masked_computation",
"from": "output/segment_lens",
"mask": "is_label",
"out_spatial_dim": Dim(kind=Dim.Types.Spatial, description="label-axis"),
"register_as_extern_data": "segment_lens_masked",
"unit": {"class": "copy", "from": "data"},
},
"segment_lens_sparse": {
"class": "reinterpret_data",
"from": "segment_lens_masked",
"register_as_extern_data": "segment_lens_target",
"set_sparse": True,
"set_sparse_dim": 20,
},
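To make it easier to follow what these layers compute, here is a small stand-alone sketch in plain Python (not RETURNN) of the segment bookkeeping in the "output" layer; the blank index 1031 and the initial outputs are taken from the config above:

BLANK_IDX = 1031  # "output_emit" compares the chosen label against this value

def segment_bookkeeping(alignment):
    """Mimics the "segment_starts"/"segment_lens" sub-layers per frame i."""
    starts, lens = [], []
    prev_emit = True   # initial_output of "output_emit"
    prev_start = 0     # initial_output of "segment_starts"
    for i, label in enumerate(alignment):
        start = i if prev_emit else prev_start  # the "switch" layer
        length = i - start + 1                  # "segment_lens0" plus "const1"
        starts.append(start)
        lens.append(length)
        prev_emit = label != BLANK_IDX          # "output_emit" (compare, not_equal)
        prev_start = start
    return starts, lens

print(segment_bookkeeping([1031, 1031, 7, 1031, 3]))
# -> ([0, 0, 0, 3, 3], [1, 2, 3, 1, 2])

The masked_computation with mask "is_label" then presumably keeps only the entries at label positions, so the length_model targets should be the per-label segment lengths.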
Do you expect some RETURNN bug here? Usually such NaN/inf issues or convergence issues are user errors.
I see you have "set_sparse_dim": 20 for segment_lens_target.
And then:
"length_model": {
"activation": "softmax",
"class": "linear",
"from": "length_model0",
"is_output_layer": True,
"loss": "ce",
"target": "segment_lens_target",
},
Maybe you should dump the actual targets. If those are outside that range, this could lead to NaN.
Maybe on the RETURNN side, we could add some flag like debug_extra_checks, which could then enable an extra check for valid indices here.
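For example, a minimal offline check on dumped targets could look like this (the file name and the dump mechanism are placeholders, not from the config; any way of getting the target indices into numpy works):

import numpy as np

# Hypothetical: targets dumped to a .npy file from a debug run.
targets = np.load("segment_lens_target_dump.npy").astype(np.int64)
dim = 20  # set_sparse_dim of "segment_lens_target"

print("min target:", targets.min(), "max target:", targets.max())
if targets.min() < 0 or targets.max() >= dim:
    print("out-of-range indices for a CE loss with dim", dim,
          "- this could explain the NaN scores")

# Roughly what a debug_extra_checks flag could do in-graph:
# with tf.control_dependencies([tf.debugging.assert_non_negative(targets_t),
#                               tf.debugging.assert_less(targets_t, dim)]):
#     loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets_t, logits=logits_t)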
> If those are outside that range, this could lead to NaN
Oh okay, I thought this would lead to an error in RETURNN. I will check the targets and will get back here once I know.
@robin-p-schmitt Did you ever check this? What was the result?
Or is this not relevant anymore for you? Then let's close this issue.