
Loss is 0 in step 200 and assertion error.


I'm really really sorry to ask you a trivial question.

train.py train_bert_base works well!


I have Korean data in the CoNLL-2012 format, and it works with the e2e-coref model.

I want to apply the Korean data to your model, so I ran minimize.py and then train.py. But the loss drops to 0 at step 200, and an assertion error occurs at eval time.

Assertion error log:

Traceback (most recent call last):
  File "train.py", line 64, in <module>
    eval_summary, eval_f1 = model.evaluate(session, tf_global_step)
  File "/data/BERT-coref/coref/independent.py", line 559, in evaluate
    coref_predictions[example["doc_key"]] = self.evaluate_coref(top_span_starts, top_span_ends, predicted_antecedents, example["clusters"], coref_evaluator)
  File "/data/BERT-coref/coref/independent.py", line 524, in evaluate_coref
    predicted_clusters, mention_to_predicted = self.get_predicted_clusters(top_span_starts, top_span_ends, predicted_antecedents)
  File "/data/BERT-coref/coref/independent.py", line 499, in get_predicted_clusters
    assert i > predicted_index, (i, predicted_index)
AssertionError: (0, 0)

This is my experiments.conf:

train_kor = ${best}{
  num_docs = 2411
  bert_learning_rate = 1e-05
  task_learning_rate = 0.0002
  max_segment_len = 128
  ffnn_size = 800
  train_path = ${data_dir}/train.kor.128.jsonlines
  eval_path = ${data_dir}/dev.kor.128.jsonlines
  conll_eval_path = ${data_dir}/dev.kor.v4_gold_conll
  max_training_sentences = 8
  bert_config_file = ${best.log_root}/multi_cased_L-12_H-768_A-12/bert_config.json
  vocab_file = ${best.log_root}/multi_cased_L-12_H-768_A-12/vocab.txt
  tf_checkpoint = ${best.log_root}/multi_cased_L-12_H-768_A-12/bert_model.ckpt
  init_checkpoint = ${best.log_root}/multi_cased_L-12_H-768_A-12/bert_model.ckpt
}

I changed the BERT model to multi_cased and edited num_docs to 2411; that number comes from the output of minimize.py (is this correct?).

The default ffnn_size and max_training_sentences caused a memory error, so I changed ffnn_size and max_training_sentences.

Is there anything I'm missing?

fairy-of-9 avatar Sep 01 '19 14:09 fairy-of-9

Not a trivial question at all. Thanks for your interest! The evaluation error is likely a product of the optimizer going off the rails. At a high level, I would first check if the data are correct and then move on to the optimization. A few comments and questions so that I can get a better sense of what might be wrong.

  1. Did you make sure that minimize.py uses the same vocab file as your config? More generally, does train.kor.128.jsonlines look reasonable? It should have 2411 lines, the tokenization should make sense, and the values in the clusters key should also make sense (see the sketch after this list).
  2. I'm not familiar with the Korean data. It would be good to check the documentation to confirm that the number of documents in the dataset is indeed 2411.
  3. One more debugging strategy would be to try the BERT multilingual model on the Ontonotes Chinese data. This is just a sanity check to verify that the multilingual model "works" with the current code. I vaguely remember that I got it to work at some point, so it should work now.
  4. If you're convinced that the data are fine, then this might be an optimization issue. Large transformer models, especially with more involved task architectures, are hard to optimize. Might be a good idea to try a bunch of learning rates.
  5. How much GPU memory do you have? Might be a good idea to reduce max_training_sentences even further and bump up ffnn_size.
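
A quick way to do the checks in (1), as a minimal sketch assuming the jsonlines layout produced by minimize.py (one JSON document per line with "sentences" and "clusters" keys; adjust the file path to your data_dir):

import json

# Sanity-check train.kor.128.jsonlines: count documents and verify that every
# cluster span points at a valid subtoken index.
num_docs = 0
with open("train.kor.128.jsonlines") as f:
    for line in f:
        doc = json.loads(line)
        num_docs += 1
        num_subtokens = sum(len(segment) for segment in doc["sentences"])
        for cluster in doc["clusters"]:
            for start, end in cluster:
                assert 0 <= start <= end < num_subtokens, (doc["doc_key"], start, end)
print(num_docs)  # should be 2411 if minimize.py processed all training documents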

mandarjoshi90 avatar Sep 01 '19 21:09 mandarjoshi90

I'm using Titan Xp.

This is the log of train.py train_kor:

I0902 19:18:27.344511 140548120766208 train.py:58] [0] loss=50.37, steps/s=0.00
I0902 19:18:54.144479 140548120766208 train.py:58] [100] loss=9.06, steps/s=2.47
I0902 19:19:21.534149 140548120766208 train.py:58] [200] loss=0.00, steps/s=2.94
I0902 19:19:49.008810 140548120766208 train.py:58] [300] loss=0.00, steps/s=3.14
I0902 19:20:16.368205 140548120766208 train.py:58] [400] loss=0.00, steps/s=3.26
I0902 19:20:43.680098 140548120766208 train.py:58] [500] loss=0.00, steps/s=3.33
I0902 19:21:10.876853 140548120766208 train.py:58] [600] loss=0.00, steps/s=3.38
I0902 19:21:37.686788 140548120766208 train.py:58] [700] loss=0.00, steps/s=3.43
I0902 19:22:04.828290 140548120766208 train.py:58] [800] loss=0.00, steps/s=3.46
I0902 19:22:32.054537 140548120766208 train.py:58] [900] loss=0.00, steps/s=3.48
I0902 19:22:59.043721 140548120766208 train.py:58] [1000] loss=0.00, steps/s=3.50
Loaded 248 eval examples.
2019-09-02 19:23:07.617607: W tensorflow/core/kernels/queue_base.cc:277] _0_padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "train.py", line 64, in <module>
    eval_summary, eval_f1 = model.evaluate(session, tf_global_step)
  File "/data/BERT-coref/coref/independent.py", line 559, in evaluate
    coref_predictions[example["doc_key"]] = self.evaluate_coref(top_span_starts, top_span_ends, predicted_antecedents, example["clusters"], coref_evaluator)
  File "/data/BERT-coref/coref/independent.py", line 524, in evaluate_coref
    predicted_clusters, mention_to_predicted = self.get_predicted_clusters(top_span_starts, top_span_ends, predicted_antecedents)
  File "/data/BERT-coref/coref/independent.py", line 499, in get_predicted_clusters
    assert i > predicted_index, (i, predicted_index)
AssertionError: (0, 148)

Sometimes the loss is NaN at step 100.

I checked my train conll and jsonlines files. They have 2411 documents. "clusters": [[[1575, 1582], [252, 254], [1284, 1286], [1778, 1784], [1821, 1823], [1162, 1164], [1930, 1932], [1, 15], [1084, 1090], [1587, 1589], [847, 849]], ..., [[978, 985], [925, 936]]] also makes sense!

The doc_key values in my jsonlines are not "bc", "bn", "mz", "nw", "pt", "tc", or "wb", e.g. "doc_key": "1726968.json_0". Could this cause an error?

I tried the Chinese data and it works... and I'm trying a bunch of learning rates.

Can you tell me what task_learning_rate is?

fairy-of-9 avatar Sep 02 '19 10:09 fairy-of-9

Hmm this is quite odd. The doc_key is basically a genre embedding, and having a different set of genres should be fine. There are two learning rates -- one for the BERT parameters and one for the task (coreference-specific) parameters. The task LR should be larger than the BERT LR since we don't want the BERT params to change a lot. But it might be a good idea to set both of them to 1e-5 just to debug.
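
For reference, the genre lookup follows roughly this pattern (a paraphrase of what tensorize_example does, not necessarily the exact repo code), which is why an unknown doc_key prefix just falls back to the first genre rather than causing an error:

# Genres come from the config; OntoNotes doc_keys start with a two-letter genre code.
genres = {g: i for i, g in enumerate(["bc", "bn", "mz", "nw", "pt", "tc", "wb"])}

doc_key = "1726968.json_0"             # a non-OntoNotes doc_key like yours
genre_id = genres.get(doc_key[:2], 0)  # unknown prefix "17" falls back to genre 0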

mandarjoshi90 avatar Sep 02 '19 18:09 mandarjoshi90

I've been trying your code every day. Sometimes the loss is fine (I don't understand why), but the assertion error still occurs frequently, like this:

I0904 23:36:20.775774 140515556620032 train.py:58] [0] loss=52.75, steps/s=0.00
I0904 23:36:48.853983 140515556620032 train.py:58] [100] loss=56.96, steps/s=2.49
I0904 23:37:16.643731 140515556620032 train.py:58] [200] loss=106.59, steps/s=2.94
I0904 23:37:44.761991 140515556620032 train.py:58] [300] loss=167.72, steps/s=3.12
I0904 23:38:13.796795 140515556620032 train.py:58] [400] loss=174.39, steps/s=3.20
I0904 23:38:42.192664 140515556620032 train.py:58] [500] loss=126.75, steps/s=3.26
I0904 23:39:10.581388 140515556620032 train.py:58] [600] loss=113.55, steps/s=3.30
I0904 23:39:38.504084 140515556620032 train.py:58] [700] loss=59.20, steps/s=3.34
I0904 23:40:07.391193 140515556620032 train.py:58] [800] loss=79.45, steps/s=3.35
I0904 23:40:35.656905 140515556620032 train.py:58] [900] loss=45.97, steps/s=3.37
I0904 23:41:03.529552 140515556620032 train.py:58] [1000] loss=42.10, steps/s=3.39
Loaded 248 eval examples.
Evaluated 1/248 examples.
2019-09-04 23:41:11.234775: W tensorflow/core/kernels/queue_base.cc:277] _0_padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "train.py", line 66, in <module>
    eval_summary, eval_f1 = model.evaluate(session, tf_global_step)
  File "/data/BERT-coref/coref/independent.py", line 568, in evaluate
    coref_predictions[example["doc_key"]] = self.evaluate_coref(top_span_starts, top_span_ends, predicted_antecedents, example["clusters"], coref_evaluator)
  File "/data/BERT-coref/coref/independent.py", line 533, in evaluate_coref
    predicted_clusters, mention_to_predicted = self.get_predicted_clusters(top_span_starts, top_span_ends, predicted_antecedents)
  File "/data/BERT-coref/coref/independent.py", line 508, in get_predicted_clusters
    assert i > predicted_index, (i, predicted_index)
AssertionError: (0, 606)

Can you explain this assertion? I think it prevents a later span from being considered an antecedent of an earlier one, right? Is it essential?

fairy-of-9 avatar Sep 04 '19 14:09 fairy-of-9

From a quick glance, I think so. I suspect getting rid of it will mess up the evaluation, but you can still try it out. The official coref evaluation is rather complicated in that it needs to call perl scripts. It might be a good idea to turn those off (eval_mode=False in independent.py) and just use the unofficial eval in python.

The problem seems to be in antecedent_scores. I would check that variable to see why it's making those predictions. If I understand this right, subtoken 0 should not have any antecedents and should not be a mention, since it's the CLS token. Perhaps all mention/antecedent scores are the same? Or your clusters have CLS in them as a mention. I would check all those variables and data to see what's going on.
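
For context, the assertion lives in a loop roughly like the following (a simplified paraphrase of get_predicted_clusters, not the exact repo code). It encodes the invariant that a predicted antecedent must be an earlier top span than the mention pointing at it, so a failure like (0, 148) means span 0 was assigned an antecedent it should never have been able to select:

def get_predicted_clusters(top_span_starts, top_span_ends, predicted_antecedents):
    # predicted_antecedents[i] is the index of the top span chosen as antecedent of span i,
    # or -1 if span i has no antecedent (the dummy antecedent won the argmax).
    predicted_clusters, mention_to_cluster = [], {}
    for i, predicted_index in enumerate(predicted_antecedents):
        if predicted_index < 0:
            continue
        # Antecedents are supposed to be restricted to earlier top spans, so this must hold.
        # (0, 0) or (0, 148) means span 0 (likely the [CLS] subtoken) was given an antecedent.
        assert i > predicted_index, (i, predicted_index)
        antecedent = (int(top_span_starts[predicted_index]), int(top_span_ends[predicted_index]))
        mention = (int(top_span_starts[i]), int(top_span_ends[i]))
        cluster_id = mention_to_cluster.setdefault(antecedent, len(predicted_clusters))
        if cluster_id == len(predicted_clusters):
            predicted_clusters.append([antecedent])
        predicted_clusters[cluster_id].append(mention)
        mention_to_cluster[mention] = cluster_id
    return predicted_clusters, mention_to_cluster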

mandarjoshi90 avatar Sep 05 '19 02:09 mandarjoshi90

Thank you so much for your effort..

fairy-of-9 avatar Sep 05 '19 13:09 fairy-of-9

Closing due to inactivity. Please feel free to reopen if something comes up. Thanks!

mandarjoshi90 avatar Sep 11 '19 01:09 mandarjoshi90

As you suspected, the problem seems to be in antecedent_scores, so I checked it.

While I was checking the code, I found something I didn't understand. I think there is a bug here:

def coarse_to_fine_pruning(self, top_span_emb, top_span_mention_scores, c):
    k = util.shape(top_span_emb, 0)
    top_span_range = tf.range(k) # [k]
    #[0,1,...k-1]

    antecedent_offsets = tf.expand_dims(top_span_range, 1) - tf.expand_dims(top_span_range, 0) # [k, k]
    '''
    0 -1 -2 -3 ... -k+1   <<[cls]
    1 0 -1 -2 ...         << 1st real-text token
    2 1 0 ...
    ...
    k-1 k-2 k-3 ...  0
    '''

    antecedents_mask = antecedent_offsets >= 1 # [k, k]
    '''
    F F F F ... F  
    T F F F ... F
    T T F F ... F                << I think should be changed.
    ...
    T T T T ... F

    '''

I think antecedents_mask should be changed to

F F F F ... F  
F F F F ... F
F T F F ... F        
 ...
F T T T ... F         
#antecedents_mask[0][*] = False

because token[0] is [CLS].
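
If you want to experiment with that, one way to express the mask you drew (column 0 forced to False, so the first top span, assumed to be [CLS], can never be selected as an antecedent) is a small change on top of the snippet above. This is only a sketch reusing k and antecedent_offsets from that function:

# Original constraint: antecedents must be earlier top spans.
earlier_only = antecedent_offsets >= 1                            # [k, k]
# Extra constraint: top span 0 (assumed to be the [CLS] subtoken) is never an antecedent.
not_first_span = tf.expand_dims(tf.range(k) > 0, 0)               # [1, k]; column 0 is False
antecedents_mask = tf.logical_and(earlier_only, not_first_span)   # [k, k]

Note that row 0 is already all False under the original offsets >= 1 condition, so the extra mask only changes whether span 0 can be chosen as an antecedent of later spans.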

And... I still run into NaN loss and the assert i > predicted_index error every day... T.T


Can you re-open this issue? I want to hear other people's opinions.

fairy-of-9 avatar Sep 25 '19 14:09 fairy-of-9

@fairy-of-9 Did you try training with that change?

CCing @freesunshine0316 who might be interested.

mandarjoshi90 avatar Sep 25 '19 21:09 mandarjoshi90

There is no change in the code. I just use Korean data that works well with the e2e model.

fairy-of-9 avatar Sep 26 '19 01:09 fairy-of-9

I want to find out for which document loss == NaN occurs.

This is in train.py:

while True:
    tf_loss, tf_global_step, _ = session.run([model.loss, model.global_step, model.train_op])
    accumulated_loss += tf_loss
    if math.isnan(tf_loss):
        pass  # print the current doc_key here   << but I don't have an idea how to make this line work

Can you give me some advice on this?

fairy-of-9 avatar Sep 26 '19 12:09 fairy-of-9

I'm AFK for the weekend. Off the top of my head, one way would be to add an integer document ID to input_props. You can then print or tf.Print that with the loss in both the dev and train loops. I can take a closer look when I get back.
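
A rough sketch of that idea, in case it helps (the exact shapes of input_props, tensorize_example, and get_predictions_and_loss in independent.py may differ; doc_id and the doc_key_to_id map are hypothetical additions, not existing repo code):

# In __init__: add one extra scalar int input, e.g. self.input_props.append((tf.int32, []))

# In tensorize_example: append an integer ID for the current document to the returned tensors.
doc_id = np.array(doc_key_to_id[example["doc_key"]])   # hypothetical doc_key -> int mapping
example_tensors = example_tensors + (doc_id,)

# In get_predictions_and_loss: attach the ID to the loss so it is logged on every step.
loss = tf.Print(loss, [doc_id, loss], message="doc_id / loss: ")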

mandarjoshi90 avatar Sep 27 '19 01:09 mandarjoshi90

@fairy-of-9 Hi, I have a similar concern here. Can you check the code's behavior when the offsets are fed into the bucket_distance function? I think the negative values will be a problem.
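
One quick way to check this (a sketch using the TF 1.x tf.Print op, added right before the offsets are used): log the range of the offsets tensor during training.

# If negative offsets reach bucket_distance, the minimum printed here will be < 0.
antecedent_offsets = tf.Print(
    antecedent_offsets,
    [tf.reduce_min(antecedent_offsets), tf.reduce_max(antecedent_offsets)],
    message="antecedent_offsets min/max: ")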

freesunshine0316 avatar Sep 27 '19 05:09 freesunshine0316

@mandarjoshi90 Thanks a lot!

@freesunshine0316 I will check the function on Monday!

fairy-of-9 avatar Sep 27 '19 07:09 fairy-of-9

Hi, thanks @mandarjoshi90 and everyone for sharing your efforts with us.

I have a short question regarding training this model on another language, and I don't want to create a separate issue for this. Is it required to change the vocab.txt file and fill it with the chosen language's words? (I think this one is quite trivial :D, but this is my first experience using BERT/SpanBERT)

AradAshrafi avatar Mar 31 '20 16:03 AradAshrafi