Medical-Report-Generation

About _epoch_train and _epoch_val

Open fireholder opened this issue 5 years ago • 21 comments

While training, I ran into a problem: the progress came to a standstill. I found that the functions _epoch_train and _epoch_val stopped it, since they raise NotImplementedError. I wonder why this happens and how to fix it.
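For context, a NotImplementedError in methods like these usually points to a base-class pattern: the base class owns the training loop and expects a subclass to override the per-epoch methods. The sketch below is only an illustration of that pattern under this assumption; `TrainerBase` and `ToyTrainer` are hypothetical names, not classes from this repository.

```python
class TrainerBase:
    """Hypothetical base class: owns the loop, leaves per-epoch work to subclasses."""

    def train(self, epochs):
        history = []
        for _ in range(epochs):
            train_loss = self._epoch_train()
            val_loss = self._epoch_val()
            history.append((train_loss, val_loss))
        return history

    def _epoch_train(self):
        # Placeholder: calling this on the base class itself fails on purpose.
        raise NotImplementedError

    def _epoch_val(self):
        raise NotImplementedError


class ToyTrainer(TrainerBase):
    """Hypothetical subclass that supplies the two missing methods."""

    def _epoch_train(self):
        return 1.0  # made-up loss value

    def _epoch_val(self):
        return 0.5  # made-up loss value


history = ToyTrainer().train(epochs=2)
```

If the error is raised at runtime, the object being trained is probably the base class rather than the subclass that overrides these two methods.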

fireholder avatar Aug 05 '19 01:08 fireholder

Hi, I am trying to run trainer.py, but I don't understand the argument "--load_model_path". There is nothing in the current folder, and I am not sure what kind of pretrained model needs to be loaded here. Any advice?

Ike-yang avatar Aug 05 '19 08:08 Ike-yang

I think '--load_model_path' is only used for a pretrained model, but log.txt shows an error when no model files are loaded.

fireholder avatar Aug 05 '19 08:08 fireholder

Exactly, I got something like this in the logs.txt file:

Vocab Size:1173
[Load Model Failed] [Errno 2] No such file or directory: ''
[Load Model Failed] [Errno 21] Is a directory: '.'
[Load MLC Failed [Errno 21] Is a directory: '.'!]
[Load Co-attention Failed [Errno 21] Is a directory: '.'!]
[Load Sentence model Failed [Errno 21] Is a directory: '.'!]
[Load Word model Failed [Errno 21] Is a directory: '.'!]
Namespace(attention_version='v4', batch_size=16, caption_json='./data/new_data/.......

I thought the program had just stopped here because of the error messages. So can I just ignore the messages and keep training? Are there other places that need to be modified?
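The two error codes in that log correspond to an empty path (Errno 2) and a path that is a directory (Errno 21). A minimal sketch of a guard that treats both cases as "no checkpoint, train from scratch" is shown below; `resolve_checkpoint` is a hypothetical helper, not a function from this repository.

```python
import os


def resolve_checkpoint(path):
    """Return path if it names an existing file, otherwise None.

    An empty string or a directory such as '.' yields None, which the
    caller can treat as "no checkpoint, start from scratch".
    """
    if not path:
        return None
    if not os.path.isfile(path):
        return None
    return path


# Both values that appear in the log above resolve to "no checkpoint".
empty_result = resolve_checkpoint("")
dir_result = resolve_checkpoint(".")
```

With a guard like this, the load step could be skipped quietly instead of logging a failure for every sub-model.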

Ike-yang avatar Aug 05 '19 10:08 Ike-yang

I found that it hasn't stopped; it just isn't printing anything.

fireholder avatar Aug 05 '19 11:08 fireholder

Yeah, I left it running all night, but val_loss is always 0 in logs.txt. There must be something wrong that needs to be modified.

Ike-yang avatar Aug 06 '19 02:08 Ike-yang

That's because in '_epoch_val' every val loss is set to 0; you can try uncommenting the code in '_epoch_val'. But my train loss is very large. Is it the same for you? By the way, have you tried the tester?

fireholder avatar Aug 06 '19 02:08 fireholder

Yes, extremely large train loss. Haven't tried the tester yet

Ike-yang avatar Aug 06 '19 03:08 Ike-yang

I have tried tester.py, but it doesn't work: some places need a tensor.cpu() conversion. Have you run tester.py to completion?

Ike-yang avatar Aug 07 '19 04:08 Ike-yang

Yes, just convert to tensor.cpu() as the error suggested.

fireholder avatar Aug 07 '19 08:08 fireholder

However, my test results are all the same: all my predicted captions are identical.


fireholder avatar Aug 09 '19 13:08 fireholder

I get the same caption too. Have you found the reason?


Cao-Shuang avatar Aug 10 '19 00:08 Cao-Shuang

not yet

fireholder avatar Aug 10 '19 01:08 fireholder

When I run `python tester.py`, I get:

FileNotFoundError: [Errno 2] No such file or directory: './data/new_data/debug_vocab.pkl'

ShivamPanchal avatar Sep 08 '19 17:09 ShivamPanchal

Has anyone run into a problem like this?

WARNING:tensorflow:From /content/drive/Shared drives/shared drive-zma/ACL18/utils/logger.py:15: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Traceback (most recent call last):
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 662, in debugger.train()
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 60, in train
    train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train() #???
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 402, in _epoch_train
    batch_tag_loss = self.mse_criterion(tags, self._to_var(label, requires_grad=False)).sum() # ???
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in call
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2203, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/functional.py", line 52, in broadcast_tensors
    return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (210) must match the size of tensor b (0) at non-singleton dimension 1

It really confuses me. Could anyone help? Thanks!
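The message says the predicted tags have 210 columns while the label tensor has 0, so the label vectors reaching the loss appear to be empty. Why they are empty is not stated in this thread. A minimal sketch of a check that would surface this earlier, before the loss is computed, is given below; `check_label_width` is a hypothetical helper and the plain lists stand in for real label rows.

```python
def check_label_width(labels, expected_width):
    """Raise a clear error if any label row lacks expected_width entries."""
    for i, row in enumerate(labels):
        if len(row) != expected_width:
            raise ValueError(
                f"label row {i} has {len(row)} entries, expected {expected_width}"
            )
    return True


# A well-formed batch passes; a batch with an empty row is rejected.
ok = check_label_width([[0.0] * 210, [0.0] * 210], expected_width=210)
try:
    check_label_width([[]], expected_width=210)
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

Running a check like this on the data loader output would show whether the tag labels are missing before training starts.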

CinKKKyo avatar Nov 11 '19 13:11 CinKKKyo

However, my test results are all the same: all my predicted captions are identical.

Hi @fireholder! Did you eventually give up trying to solve the issue? Were all the predicted captions always identical?

mfilipav avatar Dec 03 '19 11:12 mfilipav

My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone help? Thanks! Could it be a Python 2 versus Python 3 issue, since I used Python 3?

yangyan22 avatar Apr 26 '20 04:04 yangyan22

Yes, extremely large train loss. Haven't tried the tester yet

Hi, were you able to decrease the loss? I am facing the same issue.

AnkitMalviya avatar May 17 '20 00:05 AnkitMalviya

I get the same caption too. Have you found the reason?

I am also facing the same issue. Were you able to solve this?

AnkitMalviya avatar May 17 '20 00:05 AnkitMalviya

My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone help? Thanks! Could it be a Python 2 versus Python 3 issue, since I used Python 3?

I guess the train loss is large because the author uses MSELoss for predicting tags. If there are 156 different tags, then the squared term is on the order of (156-0)^2 = 24336. That is why the loss is so big.

You can change it to L1Loss or decrease the lambda argument for the tag loss (if you find that reasonable).
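To see how much each option could help, here is a small plain-Python sketch that compares summed squared error with summed absolute error over 156 tags. The per-tag error values and `lambda_tag` are made up for illustration, and the functions are stand-ins written for this sketch, not the repository's loss code. One thing it makes visible: the absolute error is only smaller than the squared error when the per-tag error exceeds 1.

```python
def summed_losses(error, num_tags):
    """Sum squared and absolute error when every tag is off by `error`."""
    diffs = [error] * num_tags
    return {
        "mse_sum": sum(d ** 2 for d in diffs),
        "l1_sum": sum(abs(d) for d in diffs),
    }


num_tags = 156  # tag count quoted in the comment above

small = summed_losses(0.5, num_tags)  # per-tag error below 1
large = summed_losses(3.0, num_tags)  # per-tag error above 1

# Scaling the tag term with a smaller weight (hypothetical value).
lambda_tag = 0.01
weighted_large = lambda_tag * large["mse_sum"]
```

For a per-tag error of 0.5 the summed squared error is 39.0 against 78.0 for the absolute error, while for 3.0 it is 1404.0 against 468.0. So whether switching to an L1-style loss shrinks the number depends on how large the individual errors are, and lowering the weight scales the term in either case.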

Alsalivan avatar Sep 04 '20 14:09 Alsalivan

In the debugger.py and tester.py files of this project, I get an error at the third-to-last line of the following section of code:

`tag_loss += self.args.lambda_tag * batch_tag_loss.data`
`stop_loss += self.args.lambda_stop * batch_stop_loss.data`
`word_loss += self.args.lambda_word * batch_word_loss.data`
`loss += batch_loss.data`
`return tag_loss, stop_loss, word_loss, loss`

The error is:

File "D:/Hareem/Auto_report/debugger.py", line 61, in train
  train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train()
File "D:/Hareem/Auto_report/debugger.py", line 424, in _epoch_train
  word_loss += self.args.lambda_word * batch_word_loss.data
AttributeError: 'int' object has no attribute 'data'
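The AttributeError means `batch_word_loss` was a plain Python int at that point rather than a tensor. One possible cause, which is only a guess here, is that it was initialized to 0 and never reassigned for that batch. A minimal sketch of an accumulation that tolerates both cases follows; `to_scalar` and `FakeTensor` are hypothetical names, and `FakeTensor` only mimics the `item()` method of a one-element tensor so the sketch runs without PyTorch.

```python
def to_scalar(value):
    """Return a plain float from a tensor-like object or a Python number."""
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)


class FakeTensor:
    """Stand-in for a one-element tensor exposing item()."""

    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v


lambda_word = 1.0  # hypothetical weight
word_loss = 0.0
# One batch yields a tensor-like loss, the other is still the int 0.
for batch_word_loss in (FakeTensor(2.5), 0):
    word_loss += lambda_word * to_scalar(batch_word_loss)
```

This avoids the crash, but if `batch_word_loss` stays at 0 for every batch it is worth checking why the word loss is never computed.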

Hareem1997 avatar Sep 20 '20 08:09 Hareem1997

Has anybody solved the problem of the predicted captions all being the same?

domyown avatar Dec 08 '21 12:12 domyown