
Training on FUNSD: CUDA out of memory on a GPU with 12 GB of memory.

Open asidharth019 opened this issue 3 years ago • 6 comments

First of all, congratulations to the entire team on the amazing work.

I was trying to train SPADE on the FUNSD dataset on a GPU with 12 GB of memory (GeForce RTX 2080 Ti), but I am getting:

    RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 10.75 GiB total capacity; 9.14 GiB already allocated; 24.25 MiB free; 9.42 GiB reserved in total by PyTorch)

Is it at all possible to train SPADE on a GPU with 12 GB of memory? A comment in another issue says that it needs a GPU with at least 24 GB: https://github.com/clovaai/spade/issues/2#issuecomment-915036284
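For what it's worth, this is how I confirmed which device PyTorch sees (a quick generic snippet, nothing SPADE-specific):

    import torch

    # print the device name and its total memory in GiB
    print(torch.cuda.get_device_name(0))  # e.g. "GeForce RTX 2080 Ti"
    print(torch.cuda.get_device_properties(0).total_memory / 2**30)  # ~10.75 on an 11 GB card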

Any help would be appreciated. Thanks!

asidharth019 avatar Jan 17 '22 17:01 asidharth019

Hi @asidharth019

You may turn off relative_attention (see this comment).

Or, you may use a smaller encoder, for example bert-base-multilingual-cased with only 3 or 4 layers (please refer to this comment).
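For illustration, a truncated encoder along those lines can be built with Hugging Face transformers. This is only a sketch (it is not SPADE's own checkpoint-loading code, and the layer count is just an example):

    from transformers import BertConfig, BertModel

    # sketch: keep only the first 3 transformer layers of multilingual BERT
    config = BertConfig.from_pretrained("bert-base-multilingual-cased")
    config.num_hidden_layers = 3
    # from_pretrained copies the pretrained weights for the layers that remain
    model = BertModel.from_pretrained("bert-base-multilingual-cased", config=config)
    print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")

A 3-layer encoder also stores far fewer activations during the backward pass, which is usually what matters for fitting training into 11 GB.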

Good luck!

whwang299 avatar Jan 17 '22 22:01 whwang299

Thanks for the above solution. I was able to run the training. 🙂

I am facing a few issues with the output for FUNSD:

  • I have trained using the funsd config, but I am not getting output in the expected format shown in the CORD example. Why are we getting a dictionary within the list? Also, there are not many linked entities. The output for the FUNSD sample "data_id": "83594639" is:

    [{"{'qa.question': 'Date:'}": [[{'qa.answer': 'September 15: 1997'}]]},
     [{'qa.question': 'Company:'}],
     [{'qa.question': 'From:'}],
     [{'qa.question': 'DATA'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Fax'}],
     [{'qa.question': 'Advertising'}],
     [{'qa.question': 'Media'}],
     [{'other.other': '7707'}]]

  • It is mentioned that we should not use the Val score for model selection. Please guide me on what to use for model selection instead.

asidharth019 avatar Jan 22 '22 16:01 asidharth019

  • The reason the FUNSD output format differs from CORD's is the difference in the depth of information; please refer to Table 2 of the paper. Also, FUNSD examples often consist of documents that are not fully filled in. Check the original document image.

  • Also, if the parse above represents the "prediction", check the ground-truth output first.

  • By the way, set toy_data: false in the config. See https://github.com/clovaai/spade/blob/a85574ceaa00f1878a23754f283aa66bc2daf082/configs/funsd.1.5layers.train.yaml#L79

  • For model selection, use "early stopping"; a minimal sketch follows below.
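To be concrete, "early stopping" here just means monitoring a metric and stopping once it stops improving for a while. A minimal sketch (the monitored metric, the patience value, and the helper names are illustrative assumptions, not SPADE's actual trainer API):

    class EarlyStopping:
        """Stop training once the monitored score has not improved for `patience` epochs."""

        def __init__(self, patience=10, min_delta=0.0):
            self.patience = patience
            self.min_delta = min_delta
            self.best = float("-inf")
            self.bad_epochs = 0

        def step(self, score):
            # return True when training should stop
            if score > self.best + self.min_delta:
                self.best = score
                self.bad_epochs = 0
            else:
                self.bad_epochs += 1
            return self.bad_epochs >= self.patience

    # usage sketch; train_one_epoch() and evaluate() are placeholders
    # stopper = EarlyStopping(patience=10)
    # for epoch in range(max_epochs):
    #     train_one_epoch()
    #     if stopper.step(evaluate()):  # e.g. monitor an edge-F1 score
    #         break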

Best,

whwang299 avatar Jan 22 '22 22:01 whwang299

I have already set toy_data: false. What would be the best way to apply "early stopping" in this training process?

Sample counts:

  • Train: 149
  • Val: 8
  • Held-out test: 50

The above split follows the given train YAML, and the FUNSD data was generated using the provided preprocessing script.

On Train and Val I am getting decent ELK performance, but the performance on the held-out test set is very bad. Score dict for the held-out test:

    {"test__avg_loss": 1.0109899044036865,
     "test__f1": -1,
     "test__precision_edge_avg": 0.26012873043052837,
     "test__recall_edge_avg": 0.09196836541370143,
     "test__f1_edge_avg": 0.1347639744054087,
     "test__precision_edge_of_type_0": 0.37181996086105673,
     "test__recall_edge_of_type_0": 0.14822244511311713,
     "test__f1_edge_of_type_0": 0.2119521912350598,
     "test__precision_edge_of_type_1": 0.1484375,
     "test__recall_edge_of_type_1": 0.03571428571428571,
     "test__f1_edge_of_type_1": 0.05757575757575757,
     "p_r_f1_entity": [[0.3888888888888889, 0.1721311475409836, 0.23863636363636365],
                       [0.8102941176470588, 0.5116063138347261, 0.6272054638588503],
                       [0.7383367139959433, 0.44336175395858707, 0.5540334855403348],
                       [0.5359477124183006, 0.26282051282051283, 0.35268817204301073]],
     "p_r_f1_all_entity_ELB": [0.7376811594202899, 0.4365351629502573, 0.5484913793103449],
     "p_r_f1_link_ELK": [0.3392857142857143, 0.03571428571428571, 0.06462585034013606]}

Please advise.

asidharth019 avatar Jan 23 '22 07:01 asidharth019

Hi @asidharth019

Sorry for the late reply. You may try increasing the number of training epochs. As far as I remember, you should get near 100% accuracy on the training set.

Also, please be aware that in the case of FUNSD, the validation set is a subset of the training set. See the Model/Training section of the README.
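If you want a validation score that is not optimistic, one option is to carve a disjoint validation split out of the training files yourself, e.g. (a sketch; the 149-sample count comes from your message above, and the split size is arbitrary):

    import random

    # hold out a disjoint validation set from the 149 FUNSD training samples
    random.seed(0)
    indices = list(range(149))
    random.shuffle(indices)
    val_ids, train_ids = indices[:15], indices[15:]
    assert not set(val_ids) & set(train_ids)  # the two splits share no sample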

Wonseok

whwang299 avatar Mar 20 '22 00:03 whwang299

If relative_attention is turned off, are the highlighted results mentioned in the paper meaningless?

DYF-AI avatar Jun 22 '22 15:06 DYF-AI