
Validation (prediction) phase: server hangs.

Open HarveyYi opened this issue 2 years ago • 6 comments

I can train the model in the first phase, but when it comes to validation, the server gets stuck.


The server configuration is as follows:

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Stepping:              7
CPU MHz:               1408.689
CPU max MHz:           3500.0000
CPU min MHz:           1000.0000

GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
| 30%   41C    P8    35W / 350W |  17947MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 30%   37C    P8    32W / 350W |  13549MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

If I run the program in the state shown above, it hangs with high CPU and memory usage and low GPU usage.

HarveyYi avatar Feb 21 '23 08:02 HarveyYi

I've just managed to reproduce the prediction step. I had to move every tensor to the GPU, because they were defaulting to the CPU. I don't know how it worked in the original research work... However, the memory usage problem is due to the fact that when the predict method is called, every predicted tensor is kept in memory, and each of them is very heavy. To solve this, I modified the inference procedure to loop over small batches of data and decode them as they are produced (the decoded version is a lot smaller); see the sketch below.

This problem is also caused by the fact that all of the data (including the train set, which is 3x the size of the eval set) is loaded, even if you only need the eval set.
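Roughly, the change looks something like this, assuming a transformers-style T5 model and tokenizer; the names (predict_in_batches, model, tokenizer, eval_dataset) are just placeholders, not the exact code from my fork:

```python
import torch
from torch.utils.data import DataLoader

def predict_in_batches(model, tokenizer, eval_dataset, batch_size=8, max_length=512):
    """Generate predictions batch by batch, decoding immediately so only
    small strings (not large output tensors) are kept in memory."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    loader = DataLoader(eval_dataset, batch_size=batch_size)
    predictions = []
    with torch.no_grad():
        for batch in loader:
            # Move only the current batch's tensors to the GPU.
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            output_ids = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=max_length,
            )

            # Decode right away and keep the strings; the heavy output
            # tensors are freed at the end of each iteration.
            predictions.extend(
                tokenizer.batch_decode(output_ids, skip_special_tokens=True)
            )
    return predictions
```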

gianfrancodemarco avatar Mar 01 '23 08:03 gianfrancodemarco

I have encountered the same problem, and even 125.50 GB of RAM is not enough. I would like to know which data you are storing on the GPU. Could you please provide more detailed modifications? Thank you very much.

zhongfansun avatar May 31 '23 08:05 zhongfansun

You can find them here and in the rest of the repo: https://github.com/gianfrancodemarco/mm-cot/blob/main/src/data/scienceQA/dataset_std.py
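As a rough sketch of the idea (not the actual code in that file): the dataset loads only the eval split and returns tensors already placed on the device, so nothing extra has to be moved inside the predict loop. Class and field names below are made up for illustration, and keeping tensors on the GPU like this assumes a single-process DataLoader (num_workers=0):

```python
import torch
from torch.utils.data import Dataset

class ScienceQAEvalDataset(Dataset):
    """Holds only the eval split and returns tensors already on the target device."""

    def __init__(self, encoded_examples, image_features, device="cuda"):
        # encoded_examples: list of dicts with "input_ids" / "attention_mask"
        # image_features: tensor of per-example vision features
        self.examples = encoded_examples
        self.image_features = image_features
        self.device = torch.device(device)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        return {
            "input_ids": example["input_ids"].to(self.device),
            "attention_mask": example["attention_mask"].to(self.device),
            "image_ids": self.image_features[idx].to(self.device),
        }
```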

gianfrancodemarco avatar May 31 '23 08:05 gianfrancodemarco

I don't know why it doesn't work for me. I replaced the ScienceQADatasetStd and ScienceQADatasetImg classes entirely with the ones you provided, but the same problem occurred.

zhongfansun avatar May 31 '23 09:05 zhongfansun

I am studying the fork you provided. Could you provide the run configuration for the ScienceQA dataset used in https://github.com/gianfrancodemarco/mm-cot/blob/main/experiments/run_experiments.py? Looking forward to your reply. Thank you very much.

zhongfansun avatar Jun 01 '23 06:06 zhongfansun

@zhongfansun I don't think you need to use run_experiments.py. You'll find the relevant configurations here: https://github.com/gianfrancodemarco/mm-cot/blob/main/.vscode/launch.json

gianfrancodemarco avatar Jun 01 '23 22:06 gianfrancodemarco