
Getting "Killed" error when run on kp20k Dataset

Open nanthamanish opened this issue 3 years ago • 5 comments

[screenshot: terminal output ending with the "Killed" error]

When I run `python exp.py --gpu 0 --dir_data ../data/kp20k`, the process dies with a "Killed" error. While debugging, I found that it happens at `return torch.tensor(padded_tensor, dtype=torch.float32)` (line 66 of `src/model_att/model.py`).

Is there a way to fix this?

nanthamanish avatar Jun 26 '21 12:06 nanthamanish

Hi, this looks like a resource issue, assuming the job was not manually terminated by a server admin. Could you please check the memory usage and the swap space? If it was due to a lack of resources, you can try reducing the batch size at this line. Please give it a try and let us know if the issue persists, and we will see whether we can optimize the resource usage (sorry, the experimental code is a bit messy right now). Thanks!
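For example, checking free RAM and swap from Python could look like this (using psutil; `free -h` from the shell gives the same information):

```python
# Quick check of free RAM and swap before launching exp.py.
import psutil

vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM : {vm.available / 2**30:.1f} GiB available of {vm.total / 2**30:.1f} GiB")
print(f"Swap: {sw.free / 2**30:.1f} GiB free of {sw.total / 2**30:.1f} GiB")
```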

xgeric avatar Jun 26 '21 18:06 xgeric

Hi! I did try reducing the batch size (from 2048 down to 4), but I got the same error. On kp20k, the numpy array has shape (6529257, 36, 10, 10), i.e. 23,505,325,200 elements in total, and the process gets killed when converting it to a torch tensor. Let me know if there is an optimization that would let it run on my system. Thanks :)
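For scale, a back-of-the-envelope estimate of what that array costs in memory (assuming the padded maps are stored as float32; `torch.tensor` additionally copies the buffer, so the peak usage is even higher):

```python
import numpy as np

shape = (6529257, 36, 10, 10)        # padded attention maps for all kp20k instances
n_elements = int(np.prod(shape))     # 23,505,325,200 elements
bytes_fp32 = n_elements * 4          # 4 bytes per float32
print(f"{n_elements:,} elements ≈ {bytes_fp32 / 1e9:.0f} GB as float32")
# torch.tensor(padded_tensor, ...) copies the data, so both the numpy array and
# the new tensor exist at the same time, roughly doubling the peak footprint.
```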

nanthamanish avatar Jun 29 '21 04:06 nanthamanish

Thank you for the feedback! Sorry, it is not the batch size that matters here: at this step we load all training instances into memory, so it may exceed the memory limit. To verify this, you can check the memory usage, or add `instances = instances[:100]` and see if the error goes away. For now, it would help if you can test it on a machine with more memory.

xgeric avatar Jun 29 '21 05:06 xgeric

Hi! Actually, I also run into this problem on the KPTimes dataset. I used psutil to monitor the memory, and I'm really confused about why 76 GB of free memory is not enough. I changed the `pad_attention_maps` function in `src/model_att/model.py` as follows: [screenshot of the modified function]
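The screenshot is not reproduced here, but a modification along the lines described, padding into a preallocated array with psutil memory logging around each step, might look roughly like this (the signature, shapes, and variable names are assumptions, not the code from the image):

```python
import os
import numpy as np
import psutil
import torch

def log_mem(tag):
    # Print this process's resident set size to see where memory usage jumps.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] RSS = {rss / 2**30:.1f} GiB")

def pad_attention_maps(attention_maps, padded_shape=(36, 10, 10)):
    # attention_maps: list of per-instance numpy arrays, each no larger than
    # padded_shape along every axis (shapes here are assumptions).
    log_mem("before allocation")
    padded_tensor = np.zeros((len(attention_maps), *padded_shape), dtype=np.float32)
    for i, att in enumerate(attention_maps):
        d0, d1, d2 = att.shape
        padded_tensor[i, :d0, :d1, :d2] = att
    log_mem("after filling padded_tensor")     # the ~40 GB jump reported below
    final = torch.tensor(padded_tensor, dtype=torch.float32)  # copies the array
    log_mem("after torch.tensor")              # the copy is where the OOM kill hits
    return final
```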

And I got the following result: [screenshot of the memory log]. It showed that the assignment of `padded_tensor` cost 40 GB, which is also a bit confusing to me. Then `final = torch.tensor(padded_tensor, dtype=torch.float32)` ran out of memory and caused the program to be killed.

possible1402 avatar Aug 26 '21 12:08 possible1402

I think it is mainly because we load the entire training dataset into memory for the experiments. It wasn't a bottleneck on our server, but it can be an issue on machines without sufficient free memory. We will try to find time to fix it.
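Until that fix lands, one possible direction is to avoid materializing all padded maps up front and instead pad lazily, one batch at a time, inside a `Dataset`; a rough sketch under assumptions about how the instances are stored (not the repo's actual API):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class AttentionMapDataset(Dataset):
    # Keeps the raw (unpadded) per-instance maps and pads lazily in __getitem__,
    # so only batch_size padded maps live in memory at any time.
    def __init__(self, attention_maps, padded_shape):
        self.maps = attention_maps      # list of small per-instance numpy arrays
        self.padded_shape = padded_shape

    def __len__(self):
        return len(self.maps)

    def __getitem__(self, idx):
        att = self.maps[idx]
        out = np.zeros(self.padded_shape, dtype=np.float32)
        out[:att.shape[0], :att.shape[1], :att.shape[2]] = att
        return torch.from_numpy(out)

# Example usage (hypothetical shapes):
# loader = DataLoader(AttentionMapDataset(maps, (36, 10, 10)), batch_size=256)
```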

xgeric avatar Aug 26 '21 18:08 xgeric