visdial-challenge-starter-pytorch
Shared memory issues with parallelization
Hi @kdexd
I am running into all kinds of shared memory errors after this commit 9c1ee36b85c2c63d554471cac2825cf0b9cf2efd
https://github.com/pytorch/pytorch/issues/8976 https://github.com/pytorch/pytorch/issues/973
I guess this parallelization is not stable; sometimes it runs fine, while other times it breaks, even after trying the suggested workarounds, such as:
import resource
import torch.multiprocessing

# Workaround 1: share tensors via the file system instead of file descriptors
torch.multiprocessing.set_sharing_strategy('file_system')

# Workaround 2: raise the soft limit on open file descriptors
# https://github.com/pytorch/pytorch/issues/973
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048 * 4, rlimit[1]))
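For what it's worth, the file-descriptor workaround can be wrapped in a small helper called once before constructing the DataLoader. This is just a sketch; the helper name and default limit are arbitrary, and it caps the request at the hard limit so an unprivileged process does not error out:

```python
import resource

def raise_file_limit(soft_target=8192):
    """Raise the soft RLIMIT_NOFILE toward the hard limit.

    'received 0 items of ancdata' often means the process ran out of
    file descriptors while receiving shared-memory handles.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # an unprivileged process cannot exceed the hard limit
        soft_target = min(soft_target, hard)
    new_soft = max(soft, soft_target)  # never lower the current limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft

new_limit = raise_file_limit()
```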
Is there a leak somewhere? Might be best to have a look.
Thanks for reporting this! I will try to reproduce this on my end and see where it breaks.
My hunch was that if we use parallelization in PyTorch's data loader and also do multiprocess tokenization, we get those errors. Basically, tokenization ends up running in (cpu_workers * cpu_workers) processes (?) and thus eats up shared memory (?).
I have removed this multiprocess tokenization and running some experiments. Will let you know how it goes. Your suggestion is also appreciated.
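For clarity, a toy sketch of that change: replace the multiprocessing pool with a plain loop in the reader. The tokenizer below is a stand-in for illustration, not the project's actual word tokenizer:

```python
def tokenize(caption):
    # stand-in for the real tokenizer used by the reader
    return caption.lower().split()

captions = ["A man riding a horse", "Two dogs playing in the park"]

# before (roughly): with Pool(cpu_workers) as p: tokens = p.map(tokenize, captions)
# after: a plain loop in one process, so the DataLoader workers
# are the only extra processes competing for shared memory
tokens = [tokenize(c) for c in captions]
```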
My hunch was that if we use parallelization in PyTorch's data loader and also do multiprocess tokenization, we get those errors. Basically, tokenization ends up running in (cpu_workers * cpu_workers) processes (?) and thus eats up shared memory (?).
Both of these happen at different times: all the tokenization happens in the reader even before training starts, whereas the dataloader makes batches during training.
I am unable to reproduce this so far. Could you check if it works well with 1 worker?
Yeah, I did try with 1 worker and had the same errors. (Can't use 0, because this code requires at least one worker :D)
I have removed multiprocess tokenization in my code and it works fine.
Just to let you know, it doesn't happen in the starting iterations or epochs; I guess it was after 3-5 epochs.
I think I'm hitting this.
In my setup I'm doing independent runs in parallel threads (not processes, since I'm using LevelDB, which does not support multiprocessing). Sometimes it breaks with the error:
RuntimeError: received 0 items of ancdata
Even though I'm using the workaround suggested here: https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667
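In case it helps others hitting this in a threaded setup: the sharing strategy has to be switched before any loader starts handing tensors between workers. A minimal sketch, assuming torch is installed:

```python
import torch.multiprocessing as mp

# 'file_system' names shared-memory regions by file path instead of
# passing file descriptors over sockets, so it is not bound by the
# per-process descriptor limit behind 'received 0 items of ancdata'.
# Call this once, before creating any DataLoader.
mp.set_sharing_strategy("file_system")
```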