
Memory Issue

Open saicharishmavalluri opened this issue 4 years ago • 3 comments

Hello @guxd, I downloaded the real dataset from Google Drive and trained the model for 2 epochs, which worked fine. But when I run code embedding with the last epoch as the optimal checkpoint, the cell terminates after running for some time. Searching online suggests this may be a RAM issue, with the usual advice being to upgrade the RAM. Is there another workaround, such as decreasing batch_size, chunk_size, or some other parameter? (Currently 'batch_size': 100, 'chunk_size': 100000.)

Update: I tried decreasing batch_size to 100 and then 64, but I still face the same issue.

(screenshot: codeembed_error)

saicharishmavalluri avatar Oct 22 '21 01:10 saicharishmavalluri
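Before tuning parameters, it helps to estimate how much memory the accumulated code vectors actually need. A rough back-of-envelope sketch, where the 16 million snippets and the 512-dimensional float32 embedding are assumptions for illustration, not DeepCS's actual figures:

```python
# Estimate the memory held by the accumulated code vectors.
# All concrete numbers below are assumptions: the snippet count
# and embedding dimension depend on the dataset and model config.
n_snippets = 16_000_000      # hypothetical size of the codebase
dim = 512                    # hypothetical embedding dimension
bytes_per_float = 4          # float32

total_bytes = n_snippets * dim * bytes_per_float
print(f"{total_bytes / 1024**3:.1f} GiB")  # ~30.5 GiB
```

If the estimate exceeds available RAM, no choice of batch_size alone will help, since batch_size only controls how many vectors are computed per step, not how many are kept in memory overall.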

How about reducing chunk_size? You can track the variable vecs and check whether it still holds allocated memory after vecs = [] is called. You can also try using a smaller codebase, given your limited memory.

guxd avatar Oct 22 '21 05:10 guxd
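One way to act on this advice is to flush each chunk of vectors to a disk-backed array instead of accumulating everything in the vecs list. The sketch below is a minimal illustration, not DeepCS's actual code: the encode callable stands in for the model's embedding step, and the dataset layout is assumed.

```python
import numpy as np

def embed_in_chunks(encode, codebase, out_path, dim, chunk_size=10_000):
    """Embed a codebase chunk by chunk, writing each chunk to a
    disk-backed .npy array so that `vecs` never holds more than
    chunk_size vectors at once. `encode` is a hypothetical stand-in
    for the model's embedding call."""
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=np.float32,
        shape=(len(codebase), dim))
    for start in range(0, len(codebase), chunk_size):
        chunk = codebase[start:start + chunk_size]
        vecs = np.asarray(encode(chunk), dtype=np.float32)
        out[start:start + len(vecs)] = vecs
        vecs = None  # drop the reference so the chunk can be freed
    out.flush()
    return out_path
```

With this pattern, peak memory is bounded by one chunk of vectors plus the model itself, so lowering chunk_size directly lowers peak RAM.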

> How about reducing chunk_size? You can track the variable vecs and check whether it still holds allocated memory after vecs = [] is called. You can also try using a smaller codebase, given your limited memory.

@guxd When you say a small codebase, do you mean using a dummy dataset instead of the real one? Also, during the preprocessing step, how did you extract the <method name, API sequence, tokens, description> tuples from the Java code snippets?

saicharishmavalluri avatar Oct 22 '21 22:10 saicharishmavalluri

I mean using a subset of the use.XXX.h5 files from Google Drive, for example only 1 million code snippets.

guxd avatar Oct 24 '21 06:10 guxd
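Taking such a subset can be done by copying only the first n rows of each dataset into a new file. A minimal sketch using h5py, assuming the .h5 files expose plain datasets; the exact internal layout of DeepCS's use.XXX.h5 files may differ, so treat this as a starting point:

```python
import h5py

def take_subset(src_path, dst_path, n=1_000_000):
    """Copy the first n rows of every top-level dataset in an HDF5
    file into a new, smaller file. Intended as a sketch for shrinking
    a large use.XXX.h5 to a size that fits in memory."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for name in src:
            dst.create_dataset(name, data=src[name][:n])
```

Note that if the file stores variable-length sequences as a flat data array plus an index/offset table (as some DeepCS releases do), the subset must cut both arrays consistently at a sequence boundary rather than at a fixed row count.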