pytorch_RVAE

train.py memory problem

Open transfluxus opened this issue 7 years ago • 7 comments

Is there a way to use a word embedding generated with something else (gensim, for example)? This implementation dies after a while on my relatively large data set (on a machine with 32 GB of memory).

transfluxus avatar Apr 25 '17 06:04 transfluxus

What do you mean by "dies after a while"? There are no restrictions on the nature of the word embeddings: you just have to save them in the appropriate file and the Embedding module will pick them up.

kefirski avatar Apr 25 '17 06:04 kefirski

It says 'killed' after 20 minutes at most.

transfluxus avatar Apr 25 '17 08:04 transfluxus

The output of train is several files: characters_vocab.pkl, train_character_tensor.npy, train_word_tensor.npy, valid_word_tensor.npy, words_vocab.pkl, valid_character_tensor.npy and word_embeddings.npy. Which one do I need for the next steps?

transfluxus avatar May 01 '17 11:05 transfluxus

I think it "dies after a while" because seq_len is too long. I have run into this a few times, and it was fine after I reduced the length of each sentence in the corpus.

xushenkun avatar Jul 04 '17 07:07 xushenkun

Interesting. It was a while ago, so I don't remember whether I used a sentence or a whole document as a "sentence". But I guess I used sentences, so how would I chop them?

transfluxus avatar Jul 04 '17 12:07 transfluxus

@transfluxus I used a Chinese corpus, and each sentence had to be shorter than 300 words or it crashed. I think it should be less than 1000 words for an English corpus. I just split each sentence wherever I encountered a comma or full stop.

xushenkun avatar Jul 13 '17 00:07 xushenkun
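
A minimal sketch of the splitting xushenkun describes, assuming whitespace-tokenized text (a Chinese corpus would need word segmentation first). `split_long_sentence` and `max_words` are hypothetical names, not part of pytorch_RVAE, and the punctuation rule is naive: it also splits at the dot in a number such as "3.14".

```python
import re

# Hypothetical helper, not part of pytorch_RVAE.
def split_long_sentence(sentence, max_words=100):
    """Split at commas / full stops, then pack clauses into chunks
    of at most max_words whitespace-separated words."""
    # Split right after ASCII or CJK commas / full stops,
    # keeping the punctuation attached to its clause.
    clauses = [c.strip() for c in re.split(r"(?<=[,.，。])", sentence) if c.strip()]

    chunks, current = [], []
    for clause in clauses:
        words = clause.split()
        # A single clause longer than max_words is hard-cut.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this clause would overflow the current one.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in split_long_sentence("a b c, d e. f g h i", max_words=3):
        print(chunk)
```

Packing several short clauses into one chunk keeps the number of training sentences from exploding while still capping the sequence length.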

I limited the sentence length to 100; it still doesn't run through. Actually, train_word_embedding already fails. Loading the whole corpus and then creating multiple representations of it is not really practical if your corpus has a realistic size (4.2 million sentences in my case). It has to be streamed.

transfluxus avatar Jul 17 '17 19:07 transfluxus
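
A rough sketch of what streaming could look like, assuming a plain-text corpus with one sentence per line. All names here (`stream_sentences`, `build_vocab`, `batches`) are hypothetical and do not exist in pytorch_RVAE; this only illustrates the idea of never holding the full corpus in memory, not how the repository's loader works.

```python
from collections import Counter
from itertools import islice

# Hypothetical helpers, not part of pytorch_RVAE.
def stream_sentences(path, max_words=100):
    """Yield one tokenized sentence at a time, truncated to max_words."""
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object is read lazily, line by line
            words = line.split()
            if words:
                yield words[:max_words]


def build_vocab(path, max_size=50000):
    """First pass over the corpus: count words without storing sentences."""
    counts = Counter()
    for words in stream_sentences(path):
        counts.update(words)
    return [word for word, _ in counts.most_common(max_size)]


def batches(sentences, batch_size):
    """Group any iterable of sentences into lists of at most batch_size."""
    it = iter(sentences)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With this layout, memory use is bounded by the vocabulary and one batch rather than by the corpus: `build_vocab` makes one pass, and a training loop can then iterate over `batches(stream_sentences(path), batch_size)` on each epoch.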