
How long do you need to wait for the data preprocessing before the GPU gets used?

Open datduong opened this issue 6 years ago • 5 comments

How long do you need to wait for the data preprocessing before the GPU gets used? I am training on a data set of 58,000 .txt files, and after nearly 30 minutes I still do not see my GPU being used; its activity level is at 0%.

Is it possible to do only the data preprocessing on a strong CPU-only server, and then load the saved pickle on a GPU node?

Thanks.

datduong avatar Apr 08 '18 21:04 datduong

Whether the GPU is used depends on your setup, in particular on your TensorFlow/Theano installation. If you're using TensorFlow, make sure it is configured to use your GPU; if it is, Magpie will pick it up automatically. You can find more info on setting up your GPU with TF here and here.
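A quick sanity check (a minimal sketch, assuming a TensorFlow 1.x installation, which is what Magpie targets) is to ask TensorFlow which devices it can actually see:

```python
# Minimal sketch, TensorFlow 1.x: list the devices TF can see.
# If no GPU device shows up here, the tensorflow-gpu/CUDA setup is not
# working and Magpie will silently fall back to the CPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print([d.name for d in device_lib.list_local_devices()])  # expect something like '/device:GPU:0'
print(tf.test.is_gpu_available())                          # True if TF can actually use a GPU
```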

jstypka avatar Apr 09 '18 08:04 jstypka

Hi, sorry for not being clear. I already installed the tensorflow-gpu version, and when I tested the machine with pytorch I do see the GPU being used. When I run Magpie, however, I don't see the GPU being used at all. I suspect the data preprocessing may be taking a long time on the CPU, so I wanted a ballpark figure for how long the code spends on data preprocessing (i.e. preparing the vocab dictionary) before it reaches the GPU steps.

Thanks.

datduong avatar Apr 10 '18 19:04 datduong

Building word2vec embeddings can take quite a while (more than 30 minutes for a large corpus on a laptop). You only need to do it once though; store the embeddings afterwards so that you don't repeat this step. Once your vectors are built and loaded, going straight to batch training shouldn't take long, a couple of minutes at most.
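A minimal sketch of doing that step separately with plain gensim and saving the result (the directory and file names are placeholders, and the `size` argument is called `vector_size` in newer gensim versions):

```python
# Sketch: train word2vec once (e.g. on a stronger CPU server), save it,
# and reload it later instead of rebuilding the embeddings every run.
import os
from gensim.models import Word2Vec

def iter_sentences(corpus_dir):
    # One document per .txt file, naive whitespace tokenisation.
    for fname in os.listdir(corpus_dir):
        if fname.endswith('.txt'):
            with open(os.path.join(corpus_dir, fname)) as f:
                for line in f:
                    tokens = line.strip().split()
                    if tokens:
                        yield tokens

corpus_dir = 'data/my-corpus'                        # placeholder path
sentences = list(iter_sentences(corpus_dir))         # held in memory for simplicity
model = Word2Vec(sentences, size=100, min_count=5, workers=8)  # 'size' in gensim 3.x
model.save('my_word2vec.model')

# Later, on the GPU node:
model = Word2Vec.load('my_word2vec.model')
```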

jstypka avatar Apr 10 '18 20:04 jstypka

Sorry for the late reply, and thanks for this response. I guess the slow part is training w2v, so I will do this step separately on a stronger server. Is it "easy" to change the code so that it takes pre-trained w2v or GloVe word vectors? I guess "easy" is loosely defined here.

datduong avatar Apr 17 '18 18:04 datduong

If you use the gensim w2v model, it's very easy to pass it in as a parameter. If you want to pass in some other implementation of the embeddings, it's not officially supported, but it is possible to hack in thanks to Python's ultra-flexible type system (duck typing). This code uses the w2v model - if you pass in something that exposes the same API, it should work.
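For pre-trained GloVe vectors specifically, a common trick is to wrap them in a gensim object so they expose (most of) the same API as a trained word2vec model. A minimal sketch; the GloVe file names are placeholders and the `word2vec_model` constructor argument is assumed from Magpie's README rather than verified here:

```python
# Sketch: load pre-trained GloVe vectors through gensim.
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.w2v.txt')    # convert to word2vec text format
vectors = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')

# If Magpie only looks up vectors by key, passing this object the same way
# as a gensim Word2Vec model may be enough:
from magpie import Magpie
magpie = Magpie(word2vec_model=vectors)   # 'word2vec_model' is an assumed parameter name
```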

jstypka avatar Apr 17 '18 20:04 jstypka