
Taking Very Long

Open raequan opened this issue 5 years ago • 13 comments

Running BERT on my computer is taking extremely long; I have only reached 2.4K steps after 12 hours. Are there any ways to speed this up? This is also the only application I have running, since my computer says I have no application space when I run more than 2 scripts at a time.

raequan avatar May 14 '20 19:05 raequan

Yes, BERT is about 100x slower than the other models we have used. Training the model to convergence will take several days on a normal laptop.

Fortunately, you are not required to train to convergence. You only have to train long enough to reach the accuracy levels in the assignment, which should take 2-3 hours with a good choice of hyperparameters.

mikeizbicki avatar May 14 '20 19:05 mikeizbicki
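One common way to make BERT fine-tuning fit a laptop budget (a sketch of a general technique, not necessarily what this assignment's starter code does) is to freeze most of the pretrained layers so that backpropagation only updates the classification head. A minimal PyTorch helper, using a toy model as a stand-in for BERT:

```python
import torch.nn as nn

def freeze_all_but_last(model: nn.Module, n_trainable: int = 1) -> None:
    """Disable gradients for every top-level child except the last
    n_trainable, so the optimizer only updates the head layers."""
    for child in list(model.children())[:-n_trainable]:
        for p in child.parameters():
            p.requires_grad = False

# Toy stand-in for a pretrained encoder plus a classification head.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 4))
freeze_all_but_last(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # only the last Linear's weight and bias: 8*4 + 4 = 36
```

This reduces the per-step backward cost roughly in proportion to the number of frozen parameters, at some cost in final accuracy.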

Do you happen to have Zoom office hours today to discuss this?

raequan avatar May 14 '20 19:05 raequan

I'm on office hours right now.

mikeizbicki avatar May 14 '20 21:05 mikeizbicki

I'm waiting to be let inside

raequan avatar May 14 '20 21:05 raequan

@mikeizbicki Do you still have any time today after office hours? I am very stuck on where to begin implementing some of the code.

benfig1127 avatar May 14 '20 22:05 benfig1127

@raequan Sorry, I had to leave for another meeting before you showed up.

@raequan @benfig1127 I'll have office hours tomorrow morning at 9am.

mikeizbicki avatar May 15 '20 03:05 mikeizbicki

@mikeizbicki sounds good, I managed to solve some of the issues but still had a few questions, so I will swing by. Thanks!

benfig1127 avatar May 15 '20 03:05 benfig1127

@mikeizbicki I have three questions:

  1. What is the lowest smoothing value we can use on TensorBoard? With smoothing set to 0.999 I am much farther from the targets than with 0.984. Does it matter which smoothing value we use, or do we need to reach the benchmarks at 0.999? I understand what the smoothing value does (in terms of plotting), but my computer is running slowly, and I expect that if the curve hits the targets at a lower smoothing value, it will eventually hit them at a higher smoothing value given more samples.

  2. How many runs do we need to have in our TensorBoard upload? Would it be fine for me to upload only the light blue line on my tensorboard.dev?

  3. My loss value increased to a single peak and then slowly started decreasing. This indicates my loss is working, correct?

Lastly, you can see how long my BERT model is taking to train: the blue line took 22 hours to reach 20% accuracy, but it is working. I'm running nothing else. Is this normal?

Here is my tensorboard.dev so you can see what I mean: https://tensorboard.dev/experiment/yTzAPgNjRHKEjS5vybEn9A/

raequan avatar May 15 '20 06:05 raequan
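For context on question 1: TensorBoard's smoothing slider applies an exponential moving average to the raw curve, so higher weights lag further behind the raw data and need more samples to catch up. A rough sketch of the idea (TensorBoard additionally debiases the average, which this omits):

```python
def ema_smooth(values, weight=0.99):
    """Exponentially-weighted moving average, the idea behind
    TensorBoard's smoothing slider: each plotted point blends the
    running average with the newest raw value."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

print(ema_smooth([0.0, 1.0, 1.0, 1.0], weight=0.5))
# [0.0, 0.5, 0.75, 0.875] -- the smoothed curve trails the raw jump to 1.0
```

This is why a curve viewed at smoothing 0.999 sits further from the targets than the same curve at 0.984: the heavier average simply hasn't caught up yet.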

  1. Any smoothing value is fine. 0.99 would be a good choice.

  2. You only need a single run. There is no need for warm starting.

  3. Your loss value is pretty high. It should typically end up much lower than where it started. For this problem, however, I'm not grading your loss value. I'm only grading your accuracy.

  4. With better hyperparameters, you could get it to converge to the required values in just 2-3 hours. But you will not be graded on the runtime.

mikeizbicki avatar May 15 '20 06:05 mikeizbicki
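On the hyperparameter point: one of the standard knobs when fine-tuning BERT is a learning-rate schedule with linear warmup followed by linear decay. The numbers below are illustrative defaults, not the assignment's settings:

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=500, total_steps=10000):
    """Linear warmup to base_lr, then linear decay to zero -- the
    schedule popularized by the original BERT fine-tuning recipe."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(0))      # 0.0
print(lr_at_step(500))    # 2e-05 (the peak learning rate)
print(lr_at_step(10000))  # 0.0
```

Warmup avoids the large, destabilizing updates that a full learning rate would cause on a freshly initialized classification head, which is one reason well-chosen hyperparameters converge in hours rather than days.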

Thank you! This was all good to know. Last question I believe I have:

I have implemented the embed function, and my code runs. I think I got it to work, but we haven't looked at projections before. From a very poor warm-start run, I got the following projection on my tensorboard.dev:

[Screenshot: tensorboard.dev projector view, May 15 '20 12:16 AM]

Does this look correct? I plan to redo it with my working model once it reaches the benchmarks, but I want to know if this is how it should look. I think it is what we should have, because searching the labels shows points related to news articles, but I just want to be sure. Can you explain the graph?

raequan avatar May 15 '20 07:05 raequan

You'll want to use t-SNE, and you'll need to adjust the hyperparameters until you get some clusters forming.

mikeizbicki avatar May 15 '20 07:05 mikeizbicki

When I reach the benchmarks, presumably there will be clusters, correct?

raequan avatar May 15 '20 07:05 raequan

I think you could probably already see clusters if you use the t-SNE algorithm with the right parameters.

mikeizbicki avatar May 15 '20 15:05 mikeizbicki