DiDev

35 comments of DiDev

That is great news. I'm going to give it a whirl! Edit: Yes, definitely faster. Seeing a 3x improvement with the same settings!

btw, I notice that even with rayon, while all my hyper-threads are engaged, they are not hitting 100% usage. I tried increasing the thread count, but it made it slower. Not...
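
Fwiw, one thing worth trying is capping rayon at the physical core count instead of the logical (hyper-threaded) one; for compute-bound training loops, scheduling onto SMT siblings can add contention rather than throughput. A minimal sketch, assuming the repo uses rayon's global pool, and pulling in the `num_cpus` crate:

```
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // Cap the global rayon pool at the number of physical cores.
    ThreadPoolBuilder::new()
        .num_threads(num_cpus::get_physical())
        .build_global()
        .expect("global rayon pool was already initialized");

    // Any par_iter() after this point runs on the capped pool.
    let data: Vec<u64> = (0..1_000_000).collect();
    let sum: u64 = data.par_iter().sum();
    println!("sum = {sum}");
}
```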

This is after 9 hours of training. I think the words are starting to make sense. The model is 6L/6H, and the loss is continuing to decrease nicely. As I mentioned earlier, I...

The loss was hovering between 1.9 and 2.1 for the above result. I stopped it after a couple of hours to try nanogpt for comparison. Here are the observations: 1)...

Sent it. Hope you can figure it out. I'm going to run this same-settings model continuously for a couple of days to see if the loss will eventually reduce...

Ran it for 2 days or so. Unfortunately the loss hasn't fallen below 1.7 (it's consistently below 1.8). Sample output:
```
Giring city. Wherer I is thou losan, You come....
```

Not yet. That's one of the reasons I wanted to restart the training. Hope you had a chance to look into the optimizer issues. Actually, I noticed that removing the weight...

You are correct. Anyway, despite the jitter, the loss does decrease properly. I'm using learning rate decay like this:
```
let mut nlr = self.learning_rate - ((self.learning_rate - self.min_lr_rate) * (curr_step as...
```
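
In full, the decay I mean is roughly this (a self-contained sketch; `learning_rate`, `min_lr_rate`, and `curr_step` come from the snippet above, while `decay_steps` and the clamp at the floor are my assumptions, since the line is cut off before the denominator):

```
/// Linear decay from `learning_rate` down to `min_lr_rate` over
/// `decay_steps`, then held at the floor. `decay_steps` is a
/// hypothetical name; the original snippet is truncated, so the
/// exact schedule is a guess.
fn decayed_lr(learning_rate: f64, min_lr_rate: f64, curr_step: u64, decay_steps: u64) -> f64 {
    let progress = (curr_step as f64 / decay_steps as f64).min(1.0);
    learning_rate - (learning_rate - min_lr_rate) * progress
}

fn main() {
    // e.g. 3e-4 decaying to 3e-5 over 10_000 steps, then flat.
    for step in [0u64, 5_000, 10_000, 20_000] {
        println!("step {step}: lr = {:.6}", decayed_lr(3e-4, 3e-5, step, 10_000));
    }
}
```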

Maybe Medium/Substack or some other normal blog site is better. You can also serve it from this repo itself! :)

Yes, saw that :) So if I'm understanding this correctly, to add a new layer, I stop the training, set optimizer to false, increase the layer count, and restart the training...
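
Something like the sketch below is what I have in mind (field names like `n_layers` and `load_optimizer` are illustrative guesses, not the repo's actual settings):

```
// Hypothetical training config; names are illustrative only.
struct TrainConfig {
    n_layers: usize,
    load_optimizer: bool, // whether to restore saved optimizer state on resume
}

// Grow the network between runs: skip the stale optimizer state
// (its per-parameter moments no longer line up with the new shape)
// and let the added layer start from fresh initialization.
fn grow_by_one_layer(cfg: &mut TrainConfig) {
    cfg.load_optimizer = false;
    cfg.n_layers += 1;
}

fn main() {
    let mut cfg = TrainConfig { n_layers: 6, load_optimizer: true };
    grow_by_one_layer(&mut cfg);
    println!("layers = {}, load_optimizer = {}", cfg.n_layers, cfg.load_optimizer);
}
```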