Jai Mu

7 comments by Jai Mu

Assuming you're using the train.py from [nshepperd's fork](https://github.com/nshepperd/gpt-2/), try running it with `--save_every N`, where N is the number of steps between auto-saves (default 1000). For example: `python train.py...`
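A minimal sketch of what the full command might look like, assuming nshepperd's repo layout (`train.py` at the repo root, library code under `src/`), a dataset already encoded as `data.npz`, and 500 steps picked purely as an illustration:

```sh
# Fine-tune GPT-2 on data.npz, writing a checkpoint every 500 steps instead of the default 1000
PYTHONPATH=src python train.py --dataset data.npz --save_every 500
```

If I remember right, checkpoints end up under `checkpoint/run1` unless you pass a different `--run_name`.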

Did you change the "data.npz" to point to where your dataset is? Or better yet, try running the same train.py command as in the original post and just add `--save_every...`

Make sure you're properly formatting the data with a delimiter token (GPT-2's `<|endoftext|>`) between samples, otherwise it will think that it's one continuous stream and should continue like that.
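For example, here's a rough sketch of one way to build such a file; the `samples/*.txt` layout and the `training_data.txt` name are just placeholders for whatever your data actually looks like:

```sh
# Concatenate individual samples into a single training file,
# inserting GPT-2's end-of-text token between them so the model
# learns where one sample ends and the next begins
for f in samples/*.txt; do
  cat "$f"
  printf '\n<|endoftext|>\n'
done > training_data.txt
```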

On a single 1070? I don't think that's possible. I'm currently training on a 10GB dataset using the 345M model on a 3090 and it's using ~17GB of VRAM. ![345MTraining](https://user-images.githubusercontent.com/14964859/108821213-cc08af00-7604-11eb-88c3-dfa4ed16fe81.PNG)

I honestly couldn't recommend getting a 3090 _just_ for training/fine-tuning 345M gpt-2. 117M is definitely good enough for every use case (for me anyway) if your 1070 can handle that....

Thank you @jarred1989, that fixed the crashing on the first few steps. However, I'm now getting `Loss exploded to 19443163401595046756089856.00000 at step 196.00000, avg_loss=1023324389557634032730112.00000]` or `Exiting due to exception: Found Inf...`

I am using this repo, which took inspiration from here: https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B/tree/main/finetuning_repo