gpt-2
Checkpoint not generating
I ran GPT-2's train.py
on a .txt training file containing 3 stories, using the 117M-parameter model. It runs and trains the model, but once it stops it creates a checkpoint folder with a run1 folder inside, and none of these files are generated:
- checkpoint
- model-xxx.data-00000-of-00001
- model-xxx.index
- model-xxx.meta
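A quick way to confirm whether any of those files were actually written is to list the run directory with standard file APIs. This is just a minimal sketch; `checkpoint/run1` is assumed to be the folder the trainer created:

```python
import glob
import os

def checkpoint_files(run_dir):
    """List checkpoint-related files in a run directory (e.g. checkpoint/run1)."""
    patterns = ["checkpoint", "model-*.data-*", "model-*.index", "model-*.meta"]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(run_dir, pattern)))
    return sorted(found)

print(checkpoint_files(os.path.join("checkpoint", "run1")))  # [] means nothing was saved
```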
Use standard file APIs to check for files with this prefix.
Loading dataset... 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.27it/s]
dataset has 12863 tokens
Training...
[1 | 22.35] loss=3.69 avg=3.69
[2 | 40.40] loss=3.48 avg=3.58
[3 | 72.00] loss=3.34 avg=3.50
[4 | 91.34] loss=3.45 avg=3.49
[5 | 111.14] loss=3.32 avg=3.45
[6 | 130.68] loss=3.63 avg=3.48
[7 | 146.00] loss=3.35 avg=3.46
[8 | 164.12] loss=3.33 avg=3.45
[9 | 187.81] loss=3.44 avg=3.45
[10 | 212.46] loss=3.41 avg=3.44
[11 | 238.91] loss=3.35 avg=3.43
[12 | 265.70] loss=3.07 avg=3.40
[13 | 286.85] loss=3.36 avg=3.40
[14 | 309.50] loss=3.32 avg=3.39
[15 | 327.70] loss=3.26 avg=3.38
[16 | 344.01] loss=3.22 avg=3.37
[17 | 358.19] loss=3.41 avg=3.37
[18 | 371.93] loss=2.95 avg=3.35
[19 | 386.32] loss=3.19 avg=3.34
[20 | 400.90] loss=3.51 avg=3.35
[21 | 415.34] loss=3.06 avg=3.33
[22 | 430.17] loss=3.47 avg=3.34
[23 | 444.54] loss=3.06 avg=3.33
forrtl: error (200): program aborting due to control-C event
Image            PC                Routine  Line     Source
libifcoremd.dll  00007FFD7D033B58  Unknown  Unknown  Unknown
KERNELBASE.dll   00007FFDC9D6B443  Unknown  Unknown  Unknown
KERNEL32.DLL     00007FFDCC487034  Unknown  Unknown  Unknown
ntdll.dll        00007FFDCC5BD241  Unknown  Unknown  Unknown
What should I do?
I have downloaded and deleted the files 7 times.
Assuming you're using the train.py from nshepperd's fork, try running it with --save_every N,
where N is the number of steps between auto-saves (default 1000). Your log shows the run was stopped with Ctrl-C at step 23, well before the first default save at step 1000, so no checkpoint files were ever written.
For example: python train.py --dataset data.npz --save_every 10
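The interaction between --save_every and an early Ctrl-C can be sketched without TensorFlow. This is a rough sketch, not the fork's actual code: save_fn stands in for the real checkpoint writer (tf.train.Saver.save), and the step counts mirror the run above:

```python
def train_loop(total_steps, save_every, save_fn):
    """Sketch of a periodic-save cadence like train.py's --save_every."""
    saved_at = []
    step = 0
    try:
        for step in range(1, total_steps + 1):
            # ... one optimization step would run here ...
            if step % save_every == 0:
                save_fn(step)
                saved_at.append(step)
    except KeyboardInterrupt:
        save_fn(step)  # a final save on Ctrl-C, if the interrupt reaches Python
        saved_at.append(step)
    return saved_at

# With the default save_every=1000, a run stopped at step 23 saves nothing:
print(train_loop(23, 1000, lambda step: None))  # []
print(train_loop(23, 10, lambda step: None))    # [10, 20]
```

Note also that the forrtl message in your log appears to come from the Intel Fortran runtime (libifcoremd.dll) intercepting the Ctrl-C and aborting the process directly, so even an interrupt-time save in Python would likely not have run.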
Traceback (most recent call last):
  File "interactive_conditional_samples.py", line 89, in <module>
    fire.Fire(interact_model)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 471, in _Fire
    target=component.__name__)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "interactive_conditional_samples.py", line 45, in interact_model
    enc = encoder.get_encoder(model_name)
  File "U:\gpt-2\gpt-2\encoder.py", line 110, in get_encoder
    encoder = json.load(f)
  File "H:\Anaconda\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "H:\Anaconda\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "H:\Anaconda\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "H:\Anaconda\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
A new error has popped up in place of it
Did you change "data.npz" to point to where your dataset actually is? Or better yet, run the same train.py command as in the original post and just add --save_every 10
to the end of it.
Actually, I collected all the files into a single folder. And when I run the command you suggested, it gives an error related to the JSON file, the one I mentioned above.
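The "Expecting value: line 1 column 1 (char 0)" error usually means the encoder.json that json.load is reading is empty or truncated, for example from an interrupted download. A minimal diagnostic sketch, assuming the standard models/117M download layout:

```python
import json
import os

def check_encoder_json(path):
    """Classify why json.load might fail on models/<model_name>/encoder.json."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"  # a zero-byte file raises the 'char 0' error above
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as e:
        return "corrupt at char %d" % e.pos
    return "ok"

print(check_encoder_json(os.path.join("models", "117M", "encoder.json")))
```

If this reports "empty" or "corrupt", re-download just that model file rather than moving files between folders.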
As you say, I also have a question about this: should a checkpoint contain one or more .ckpt files?