
Checkpoint not generating

Open Gurkiratsinghk opened this issue 3 years ago • 6 comments

I ran the GPT-2 train.py script on a .txt training file containing 3 stories, using the 117M-parameter model. Training runs, but when it stops it only creates a checkpoint folder with a run1 folder inside it. None of these files are generated:

  • checkpoint
  • model-xxx.data-00000-of-00001
  • model-xxx.index
  • model-xxx.meta
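
One way to confirm which of these files are actually present is a short standard-library check. This is only a sketch: the directory checkpoint/run1 is taken from the description above, and the helper name find_checkpoint_files is made up for illustration.

```python
from pathlib import Path


def find_checkpoint_files(run_dir):
    """Group files in run_dir by the kinds of checkpoint file listed above."""
    run_dir = Path(run_dir)
    patterns = {
        "checkpoint": "checkpoint",
        "data": "model-*.data-*",
        "index": "model-*.index",
        "meta": "model-*.meta",
    }
    if not run_dir.is_dir():
        # Nothing to scan: report every kind as absent.
        return {kind: [] for kind in patterns}
    return {
        kind: sorted(p.name for p in run_dir.glob(pattern))
        for kind, pattern in patterns.items()
    }


if __name__ == "__main__":
    # Path assumed from the post; adjust to your own run directory.
    for kind, names in find_checkpoint_files("checkpoint/run1").items():
        print(f"{kind}: {names if names else 'MISSING'}")
```

An empty list for every kind would match the situation described here: the run folder exists, but nothing was ever saved into it.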

Use standard file APIs to check for files with this prefix.
Loading dataset...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.27it/s]

dataset has 12863 tokens

Training...

[1 | 22.35] loss=3.69 avg=3.69
[2 | 40.40] loss=3.48 avg=3.58
[3 | 72.00] loss=3.34 avg=3.50
[4 | 91.34] loss=3.45 avg=3.49
[5 | 111.14] loss=3.32 avg=3.45
[6 | 130.68] loss=3.63 avg=3.48
[7 | 146.00] loss=3.35 avg=3.46
[8 | 164.12] loss=3.33 avg=3.45
[9 | 187.81] loss=3.44 avg=3.45
[10 | 212.46] loss=3.41 avg=3.44
[11 | 238.91] loss=3.35 avg=3.43
[12 | 265.70] loss=3.07 avg=3.40
[13 | 286.85] loss=3.36 avg=3.40
[14 | 309.50] loss=3.32 avg=3.39
[15 | 327.70] loss=3.26 avg=3.38
[16 | 344.01] loss=3.22 avg=3.37
[17 | 358.19] loss=3.41 avg=3.37
[18 | 371.93] loss=2.95 avg=3.35
[19 | 386.32] loss=3.19 avg=3.34
[20 | 400.90] loss=3.51 avg=3.35
[21 | 415.34] loss=3.06 avg=3.33
[22 | 430.17] loss=3.47 avg=3.34
[23 | 444.54] loss=3.06 avg=3.33

forrtl: error (200): program aborting due to control-C event

Image              PC                Routine   Line      Source
libifcoremd.dll    00007FFD7D033B58  Unknown   Unknown   Unknown
KERNELBASE.dll     00007FFDC9D6B443  Unknown   Unknown   Unknown
KERNEL32.DLL       00007FFDCC487034  Unknown   Unknown   Unknown
ntdll.dll          00007FFDCC5BD241  Unknown   Unknown   Unknown

What should I do?

Gurkiratsinghk, Feb 27 '21 22:02

I have downloaded and deleted the file 7 times

Gurkiratsinghk, Feb 27 '21 22:02

Assuming you're using the train.py from nshepperd's fork, try running it with --save_every N, where N is the number of steps between automatic saves (default 1000).

For example: python train.py --dataset data.npz --save_every 10
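
To see why the run above left nothing behind, it may help to work out when a periodic save would fire. The sketch below assumes the training loop saves whenever the step counter is a multiple of --save_every. That is an assumption about the script, not something confirmed in this thread, and save_steps is a hypothetical helper.

```python
def save_steps(last_step, save_every):
    """Steps (1-based) at which a periodic save would fire up to last_step."""
    return [step for step in range(1, last_step + 1) if step % save_every == 0]


# The training log in the original post stops at step 23.
print(save_steps(23, 1000))  # default interval -> []
print(save_steps(23, 10))    # suggested interval -> [10, 20]
```

Under that assumption, a run stopped at step 23 never reaches a save point with the default interval, while --save_every 10 would have written checkpoints at steps 10 and 20.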

jaimu97, Mar 04 '21 20:03

Traceback (most recent call last):
  File "interactive_conditional_samples.py", line 89, in <module>
    fire.Fire(interact_model)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 471, in _Fire
    target=component.__name__)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "interactive_conditional_samples.py", line 45, in interact_model
    enc = encoder.get_encoder(model_name)
  File "U:\gpt-2\gpt-2\encoder.py", line 110, in get_encoder
    encoder = json.load(f)
  File "H:\Anaconda\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "H:\Anaconda\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "H:\Anaconda\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "H:\Anaconda\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

A new error has popped up in its place.
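
"Expecting value: line 1 column 1 (char 0)" means the JSON parser found nothing it could parse at the very start of the input, which typically points to an empty file or one that does not begin with JSON at all. A rough way to check the file that encoder.py is loading is sketched below. diagnose_json is a hypothetical helper, and the path in the usage line is only a guess at where encoder.json lives, so adjust it to your setup.

```python
import json
from pathlib import Path


def diagnose_json(path):
    """Return a short description of why a JSON file may fail to load."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    raw = path.read_bytes()
    if not raw.strip():
        return "empty"
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        return f"invalid: {err}"
    return "ok"


if __name__ == "__main__":
    # Assumed location; the thread does not say where encoder.json was placed.
    print(diagnose_json("models/117M/encoder.json"))
```

If this reports "empty" or "invalid", the file itself is the likely culprit rather than the training command.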

Gurkiratsinghk, Mar 06 '21 03:03

Did you change the "data.npz" to point to where your dataset is? Or better yet, try running the same train.py command as in the original post and just add --save_every 10 to the end of that.

jaimu97, Mar 07 '21 19:03

Actually, I collected all the files in a single folder. When I run the command you are suggesting, it gives the JSON-related error I mentioned above.

Gurkiratsinghk, Mar 08 '21 03:03

Following on from this, I have a question: should the checkpoint folder contain one or more .ckpt files?

JXCrazy, Aug 04 '22 03:08