
some Rust error

nyck33 opened this issue:

(llmc-env) nyck33@lenovo-gtx1650:/mnt/d/ML/llm.c$ OMP_NUM_THREADS=8 ./train_gpt2
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356082 (took 11221.492729 ms)
step 1: train loss 4.300639 (took 10770.195235 ms)
step 2: train loss 4.623087 (took 10621.310078 ms)
step 3: train loss 4.599362 (took 10596.792720 ms)
step 4: train loss 4.616664 (took 10895.748351 ms)
step 5: train loss 4.231427 (took 10596.324310 ms)
step 6: train loss 3.753161 (took 10504.295265 ms)
step 7: train loss 3.650458 (took 11034.112917 ms)
step 8: train loss 4.182242 (took 10612.828309 ms)
step 9: train loss 4.199580 (took 10545.814677 ms)
val loss 4.426364
step 10: train loss 4.288661 (took 10601.000062 ms)
step 11: train loss 3.560642 (took 10510.856707 ms)
step 12: train loss 3.731437 (took 10538.586079 ms)
step 13: train loss 4.158511 (took 10577.684063 ms)
step 14: train loss 3.885633 (took 10703.429353 ms)
step 15: train loss 3.766486 (took 10607.367802 ms)
step 16: train loss 4.144007 (took 10580.083612 ms)     
step 17: train loss 3.961167 (took 10524.691744 ms)     
step 18: train loss 3.796044 (took 10518.218191 ms)     
step 19: train loss 3.371042 (took 10600.572562 ms)     
val loss 4.250554
generated: 50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26
step 20: train loss 3.882789 (took 11224.717460 ms)

Then I copy-paste the generated token IDs into this Python tiktoken script, decode.py:

import tiktoken
enc = tiktoken.get_encoding("gpt2")
ptok = lambda x: print(enc.decode(list(map(int, x.strip().split()))))  # decode a space-separated string of token IDs
ptok("50256 16773 18162 21986 11 198 13681 263 23875 198 3152 262 11773 2910 198 1169 6002 6386 2583 286 262 11858 198 20424 428 3135 7596 995 3675 13 198 40 481 407 736 17903 11 329 703 6029 706 4082 198 42826 1028 1128 633 263 11 198 10594 407 198 2704 454 680 1028 262 1027 28860 286 198 3237 323")  # token string from the README

ptok("50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26")  # token string copied from my run above

The first string is from the README, so it decodes fine, but the second one, copied from my terminal output above, throws:

(llmc-env) nyck33@lenovo-gtx1650:/mnt/d/ML/llm.c$ python decode.py
<|endoftext|>Come Running Away,
Greater conquer
With the Imperial blood
the heaviest host of the gods
into this wondrous world beyond.
I will not back thee, for how sweet after birth
Netflix against repounder,
will not
flourish against the earlocks of
Allay
thread '<unnamed>' panicked at src/lib.rs:201:64:
no entry found for key
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/mnt/d/ML/llm.c/decode.py", line 6, in <module>
    ptok("50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26")
  File "/mnt/d/ML/llm.c/decode.py", line 3, in <lambda>
    ptok = lambda x: print(enc.decode(list(map(int, x.strip().split()))))
  File "/home/nyck33/miniconda3/envs/llmc-env/lib/python3.9/site-packages/tiktoken/core.py", line 258, in decode
    return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
pyo3_runtime.PanicException: no entry found for key
  1. Why am I not getting the same tokens as the README?
  2. What is that Rust error actually telling me?
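Update: digging a little before anyone answers. A quick per-token check (a minimal sketch; n_vocab and decode_single_token_bytes are documented tiktoken APIs, but blaming the ID 60640 is my own reading of the log above, not confirmed) suggests my generated string contains a token ID outside GPT-2's 50257-entry vocabulary, which would explain the missing-key panic in the Rust core:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE: enc.n_vocab == 50257, so valid IDs are 0..50256

# the token string from my generation above
ids = [int(t) for t in "50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26".split()]

for t in ids:
    try:
        enc.decode_single_token_bytes(t)  # documented to raise KeyError for IDs not in the vocab
    except KeyError:
        print(f"{t} is not a valid GPT-2 token ID (n_vocab = {enc.n_vocab})")
# should flag only 60640, the one ID in the list that is >= 50257

# decoding just the in-range IDs goes through without the panic
print(enc.decode([t for t in ids if t < enc.n_vocab]))

If that's right, the panic is just tiktoken's decoder map having no entry for 60640, and the real question becomes why train_gpt2 sampled an ID outside the vocabulary in the first place.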

nyck33 · Apr 13 '24 06:04