Andrew Marble


How are you initializing your weights and cached states? I found that the numbers blew up for me like this when I started from a random initialization of the cached...
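For context, a minimal sketch of what zero-initializing the key/value caches might look like (array names and dimensions are placeholders, not taken from the actual code):

```fortran
program init_cache
  implicit none
  ! Placeholder dimensions for illustration only.
  integer, parameter :: head_dim = 64, n_heads = 8, seq_len = 256
  real, allocatable :: key_cache(:,:,:), value_cache(:,:,:)

  allocate(key_cache(head_dim, n_heads, seq_len))
  allocate(value_cache(head_dim, n_heads, seq_len))
  ! Zero the caches so positions that have not been written yet contribute
  ! nothing; random or uninitialized values here can make the attention
  ! numbers blow up.
  key_cache = 0.0
  value_cache = 0.0
end program init_cache
```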

I'm seeing this too; how do I work around it? Do I just have to skip the offending probe?

Thanks for taking a look. I changed it so the model can be specified on the command line now. With the 42M model I get about 67 tokens/sec on my...
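A rough sketch of reading the checkpoint path from the command line with the standard intrinsics (the default filename is just a placeholder):

```fortran
program cli_model
  implicit none
  character(len=256) :: model_path

  ! Take the checkpoint path from the first command-line argument,
  ! falling back to a default when none is given.
  if (command_argument_count() >= 1) then
     call get_command_argument(1, model_path)
  else
     model_path = "model.bin"
  end if
  print *, "loading model: ", trim(model_path)
end program cli_model
```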

With `gfortran llama2.f90 -o llm` I get 166 tokens/sec and with `gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm` I get 69 tokens/sec. Not sure what's going on. This is...

Current numbers (for the 110M model now, Ubuntu on my ThinkPad): llama2.f90 with `gfortran llama2.f90 -o llm -fexternal-blas -lblas`: 28 tok/s (this is the fastest I can get with the different...

OK, I added a handwritten matmul like in llama2.c. Now, unless I missed something, compiling with `gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm` gives me 38 tok/s on the...
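A minimal sketch of a handwritten matrix-vector product in the spirit of llama2.c's matmul; here the weight is stored as `w(n, d)` so each output element is a dot product over a contiguous column (Fortran is column-major), and the names and the OpenMP directive are assumptions rather than the actual llama2.f90 code:

```fortran
subroutine matvec(xout, x, w, n, d)
  implicit none
  integer, intent(in) :: n, d
  real, intent(in)  :: x(n), w(n, d)   ! column i of w holds the weights for xout(i)
  real, intent(out) :: xout(d)
  integer :: i, j
  real :: acc

  !$omp parallel do private(j, acc)
  do i = 1, d
     acc = 0.0
     do j = 1, n
        acc = acc + w(j, i) * x(j)
     end do
     xout(i) = acc
  end do
  !$omp end parallel do
end subroutine matvec
```

Compiled with `-fopenmp` the outer loop runs one output row per thread; without it the directives are just comments.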

I installed gfortran-10 and I get the same speed. Still going to investigate different compiler options and I will write a BLAS matmul. I had compiled llama2.c with gcc 9.4....
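One way the BLAS version might look, calling single-precision SGEMV directly with the same `w(n, d)` layout as the handwritten sketch above (link with `-lblas` or an optimized BLAS); this is a sketch, not the code the comment refers to:

```fortran
subroutine matvec_blas(xout, x, w, n, d)
  implicit none
  integer, intent(in) :: n, d
  real, intent(in)  :: x(n), w(n, d)
  real, intent(out) :: xout(d)
  external :: sgemv

  ! xout := 1.0 * w**T * x + 0.0 * xout
  call sgemv('T', n, d, 1.0, w, n, x, 1, 0.0, xout, 1)
end subroutine matvec_blas
```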

@certik FYI I was able to get more speedup by writing element-wise functions that compute the q,k,v projections together at the beginning of the transformer and the MLP+nonlinearity part at...
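A hedged sketch of what computing the three projections together could look like: each pass over x feeds q, k, and v at once instead of running three separate matvecs (for simplicity this assumes q, k, v all have dimension d; names and layouts are illustrative):

```fortran
subroutine qkv_fused(q, k, v, x, wq, wk, wv, n, d)
  implicit none
  integer, intent(in) :: n, d
  real, intent(in)  :: x(n), wq(n, d), wk(n, d), wv(n, d)
  real, intent(out) :: q(d), k(d), v(d)
  integer :: i, j
  real :: aq, ak, av

  !$omp parallel do private(j, aq, ak, av)
  do i = 1, d
     aq = 0.0; ak = 0.0; av = 0.0
     do j = 1, n
        ! x(j) is loaded once and reused for all three projections
        aq = aq + wq(j, i) * x(j)
        ak = ak + wk(j, i) * x(j)
        av = av + wv(j, i) * x(j)
     end do
     q(i) = aq; k(i) = ak; v(i) = av
  end do
  !$omp end parallel do
end subroutine qkv_fused
```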

Re above, I hadn't parallelized everything I could and now I can get e.g. 2.48 tok/s vs 2.82 tok/s for llama.cpp on a 3B model (+/-, those are the numbers...
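As an illustration of the kind of loop that was still left to parallelize, here is a sketch of one attention step with one OpenMP thread per head (variable names and layouts are assumptions, not taken from llama2.f90):

```fortran
subroutine attention_step(out, q, key_cache, value_cache, n_heads, head_dim, pos)
  implicit none
  integer, intent(in) :: n_heads, head_dim, pos
  real, intent(in)  :: q(head_dim, n_heads)
  real, intent(in)  :: key_cache(head_dim, n_heads, pos), value_cache(head_dim, n_heads, pos)
  real, intent(out) :: out(head_dim, n_heads)
  integer :: h, t
  real :: att(pos)

  !$omp parallel do private(t, att)
  do h = 1, n_heads
     ! attention scores of this head against every cached position
     do t = 1, pos
        att(t) = dot_product(q(:, h), key_cache(:, h, t)) / sqrt(real(head_dim))
     end do
     ! softmax over the positions
     att = exp(att - maxval(att))
     att = att / sum(att)
     ! weighted sum of the cached values
     out(:, h) = 0.0
     do t = 1, pos
        out(:, h) = out(:, h) + att(t) * value_cache(:, h, t)
     end do
  end do
  !$omp end parallel do
end subroutine attention_step
```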

A couple of notes: I had previously been using a lookup table for FP16->real(4) conversion, as it was faster than calling the C function I had been using. In my experiments,...
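A sketch of the lookup-table idea: decode all 65536 half-precision bit patterns once up front, then convert weights by indexing the table with the raw 16-bit value instead of calling a conversion function per element (module and routine names are made up for the example):

```fortran
module fp16_lut
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_positive_inf, ieee_quiet_nan
  implicit none
  real :: half_table(0:65535)
contains
  subroutine build_half_table()
    integer :: bits, s, e, m
    real :: val
    do bits = 0, 65535
       s = ibits(bits, 15, 1)   ! sign bit
       e = ibits(bits, 10, 5)   ! 5-bit exponent
       m = ibits(bits, 0, 10)   ! 10-bit mantissa
       if (e == 0) then
          val = real(m) * 2.0**(-24)                      ! zero / subnormal
       else if (e == 31) then
          if (m == 0) then
             val = ieee_value(1.0, ieee_positive_inf)     ! infinity
          else
             val = ieee_value(1.0, ieee_quiet_nan)        ! NaN
          end if
       else
          val = (1.0 + real(m) / 1024.0) * 2.0**(e - 15)  ! normal number
       end if
       if (s == 1) val = -val
       half_table(bits) = val
    end do
  end subroutine build_half_table

  elemental function half_to_real(h) result(r)
    integer(kind=2), intent(in) :: h
    real :: r
    ! map the signed 16-bit storage value onto the 0..65535 table index
    r = half_table(iand(int(h, kind=4), 65535))
  end function half_to_real
end module fp16_lut
```

With `build_half_table` called once at load time, `half_to_real` can then be applied elementwise to a whole `integer(2)` weight array.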