Andrej

Results: 373 comments by Andrej

I pushed the tokenizer I had in mind here: c165855. But I appreciate the work on this! Closing.

I think we'll want to break this up into chunks, a lot of really good stuff here.

Not an Issue, more a Discussion. And +1 to @chsasank: this is training code, not inference code, and not chat.

The formula is from the PaLM paper. It's been a while since I looked at this, but my initial reaction is that the additional *2 comes from relative positional embeddings...

I think it's because the first few iterations are super easy gains and after that they get much less easy. Also the dataset is super tiny, so it may be...

This looks like a really extended PR... I am certainly interested in the tiniest, most minimal thing that makes this compile on Windows.

Idea: We create a `doc/windows.md` where we document how to build on Windows, but it's just instructions (and maybe some code snippets) in Markdown. Because it looks very involved right...